Datadog is one of the most popular observability platforms today, and offers a rich set of capabilities including monitoring, tracing, log management, as well as machine learning (ML) features that help detect outliers. One of its most interesting feature sets falls under the Watchdog umbrella.
Watchdog automatically detects outliers in metrics or traces, so if your application is properly instrumented and you encounter one of the known problems, then it can quickly show you the root cause. Today the set of failures that can be auto-identified cover the following:
Datadog is working to add support for more failure modes in future updates, including CPU saturation and memory leaks.
While this capability is promising, there are some limitations. The first is that every part of your application stack will likely not be instrumented to collect all the right events and traces. The bigger limitation is that any cloud native application will encounter many new and unique problems on a regular basis, particularly software bugs. Most modern applications are composed of dozens of microservices, and experience changes (new versions) on a daily basis.
As a result, there will be a never-ending set of new failure modes. An approach that relies on matching fingerprints covering a small set of known failures will not keep up or help you troubleshoot the majority of new incidents. Our team has accumulated decades of experience building industry leading rules-based monitoring solutions (that identify root cause, not just symptoms), and know first-hand that the burden of building and maintaining rules to identify root cause is becoming increasingly impractical as applications become more complex and the rate of new updates grows faster.
Perhaps recognizing this limitation, Datadog recently introduced a new capability called Watchdog Insights Log Anomaly Detection, which applies log anomaly detection to automatically identify events of interest. Specifically, it tries to surface events or sources which account for an unusually high percentage of errors, to help an engineer troubleshoot a problem faster. This can work in two ways:
Watchdog Log Anomaly detection appears to be an intuitive feature, and feels like a natural extension to DataDog’s strong monitoring capabilities. Particularly when dealing with web-apps, it can quickly help you narrow down which endpoints, hosts (or other fields) are outliers in generating disproportionate errors and speed-up resolution of these problems.
While useful in certain scenarios, Watchdog Root Cause Analysis and Watchdog Log anomaly detection don’t quite get you to the ultimate nirvana – automated root cause analysis for any type of problem. Here are some key attributes of the ideal solution that are not covered:
Zebrium has spent many years studying hundreds of production incidents in order to understand and characterize how humans find root cause indicators in logs (details can be seen in this video). Our Root Cause as a Service (RCaaS) platform is modeled on these learnings.
RCaaS automatically finds the best possible root cause indicators in the logs for any kind of software or infrastructure problem. In simple terms, it finds the same log lines a human would have eventually found by manually digging through the logs, but it does this in-real time and at any scale. The results are delivered as a root cause report containing an English language summary (leveraging the GPT-3 language model), a Word Cloud that brings attention to important words in the log lines, and a small sequence of log lines (typically 10-60) that highlight the root cause and symptoms. RCaaS does not require pre-built rules or manual training.
A recent third-party study using 192 actual customer incidents from four product lines showed that Zebrium automatically found the correct root cause indicators in over 95% of incidents. Full details of this study can be found here: https://www.zebrium.com/cisco-validation.
The best part about RCaaS is that it works natively with Datadog and complements the entire Datadog feature set. Through either a native Datadog app widget or an events and metrics integration (see here), when a problem occurs, you will automatically see details of the root cause show up on any Datadog dashboard.
As mentioned above, we tested two failure scenarios using an open source microservices online shopping application (Sock Shop), and the CNCF open source Chaos tool called Litmus. We let both Datadog and RCaaS receive logs from the application for several hours before running the tests to break the application environment. Aside from this short “warm-up” period, this was a greenfield test - neither Zebrium’s RCaaS, nor Datadog’s Watchdog Log Anomaly Detection were customized or trained in ANY way whatsoever about this application, the chaos tool, or the specific failures introduced.
Test 1: Network Corruption using Litmus Chaos Engine
As soon as the Litmus Chaos Engine introduced network corruption in the Kubernetes environment, Datadog’s K8s dashboard showed the symptoms. And, just below on the same dashboard (using Zebrium’s integration with Datadog), the Zebrium RCaaS panel immediately showed a report that lined up perfectly with the drops in CPU and Network traffic. The report highlighted the correct root cause indicators and the natural language summary provided a strong clue as to the root cause. Note keywords such as “pod-network-corruption”, “misconfig” and “tx_queue_len” – all related to the failure mode.
By clicking the link on the Datadog dashboard, the full RCaaS report can be seen in the Zebrium UI. The report concisely shows the chaosengine starting the network corruption test and then a kernel message indicating a misconfig on eth0. This is followed by the symptoms (error in the orders and frontend services).
You can watch a 2-minute video of this here.
By comparison, Datadog’s Log Anomaly Detection picked out one event pattern in that time period – which turned out to be unrelated to the root cause or symptoms of the problem.
Test 2: Memory Starvation
In this scenario, we used the same Sock Shop application, but this time ran a script that filled up memory on the node. Once again, the Datadog dashboard immediately picked up the symptoms. Although this scenario could be root caused without using the logs (e.g. by a user carefully drilling down into metrics and traces for instance), for this test we were trying to assess if autonomous log based features could catch the problem without any training or manual drill-down.
Once again, on the Datadog dashboard, immediately under the symptoms, we saw a Zebrium RCaaS report that lined up perfectly with the spikes in memory. The report pointed at the root cause keywords such as “oom_test” and “kernel”.
The full root cause report in the Zebrium UI once again is near perfect, identifying the oom_killer event and oom_test events that make it clear this is a memory starvation problem induced by an application called “oom_test”.
In comparison, Datadog’s Log Anomaly Detection did not pick up the root cause log events – in this case it did not pick up any anomalies in the relevant time period.
Also, once again, the key log events that should have been picked up were received by Datadog – they just weren’t being identified as root cause indicators.
We see Zebrium’s Root Cause as a Service as a natural evolution from and complement to Datadog Watchdog Insights Log Anomaly Detection. Datadog’s Log Anomaly Detection is useful for specific types of anomalies, such as misbehaving endpoints in a webapp. Zebrium’s RCaaS is much wider in applicability, and doesn’t attempt to just find anomalies in the logs, it finds correlated patterns of anomalies that provide concise root cause reports with when a problem occurs. The Zebrium technology is relied upon by many industry leaders and has been third-party validated to correctly find the root cause in over 95% of incidents.
It’s natural to be skeptical about these claims, and so we encourage you try RCaaS in your own environment. It integrates natively with Datadog using either an events and metrics integration or a native Datadog app widget. In this way you can automatically see the root cause of an incident on any of your existing Datadog dashboards.
You can find details on Zebrium’s integration with Datadog here and can sign-up for a free trial or SaaS license in the Datadog Marketplace or on https://www.zebrium.com.