Zebrium was founded with the vision of automatically finding patterns in logs that explain the root cause of software problems. We are well on track to delivering on this vision: we have successfully identified the root cause in over 2,000 incidents across dozens of software stacks, and a study by one of our large customers validated that we do this with 95.8% accuracy (see How Cisco uses Zebrium ML to Analyze Logs for Root Cause).
Now we’re adding an important element to our vision: “…to find the root cause from any problem source and deliver it to wherever it is needed”. So, if an SRE streams logs from hundreds of applications and uses Datadog to monitor them, the root cause found by Zebrium should automatically appear in Datadog dashboards aligned with other metrics charts. And if a support team collects a bundle of static log files when a customer hits a problem and tracks the ticket in tool XYZ, then the root cause should automatically appear in XYZ. We call this “Root Cause as a Service”.
In practice, there are a few key attributes to Root Cause as a Service:
- Ingest data in the most convenient way: stream logs directly via open-source collectors, stream logs from your existing observability pipeline (e.g. a forked stream from your ELK Stack), or upload log files/bundles using APIs.
- Present the root cause where it is most intuitive and can best help a user resolve the issue.
  - Have a collection of log files related to a bug? Just upload them via CLI/APIs, and immediately get a report summarizing the log events that describe the root cause.
  - See a spike in your time series dashboard and don’t see an obvious root cause? No need to open your log manager console and start a potentially painful hunt. See the root cause automatically on the same dashboard, lined up with the spike in your chart.
  - Have an incident opened in PagerDuty that contains alerts and engineers’ notes? Get the ML-generated report automatically added to the notes, and even increment the Jira ticket if the problem fingerprint matches a known issue.
- Offer simple mechanisms to take alerts or events from other observability and incident management tools, and augment them with root cause details found by Zebrium’s ML.
- No customization, rules or training data sets needed. Developers, SREs and DevOps engineers don’t have time for tools that take days or weeks of training or customization to be useful. The RCaaS process just works and quickly (< 24 hours) achieves accuracy.
This 9-minute video describes how the ML achieves the above, and presents metrics from a large organization that corroborate the accuracy of the results it has seen when using the ML.
Here’s how all of this took shape.
We started off thinking we had to deliver root cause details as part of an observability platform in order to gain customer adoption. So, we built an aggregated log viewer, added basic log search / alerting capabilities and penciled in a roadmap to enrich these capabilities. But soon after entering the market, we realized being another observability platform with a broad feature set was actually an impediment. Most customers already had a log manager. And, if they didn’t, they always wanted features we hadn’t yet built. But more importantly, the pain point customers had was not that their log managers weren’t good enough; rather, the pain was specifically how hard it was to hunt for the root cause of problems. So we narrowed our focus to that pain point, which mapped exactly to what our core machine learning (ML) was built for: to automatically find the patterns in logs that indicate the root cause. And we stopped investing in observability and logging capabilities.
Fitting into a user’s workflow
Zebrium’s ML continuously scans log streams from any SaaS/software environment (it can do the same for static log files/bundles). The ML classifies all the log events seen in the log streams, detects anomalies/errors, correlates them across streams to pick out clusters that are unlikely to be “normal”, and summarizes them in natural language.
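To make the four stages above concrete, here is a deliberately simplified sketch. It is not Zebrium's algorithm: the masking rule, the "rare means anomalous" heuristic, the 60-second window, and every function and parameter name are assumptions chosen only to illustrate the shape of classify, detect, correlate and summarize.

```python
import re
from collections import Counter

def template(message):
    """Classify: reduce a message to a coarse event type by masking numbers."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<*>", message).strip()

def find_clusters(events, rare_threshold=1, window_seconds=60, min_streams=2):
    """events: list of (timestamp_seconds, stream_name, message) tuples."""
    counts = Counter(template(msg) for _, _, msg in events)
    # Detect: treat event types seen at most `rare_threshold` times as anomalies.
    anomalies = sorted(
        (ts, stream, template(msg))
        for ts, stream, msg in events
        if counts[template(msg)] <= rare_threshold
    )
    # Correlate: group anomalies separated by no more than `window_seconds`.
    clusters, current = [], []
    for item in anomalies:
        if current and item[0] - current[-1][0] > window_seconds:
            clusters.append(current)
            current = []
        current.append(item)
    if current:
        clusters.append(current)
    # Keep only clusters that span several streams.
    return [c for c in clusters if len({s for _, s, _ in c}) >= min_streams]

def summarize(cluster):
    """Summarize: a plain string stands in for a natural-language summary."""
    streams = sorted({s for _, s, _ in cluster})
    start, end = cluster[0][0], cluster[-1][0]
    return f"{len(cluster)} rare events across {', '.join(streams)} between t={start} and t={end}"

# Hypothetical sample data, invented for this sketch.
sample_events = [
    (0, "web", "GET /health 200"),
    (10, "web", "GET /health 200"),
    (20, "db", "checkpoint complete in 12 ms"),
    (30, "db", "checkpoint complete in 15 ms"),
    (100, "db", "disk full on volume 3"),
    (105, "web", "upstream timed out after 5000 ms"),
    (400, "web", "GET /health 200"),
]
clusters = find_clusters(sample_events)
for cluster in clusters:
    print(summarize(cluster))
```

In this toy data the routine health checks and checkpoints repeat, so only the disk-full and upstream-timeout events are flagged, and they form a single cluster because they occur close together in different streams.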
Gen 1: Simple Alerts
Our first attempt to make our ML fit into an engineering/SRE team’s workflow was to send alerts as we discovered these clusters, via Slack, email or webhooks. Some teams loved this approach, because it proactively detected and root-caused issues that were causing headaches.
But other users felt this was insufficient. For one thing, these alerts did not tie into their existing incident management workflows, whether those were in Slack or in tools like PagerDuty, OpsGenie and VictorOps. So it took some effort to distinguish between Zebrium RCAs that matched an incident already created by other observability tools and Zebrium incidents that surfaced new/unknown issues not detected by any other method.
Gen 2: Add RCA report into the timeline of an existing incident
Our next refinement was to add support for triggering our RCA engine via inbound signals from incident management (or other) tools, and responding with a well-structured root cause report payload that could be added into the timeline of an existing incident.
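The request/response flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Zebrium or PagerDuty API: the field names (`incident_id`, `triggered_at`, `detected_at`, `summary`), the 15-minute matching window and the `build_rca_report` function are all hypothetical.

```python
import json

def build_rca_report(signal, detections, window_seconds=900):
    """Match an inbound incident signal to detections that are close in time.

    signal: dict with 'incident_id' and 'triggered_at' (epoch seconds).
    detections: list of dicts with 'detected_at' (epoch seconds) and 'summary'.
    All field names are hypothetical and exist only for this sketch.
    """
    triggered_at = signal["triggered_at"]
    matched = [
        d for d in detections
        if abs(d["detected_at"] - triggered_at) <= window_seconds
    ]
    # Put the detection closest in time to the incident first.
    matched.sort(key=lambda d: abs(d["detected_at"] - triggered_at))
    return {
        "incident_id": signal["incident_id"],
        "status": "root_cause_found" if matched else "no_match",
        "reports": [
            {"detected_at": d["detected_at"], "summary": d["summary"]}
            for d in matched
        ],
    }

# Hypothetical sample data, invented for this sketch.
detections = [
    {"detected_at": 940, "summary": "disk full on db, followed by web timeouts"},
    {"detected_at": 5000, "summary": "certificate renewal failed"},
]
report = build_rca_report({"incident_id": "INC-1", "triggered_at": 1000}, detections)
print(json.dumps(report, indent=2))
```

Matching on a time window is only one possible design choice. Its appeal is that the caller only has to send an incident identifier and a timestamp, and the returned payload can be appended to the incident timeline as-is.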
This is what it looks like with PagerDuty (a deeper description can be found here):
Similar integrations soon followed for Slack, OpsGenie, VictorOps and other tools. This approach was quickly adopted by teams with mature “detection” capabilities in their observability stacks, and well-defined incident management processes.
But it still had a weakness. The actual troubleshooting (i.e. the work performed by engineers, SREs, DevOps and support teams) was mostly done outside of the incident management tool. In other words, the people doing the actual troubleshooting spent a lot of time in their monitoring, tracing and log management tools. So, for example, if an engineer saw a spike in a metrics or trace dashboard, it was hard to correlate that with the details that Zebrium had uncovered.
Now, imagine if the root cause just appeared wherever the engineer was looking. This led us to our most ambitious expansion of “Root Cause as a Service”.
Gen 3: Integrate directly into any observability tool
We are now quickly rolling out a slew of integrations that surface Zebrium’s detections right within the most commonly used observability tools, starting with AppDynamics, Datadog, Grafana, New Relic, Dynatrace and the Elastic Stack (and more on the way). See what we mean by looking at this 30-second video, or this blog.
The chart above shows what this looks like in Datadog. The first three metric charts show an obviously correlated dip in CPU usage and network traffic. But what caused it? See the red bar in the fourth chart (Zebrium Root Cause Finder)? This is where Zebrium’s machine learning detected a problem, and it perfectly lines up with the dips! Clicking the red bar shows the root cause:
This is what we mean by Root Cause as a Service: the root cause just shows up where you need it. No hunting, no searching, no flipping between tools!
Try this for yourself by visiting www.zebrium.com.