We're thrilled to announce that Zebrium has been acquired by ScienceLogic!


Zebrium RCaaS: A Natural Evolution From Datadog Watchdog Insights Log Anomaly Detection

June 16, 2022 | Ajay Singh

Datadog is one of the most popular observability platforms today, offering a rich set of capabilities including monitoring, tracing, log management and machine learning (ML) features that help detect outliers. One of its most interesting feature sets falls under the Watchdog umbrella.

Read More

Speeding Up Root Cause Analysis with New Relic

May 3, 2022 | Ajay Singh

If you are a New Relic user, you’re likely using New Relic to monitor your environment, detect problems, and troubleshoot them when they occur. But let’s consider exactly what that entails and describe a way to make this entire process much quicker.

Read More

Using the Elastic Stack (ELK) for Observability? Here’s How to Speed Up Troubleshooting

April 8, 2022 | Ajay Singh

When troubleshooting, the bottleneck isn’t the speed of the Elastic queries; it is the human not knowing exactly what to search for, the time it takes to visually spot outliers and the hunt for breadcrumbs that point to suspected problems. Read how ML can do all of this automatically.

 

The Elastic Stack (often called ELK) is one of the most popular observability platforms in use today. It lets you collect metrics, traces and logs and visualize them in one Kibana dashboard. You can set alerts for outliers, drill down into your dashboards and search through your logs. But there are limitations. What happens when the symptoms of a problem are obvious, say a big spike in latency or a sharp drop in network throughput, but the root cause is not? Usually that means an engineer needs to start scanning logs for unusual events and clusters of errors to understand what happened. The bottleneck isn’t the speed of the Elastic queries; it is the human not knowing exactly what to search for, the time it takes to visually spot outliers and the hunt for breadcrumbs that point to suspected problems.
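To make the manual bottleneck concrete, here is a minimal sketch of the kind of keyword hunt an engineer typically runs, using the official elasticsearch Python client (8.x). The index pattern, field names and keywords are illustrative assumptions, not a prescription:

    # A sketch of the manual log hunt described above. The index pattern
    # ("logs-*") and field names ("message", "@timestamp") are assumptions
    # and will differ across deployments.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Guess some keywords and search the window around the latency spike.
    resp = es.search(
        index="logs-*",
        query={
            "bool": {
                "must": [{"match": {"message": "error timeout refused"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-30m"}}}],
            }
        },
        size=100,
        sort=[{"@timestamp": "asc"}],
    )

    # Eyeball the hits, refine the keywords, and repeat. This iteration,
    # not query speed, is where the troubleshooting time goes.
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["@timestamp"], hit["_source"]["message"])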

Read More

Root Cause as a Service

March 22, 2022 | Ajay Singh

Now we’re adding an important element to our vision: “…to find the root cause from any problem source and deliver it to wherever it is needed”. So, if an SRE streams logs from hundreds of applications and uses Datadog to monitor them, the root cause found by Zebrium should automatically appear in Datadog dashboards aligned with other metrics charts.

Zebrium was founded with the vision of automatically finding patterns in logs that explain the root cause of software problems. We are well on track to deliver on this vision: we have successfully identified the root cause in over 2,000 incidents across dozens of software stacks, and a study by one of our large customers validated that we do this with 95.8% accuracy (see How Cisco Uses Zebrium ML to Analyze Logs for Root Cause).
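As a purely generic illustration of delivering root cause to wherever it is needed (this is not a description of Zebrium’s actual integration), an externally detected root-cause summary can be surfaced in Datadog through its public v1 Events API. The title, tags and summary text below are made up for the example:

    # Generic sketch: post a root-cause summary to Datadog's v1 Events API
    # so it appears in the event stream and can be overlaid on dashboards.
    # Illustrative only; it does not depict Zebrium's integration.
    import os
    import requests

    def post_root_cause_event(title, summary, tags):
        resp = requests.post(
            "https://api.datadoghq.com/api/v1/events",
            headers={
                "DD-API-KEY": os.environ["DD_API_KEY"],
                "Content-Type": "application/json",
            },
            json={
                "title": title,
                "text": summary,
                "tags": tags,
                "alert_type": "error",
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()

    post_root_cause_event(
        "Root cause report: checkout latency spike",
        "Correlated log anomalies point to DB connection pool exhaustion.",
        ["service:checkout", "source:root-cause-analysis"],
    )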

Read More

Visualizing Root Cause Summaries from Logs

October 26, 2021 | Ajay Singh

Zebrium’s focus is to simplify the jobs of SREs, developers and support engineers who routinely troubleshoot problems using logs. Read about some of the new ways we are making Root Cause Reports more intuitive.

 

Read More

Using ML to Help Engineering and Support Teams Analyze Log Files

October 5, 2021 | Ajay Singh


Zebrium’s technology finds the root cause of software problems by using machine learning (ML) to analyze logs. The majority of our customers stream their application and infrastructure logs to our platform for near real-time analysis. However, a new use case has emerged: using our ML to analyze a collection of static logs. This is particularly relevant for technical support teams who collect “bundles” of logs from their customers after a problem has occurred. The results we’re seeing are nothing short of spectacular!

Zebrium’s ML learns the normal patterns of all event types within the logs, and generates root cause reports when clusters of correlated anomalies and errors are detected. This is a great help for SREs, DevOps engineers and developers who are often under pressure to root cause problems as they happen.

However, there are many scenarios where you may not have a continuous stream of logs:

Read More

3 Ways ML Is a Game Changer for Your Incident Management Lifecycle

April 9, 2021 | Ajay Singh

Any developer, SRE or DevOps engineer responsible for an application with users has felt the pain of responding to a high priority incident. Read about 3 ways that ML can be a game changer in the incident management lifecycle.

Any developer, SRE or DevOps engineer responsible for an application with users has felt the pain of responding to a high priority incident. There’s the immediate stress of mitigating the issue as quickly as possible, often at odd hours and under severe time pressure. There’s the bigger challenge of identifying root cause so a durable fix can be put in place. There’s the aftermath of postmortems, reviews of your monitoring and observability solutions, and inevitable updates to alert rules. And there’s the typical frustration of wondering what could have been done to avoid the problem in the first place.

In a modern cloud native environment, the complexity of distributed applications and the pace of change make all of this ever harder. Fortunately, AI and ML technologies can help with these human-driven processes. Here are three specific ways:

Read More

Real World Examples of GPT-3 Plain Language Root Cause Summaries

March 23, 2021 | Ajay Singh

Zebrium’s unsupervised ML identifies the root cause of incidents and generates concise reports (typically 5 to 20 log events). Using GPT-3, these are distilled into simple plain language summaries. This blog presents some real examples of the effectiveness of this approach.

 

Read More

Lessons from Slack, GCP and Snowflake outages

February 4, 2021 | Ajay Singh

An outage at a market-leading SaaS company is always noteworthy. Thousands of organizations and millions of users are so reliant on these services that a widespread outage feels as surprising and disruptive as a regional power outage. The crux of the problem is that we expect SaaS services to innovate relentlessly. Although they employ some of the best engineers, sophisticated observability strategies and cutting-edge DevOps practices, SaaS companies also have to deal with ever-accelerating change and growing complexity.

 

 

Read More

Beyond Anomaly Detection – How Incident Recognition Drives Down MTTR

May 19, 2020 | Ajay Singh

The State of Monitoring

Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and spotting outliers in dashboards. While this traditional approach can typically catch failure modes with obvious service impacting symptoms, it has two limitations:

Read More

Anomaly Detection as a Foundation of Autonomous Monitoring

April 6, 2020 | Ajay Singh

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.

Read More

What Is an ML Detected Software Incident?

March 10, 2020 | Ajay Singh

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks. You can also try it yourself by signing up for a free account.

Read More

Getting anomaly detection right by structuring logs automatically

January 3, 2020 | Ajay Singh


Observability means being able to infer the internal state of your system through knowledge of external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). As to the last one, a growing chorus of voices strongly advocates for structuring log events upfront. Why? To pick a few reasons: without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard-to-maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
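As a minimal sketch of what structuring upfront can look like (standard library only; the field names, such as transaction_id, are illustrative assumptions rather than a recommended schema), each event can be emitted as one JSON object so it is queryable without regexes:

    # Minimal sketch: emit structured (JSON) log events using only the
    # Python standard library. Field names are illustrative assumptions.
    import json
    import logging

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            event = {
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
            # Carry through structured fields passed via extra=.
            event.update(getattr(record, "fields", {}))
            return json.dumps(event)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # One JSON object per event: easy to parse, filter, and correlate
    # with metrics by transaction_id.
    logger.info(
        "payment authorized",
        extra={"fields": {"transaction_id": "txn-123", "latency_ms": 87}},
    )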

Read More

Do your logs feel like a magic 8 ball?

December 17, 2019 | Ajay Singh

Logs are the source of truth when trying to uncover latent problems in a software system. But is searching logs to find the root cause the right approach?

Logs are the source of truth when trying to uncover latent problems in a software system. They are usually too messy and voluminous to analyze proactively, so they are used mostly for reactive troubleshooting once a problem is known to have occurred.

Read More

Using machine learning to shine a light inside the monitoring black box

October 24, 2019 | Ajay Singh

A prevalent application monitoring strategy today is sometimes described as “black box” monitoring, which focuses only on externally visible symptoms, including those that approximate the user experience. Black box monitoring is a good way to know when things are broken.

Read More

Using ML and logs to catch problems in a distributed Kubernetes deployment

October 3, 2019 | Ajay Singh

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in k8s environments. There’s usually a mix of home-grown, third-party and OSS components, so it takes more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools. Centralization also adds confusion by breaking up connected events (such as multi-line stack traces) in interleaved output from multiple sources.
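To illustrate the multi-line problem (a toy sketch under assumed formatting, not the behavior of any particular log shipper), re-grouping continuation lines into one event per container might look like this:

    # Toy sketch: fold multi-line events (e.g. stack traces) back together
    # in a centralized stream where lines from many containers interleave.
    # The "container<TAB>text" line format and the continuation heuristic
    # are assumptions made for the example.
    from collections import defaultdict

    def regroup(lines):
        open_events = defaultdict(list)
        for line in lines:
            container, _, text = line.partition("\t")
            continuation = text.startswith((" ", "\t", "Traceback"))
            if open_events[container] and continuation:
                open_events[container].append(text)
            else:
                if open_events[container]:
                    yield container, "\n".join(open_events[container])
                open_events[container] = [text]
        for container, event in open_events.items():
            if event:
                yield container, "\n".join(event)

    stream = [
        "api-7f9c\tERROR unhandled exception",
        "web-2d41\tGET /healthz 200",
        "api-7f9c\tTraceback (most recent call last):",
        'api-7f9c\t  File "app.py", line 42, in handle',
        "api-7f9c\t    resp = charge(order)",
    ]
    for container, event in regroup(stream):
        print(container, "|", event)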

Read More

Catching Faults Missed by APM and Monitoring tools

August 19, 2019 | Ajay Singh

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems – albeit with pre-defined instrumentation. But it’s hard to catch new (unknown) software problems. 


Read More
