3 Ways ML Is a Game Changer for Your Incident Management Lifecycle

April 9, 2021 | Ajay Singh

Any developer, SRE or DevOps engineer responsible for an application with users has felt the pain of responding to a high priority incident. There’s the immediate stress of mitigating the issue as quickly as possible, often at odd hours and under severe time pressure. There’s the bigger challenge of identifying root cause so a durable fix can be put in place. There’s the aftermath of postmortems, reviews of your monitoring and observability solutions, and inevitable updates to alert rules. And there’s the typical frustration of wondering what could have been done to avoid the problem in the first place.

In a modern cloud native environment, the complexity of distributed applications and the pace of change make all of this ever harder. Fortunately, AI and ML technologies can help with these human-driven processes. Here are three specific ways:

Read More

Real World Examples of GPT-3 Plain Language Root Cause Summaries

March 23, 2021 | Ajay Singh

Zebrium’s unsupervised ML identifies the root cause of incidents and generates concise reports (typically 5 to 20 log events). Using GPT-3, these are distilled into simple plain-language summaries. This blog presents some real examples of the effectiveness of this approach.
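
To make the mechanics concrete, here is a minimal sketch of the idea rather than Zebrium’s actual implementation: the handful of log events in a root cause report are assembled into a prompt and handed to a completion model. The call_gpt3 parameter is a hypothetical wrapper around whichever completion API you use, and the prompt wording and sample events are made up.

```python
# Sketch of the idea (not Zebrium's implementation): distill a short
# root cause report (a handful of log events) into a plain-language
# summary via a completion model. `call_gpt3` is a hypothetical wrapper
# around whichever completion API you use.

def build_summary_prompt(log_events):
    """Assemble a prompt asking the model to explain the incident simply."""
    header = "Explain in one or two plain-English sentences what went wrong:\n\n"
    return header + "\n".join(f"- {event}" for event in log_events)

def summarize_incident(log_events, call_gpt3):
    prompt = build_summary_prompt(log_events)
    # A low temperature and small token budget keep the summary short and literal.
    return call_gpt3(prompt, max_tokens=60, temperature=0.2)

# Example of what the prompt looks like for a small report:
events = [
    "ERROR db-pool: connection refused to postgres:5432",
    "WARN  checkout-svc: retries exhausted, failing request",
    "ERROR checkout-svc: returned HTTP 500 to client",
]
print(build_summary_prompt(events))
```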

Read More

Lessons from Slack, GCP and Snowflake outages

February 4, 2021 | Ajay Singh

An outage at a market-leading SaaS company is always noteworthy. Thousands of organizations and millions of users are so reliant on these services that a widespread outage feels as surprising and disruptive as a regional power outage. The crux of the problem is that we expect SaaS services to innovate relentlessly. Although they employ some of the best engineers and use sophisticated observability strategies and cutting-edge DevOps practices, SaaS companies also have to deal with ever-accelerating change and growing complexity.

Read More

This Slack App Speeds Up Incident Resolution Using ML

July 8, 2020 | Ajay Singh

Do you use Slack during incident management? This Slack app speeds up resolution by using ML to pull in the likely root cause, and it works no matter which incident detection tool you use.

If your team (like many others) uses Slack to collaborate during incident management and triage, this new Slack app will be a big time-saver.
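
For context, the basic mechanics of pushing a root cause summary into a Slack channel look like the sketch below, which uses a standard Slack incoming webhook. This is not the Zebrium Slack app itself; the webhook URL, summary text and report link are placeholders.

```python
# Sketch of posting a root-cause summary into a Slack channel via a
# standard incoming webhook. The URL, summary and link are placeholders;
# this is not the Zebrium Slack app itself, just the basic mechanics.
import json
import urllib.request

def post_root_cause(webhook_url, summary, report_link):
    payload = {
        "text": f":rotating_light: Likely root cause: {summary}\n<{report_link}|View full report>"
    }
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # Slack replies 200 with body "ok" on success

# post_root_cause("https://hooks.slack.com/services/T000/B000/XXXX",
#                 "database connection pool exhausted",
#                 "https://example.com/report/123")
```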

Read More

Beyond Anomaly Detection – How Incident Recognition Drives Down MTTR

May 19, 2020 | Ajay Singh

The State of Monitoring

Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and spotting outliers in dashboards. While this traditional approach can typically catch failure modes with obvious service-impacting symptoms, it has two limitations:

Read More

Anomaly Detection as a Foundation of Autonomous Monitoring

April 6, 2020 | Ajay Singh

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time-intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.
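
As a rough illustration of the direction (a toy heuristic, not Zebrium’s actual ML), one simple way to surface anomalous log patterns is to score each incoming line by how rarely its event type has been seen before:

```python
# Toy heuristic (not Zebrium's ML): score log lines by how rarely their
# "event type" has appeared in history. Never-seen types score near 1.0.
import re
from collections import Counter

def event_type(line):
    """Crude typing: collapse numbers and hex-ish tokens so the variable
    parts of a message don't split one event type into many."""
    return re.sub(r"[0-9a-fA-F]{4,}|\d+", "<n>", line)

def anomaly_scores(history_lines, new_lines):
    counts = Counter(event_type(l) for l in history_lines)
    total = sum(counts.values()) or 1
    return {line: 1.0 - counts[event_type(line)] / total for line in new_lines}

history = ["GET /health 200 12ms"] * 1000 + ["db connection ok id=42"] * 50
new = ["GET /health 200 9ms", "FATAL db connection refused id=77"]
for line, score in sorted(anomaly_scores(history, new).items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {line}")
```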

Read More

What Is an ML-Detected Software Incident?

March 10, 2020 | Ajay Singh

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks. You can also try it yourself by signing up for a free account.
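
One of those learnings can be sketched in a few lines (a toy grouping rule, not the multi-layer ML stack itself): anomalous events that occur close together in time, often across several services, tend to belong to the same incident. The timestamps, service names and gap threshold below are illustrative.

```python
# Toy sketch (not the actual multi-layer ML stack): cluster anomalous
# log events that occur close together in time into one candidate
# "incident", so responders see a small, correlated group of events.
from datetime import datetime, timedelta

def group_into_incidents(events, gap=timedelta(seconds=30)):
    """events: list of (timestamp, source, message) tuples."""
    incidents = []
    current = []
    for ts, source, msg in sorted(events):
        if current and ts - current[-1][0] > gap:
            incidents.append(current)   # gap too large: close the incident
            current = []
        current.append((ts, source, msg))
    if current:
        incidents.append(current)
    return incidents

t0 = datetime(2020, 3, 10, 4, 0, 0)
events = [
    (t0, "postgres", "FATAL: out of memory"),
    (t0 + timedelta(seconds=5), "api", "ERROR: upstream query failed"),
    (t0 + timedelta(seconds=8), "web", "ERROR: 500 on /checkout"),
    (t0 + timedelta(minutes=10), "cron", "WARN: slow backup"),
]
for i, inc in enumerate(group_into_incidents(events), 1):
    print(f"incident {i}: {[msg for _, _, msg in inc]}")
```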

Read More

Getting anomaly detection right by structuring logs automatically

January 3, 2020 | Ajay Singh

Observability means being able to infer the internal state of your system through knowledge of external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). As to the last one, a growing chorus of voices strongly advocates for structuring log events upfront. Why? Well, to pick a few reasons: without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard-to-maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
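
As a small illustration of what structuring upfront can look like, here is a minimal JSON formatter built on Python’s standard logging module that carries an explicit transaction_id field; the field names and IDs are illustrative, not a prescribed schema.

```python
# Minimal sketch of emitting structured (JSON) log events upfront so a
# field like transaction_id can be queried and correlated later, instead
# of being fished out of free text with regexes.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # fields passed via `extra=` land as attributes on the record
            "transaction_id": getattr(record, "transaction_id", None),
        }
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment authorized", extra={"transaction_id": "txn-481"})
log.error("payment capture failed", extra={"transaction_id": "txn-481"})
```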

Read More

Do your logs feel like a magic 8 ball?

December 17, 2019 | Ajay Singh

Logs are the source of truth when trying to uncover latent problems in a software system. But is searching logs to find the root cause the right approach?

Logs are the source of truth when trying to uncover latent problems in a software system. They are usually too messy and voluminous to analyze proactively, so they are used mostly for reactive troubleshooting once a problem is known to have occurred.

Read More

Using machine learning to shine a light inside the monitoring black box

October 24, 2019 | Ajay Singh

A widely used application monitoring strategy today is sometimes described as “black box” monitoring. Black box monitoring focuses just on externally visible symptoms, including those that approximate the user experience. It is a good way to know when things are broken.
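
For illustration, a black box check can be as simple as probing an external endpoint and judging only what a user would see, namely the HTTP status and response time. The URL, timeout and latency budget in this sketch are placeholders.

```python
# Minimal black-box check: probe only what a user would see for a
# placeholder endpoint, i.e. HTTP status and response time.
import time
import urllib.request

def probe(url, latency_budget_s=1.0, timeout_s=5.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            elapsed = time.monotonic() - start
            healthy = resp.status == 200 and elapsed <= latency_budget_s
            return healthy, resp.status, elapsed
    except Exception:
        # Connection errors, timeouts and HTTP errors all look the same
        # from the outside: the user could not get a good answer.
        return False, None, time.monotonic() - start

# healthy, status, elapsed = probe("https://example.com/health")
```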

Read More

Using ML and logs to catch problems in a distributed Kubernetes deployment

October 3, 2019 | Ajay Singh

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in k8s environments. There’s usually a mix of home-grown, 3rd-party and OSS components – taking more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools. Centralization also adds confusion by interleaving output from multiple sources and breaking up connected events (such as multi-line stack traces).
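
As a toy illustration of the multi-line problem, reassembling stack traces from an interleaved central stream means buffering lines per source. This sketch uses made-up pod names and a simple whitespace-continuation heuristic rather than a real parser.

```python
# Toy sketch: reassemble multi-line events (e.g. stack traces) from a
# centralized stream where lines from many pods are interleaved.
# Heuristic: continuation lines start with whitespace; pod names are made up.
from collections import defaultdict

def reassemble(stream):
    """stream: iterable of (pod, line). Yields (pod, full_event)."""
    buffers = defaultdict(list)
    for pod, line in stream:
        if line.startswith((" ", "\t")) and buffers[pod]:
            buffers[pod].append(line)          # continuation of previous event
        else:
            if buffers[pod]:
                yield pod, "\n".join(buffers[pod])
            buffers[pod] = [line]              # start of a new event
    for pod, buf in buffers.items():           # flush whatever is left
        if buf:
            yield pod, "\n".join(buf)

stream = [
    ("api-7f9c", "ERROR unhandled exception"),
    ("web-2b41", "INFO request complete"),
    ("api-7f9c", "  Traceback (most recent call last):"),
    ("api-7f9c", '    File "app.py", line 42, in handle'),
    ("web-2b41", "INFO request complete"),
]
for pod, event in reassemble(stream):
    print(f"--- {pod} ---\n{event}")
```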

Read More

Catching Faults Missed by APM and Monitoring Tools

August 19, 2019 | Ajay Singh

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems – albeit with pre-defined instrumentation. But it’s hard to catch new (unknown) software problems. 

A quick, free and easy way to find anomalies in your logs

Read More
