On-premises edition now available

Learn More

Zebrium Blog

How to Try Zebrium ML-based RCA Using a Realistic Demo App

October 12, 2021 | Gavin Cohen

Zebrium uses machine learning on logs to automatically find the root cause of software problems. The best way to see it in action is with an application that is experiencing a failure. This blog shows you how to spin up a realistic demo app called Sock Shop, break the app using a chaos tool, and then see how Zebrium automatically finds the root cause. 

The best way to try Zebrium's machine learning is with an application that is experiencing a failure. This blog shows you how to spin up a single node Kubernetes cluster using minikube, install a realistic demo app (Sock Shop), break the app using CNCF's Litmus Chaos engineering tool, and then see how Zebrium automatically finds the root cause of the problem. 

Read More

Using ML to Help Engineering and Support Teams Analyze Log Files

October 5, 2021 | Ajay Singh

A new use case has emerged: using our ML to analyze a collection of static logs. This is particularly relevant for technical support teams who collect “bundles” of logs from their customers after a problem has occurred. The results we’re seeing are nothing short of spectacular!


Zebrium’s technology finds the root cause of software problems by using machine learning (ML) to analyze logs. The majority of our customers stream their application and infrastructure logs to our platform for near real-time analysis. However, a new use case has emerged: using our ML to analyze a collection of static logs. This is particularly relevant for technical support teams who collect “bundles” of logs from their customers after a problem has occurred. The results we’re seeing are nothing short of spectacular!

Zebrium’s ML learns the normal patterns of all event types within the logs, and generates root cause reports when clusters of correlated anomalies and errors are detected. This is a great help for SREs, DevOps engineers and developers who are often under pressure to root cause problems as they happen.

However, there are many scenarios where you may not have a continuous stream of logs:

Read More

Log Anomaly Detection Using Machine Learning

June 21, 2021 | Larry Lancaster
At Zebrium, we have a saying: “Structure First”. We talk a lot about structuring because it allows us to do amazing things with log data. But most people don’t know what we mean when we say the “structure”, or why it is a necessity for accurate log anomaly detection.

At Zebrium, we have a saying: “Structure First”. We talk a lot about structuring because it allows us to do amazing things with log data. But most people don’t know what we mean when we say the “structure”, or why it is a necessity for accurate log anomaly detection.

Read More

Elasticsearch Machine Learning -An Improved Approach Using Correlated Anomaly Detection To Find Root Cause

June 2, 2021 | Gavin Cohen

Native machine learning for ElasticSearch was first introduced as an Elastic Stack (ELK Stack) feature in 2017. It came from Elastic's acquisition of Prelert, and was designed for anomaly detection in time series metrics data. The Elastic ML technology has since evolved to include anomaly detection for log data. So why is a new approach needed for Elastic Stack machine learning?

Native machine learning for ElasticSearch was first introduced as an Elastic Stack (ELK Stack) feature in 2017. It came from Elastic's acquisition of Prelert, and was designed for anomaly detection in time series metrics data. The Elastic ML technology has since evolved to include anomaly detection for log data. So why is a new approach needed for Elastic Stack machine learning?

Read More

Log Analysis with Machine Learning: An Automated Approach to Analyzing Logs Using ML/AI

May 20, 2021 | David Gildeh

When a new/unknown software problem occurs, chances are an SRE or developer will start by analyzing and searching through logs for root cause - a slow and painful process. So it's no wonder using machine learning (ML) for log analysis is getting a lot of attention. 

When a new/unknown software problem occurs, chances are an SRE or developer will start by analyzing and searching through logs for root cause - a slow and painful process. So it's no wonder using machine learning (ML) for log analysis is getting a lot of attention. 

 

Read More

What if RCA was done for you in Opsgenie?

May 3, 2021 | Gavin Cohen

We all know the drill. Sun, warm water, tranquility, silence, zzzz... Then your phone blares and buzzes, violently waking you from sleep. It’s dark and you quickly leave the dream behind. With blurry eyes you read, “AppDynamics Alert: shopping_cart_checkout”. Damn, that’s the service that was upgraded this afternoon.

We all know the drill. Sun, warm water, tranquility, silence, zzzz... Then your phone blares and buzzes, violently waking you from sleep. It’s dark and you quickly leave the dream behind. With blurry eyes you read, “AppDynamics Alert: latency threshold exceeded on page: shopping_cart_checkout”. Damn, that’s the

Read More

3 ways ML is a Game Changer for your Incident Management Lifecycle

April 9, 2021 | Ajay Singh

Any developer, SRE or DevOps engineer responsible for an application with users has felt the pain of responding to a high priority incident. Read about 3 ways that ML can be a game changer in the incident management lifecycle.

Any developer, SRE or DevOps engineer responsible for an application with users has felt the pain of responding to a high priority incident. There’s the immediate stress of mitigating the issue as quickly as possible, often at odd hours and under severe time pressure. There’s the bigger challenge of identifying root cause so a durable fix can be put in place. There’s the aftermath of postmortems, reviews of your monitoring and observability solutions, and inevitable updates to alert rules. And there’s the typical frustration of wondering what could have been done to avoid the problem in the first place.

In a modern cloud native environment, the complexity of distributed applications and the pace of change make all of this ever harder. Fortunately, AI and ML technologies can help with these human-driven processes. Here are three specific ways:

Read More

Real World Examples of GPT-3 Plain Language Root Cause Summaries

March 23, 2021 | Ajay Singh

Zebrium’s unsupervised ML identifies the root cause of incidents and generates concise reports (typically between 5-20 log events). Using GPT-3, these are distilled into simple plain language summaries. This blog presents some real examples of the effectiveness of this approach.

 

Read More

The Root Cause Experience

February 22, 2021 | Roy Selig

More than two years ago, the Zebrium UI team embarked on designing an experience around our Machine Learning (ML) engine, which was built to find the root cause of critical issues in logs so users wouldn't have to ... and the root cause experience was born!

More than two years ago, the Zebrium UI team embarked on designing an experience around our Machine Learning (ML) engine, which was built to find the root cause of critical issues in logs so users wouldn't have to. We knew our approach could cut Mean Time to Resolution (MTTR) from hours to minutes. But the idea of ML was foreign to users coming from the logging world, people who often relied on clever manual searching to do their jobs. To make them comfortable, our UI was built as a log viewer with ML features added around it.  We had good success with that approach. But the utility of our UI was often measured by the quality of our search capability, not the quality of our ML. It wasn't until we focused the UI on what we do best, root cause detection through ML, that our users fully understood our value proposition... and the root cause experience was born!

Read More

Lessons from Slack, GCP and Snowflake outages

February 4, 2021 | Ajay Singh

An outage at a market leading SaaS company is always noteworthy. Thousands of organizations and millions of users are so reliant on these services that a widespread outage feels as surprising and disruptive as a regional power outage. The crux of the problem is we expect SaaS services to innovate relentlessly. Although they employ some of the best engineers, sophisticated observability strategies and cutting-edge DevOps practices, SaaS companies also have to deal with ever accelerating change and growing complexity.

 

 

Read More

Using GPT-3 for plain language incident root cause from logs

January 9, 2021 | Larry Lancaster

This project is a favorite of mine and so I wanted to share a glimpse of what we've been up to with OpenAI's amazing GPT-3 language model. Today I'll be sharing a couple of straightforward results.

 

Plain Language root cause summaries. Try it for free!
GET STARTED FREE

 

This project is a favorite of mine and so I wanted to share a glimpse of what we've been up to with OpenAI's amazing GPT-3 language model. Today I'll be sharing a couple of straightforward results. There are more advanced avenues we're exploring for our use of GPT-3, such as fine-tuning (custom pre-training for specific datasets); you'll hear none of that today, but if you're interested in this topic, follow this blog for updates.

 

You can also see some real-world results from our customer base here.

Read More

Try ML-driven RCA using a cloud-native microservices demo app

December 16, 2020 | Gavin Cohen

There is no better way to try Zebrium ML incident and root cause detection than with a production application that is experiencing a problem. The machine learning will not only detect the problem, but also show its root cause. But no user wants to induce a problem in their app just to experience the magic of our technology! So, although it's second best, an alternative is to try Zebrium with a sample real-life application, break the app and then see what Zebrium detects.

*** A new version of this blog that uses a more realistic way to inject an error can be found here. ***

 

 

There is no better way to try ML-driven root cause analysis than with a production application that is experiencing a problem. The machine learning will not only detect the problem, but also show its root cause. But no user wants to induce a problem in their app just to experience the magic of our technology! So, although it's second best, an alternative is to try Zebrium with a sample real-life application, break the app and then see what Zebrium detects. One of our customers kindly introduced us to Google's microservices demo app - Online Boutique

 

Read More

ZELK vs ELK: Zebrium vs Elastic Machine Learning

October 25, 2020 | Gavin Cohen

We often get asked how Zebrium ZELK Stack machine learning (ML) compares to native ML for Elasticsearch. The easiest way to answer this is to see the two technologies side by side. No manual training, rules or special configuration were used for either ZELK or ELK.

We often get asked how Zebrium ZELK Stack machine learning (ML) compares to native ML for Elasticsearch. The easiest way to answer this is to see the two technologies side by side. This short (3 minute) video demonstrates what each solution is able to uncover from the exact same log data. No manual training, rules or special configuration were used for either ZELK or ELK.

Read More

Zebrium Named a 2020 Gartner Cool Vendor

October 22, 2020 | Gavin Cohen

Zebrium has been recognized by Gartner as one of four vendors in the report, "Cool Vendors in Performance Analysis", by Padraig Byrne, Federico De Silva, Pankaj Prasad, Venkat Rayapudi & Gregg Siegfried, October 5 2020. 

The past three months has seen Zebrium reach several major milestones! We moved from beta to production and our platform is now in use by industry leading customers who rely on Zebrium to keep their production applications running. We were named in the Forbes AI50 list as one of "America’s Most Promising Artificial Intelligence Companies". We were written up in DZone as one of the "7 Best Log Management Tools for Kubernetes". We added the capability to augment other logging and monitoring tools, and we recently released ZELK Stack - software incident and root cause detection for Elastic Stack (ELK Stack).

Read More

Virtual tracing: A simpler alternative to distributed tracing for troubleshooting

July 21, 2020 | Larry Lancaster

Distributed tracing is commonly used in Application Performance Monitoring (APM) to monitor and manage application performance, giving a view into what parts of a transaction call chain are slowest. It is a powerful tool for monitoring call completion times and examining particular requests and transactions.

The promise of tracing

Distributed tracing is commonly used in Application Performance Monitoring (APM) to monitor and manage application performance, giving a view into what parts of a transaction call chain are slowest. It is a powerful tool for monitoring call completion times and examining particular requests and transactions.

Read More

Zebrium can Augment PagerDuty Incidents

July 17, 2020 | Rod Bagg

PagerDuty is a leader in Incident Response and on-call Escalation Management. There are over 300 integrations for PagerDuty to analyze digital signals from virtually any software-enabled system to detect and pinpoint issues across your ecosystem. When an Incident is created, PagerDuty will mobilize the right team in seconds (read: this is when you get the "page" during your daughter's 5th birthday party).

I wanted to give you an update on my last blog on MTTR by showing you our PagerDuty Integration in action.

As I said before, you probably care a lot about Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). You're also no doubt familiar with monitoring, incident response, war rooms and the like. Who of us hasn't been ripped out of bed or torn away from family or friends at the most inopportune times? I know firsthand from running world-class support and SRE organizations that it all boils down to two simple things: 1) all software systems have bugs and 2) It's all about how you respond. While some customers may be sympathetic to #1, without exception, all of them still expect early detection, acknowledgment of the issue, and near-immediate resolution. Oh, and it better not ever happen again!

Read More

This Slack App Speeds-up Incident Resolution Using ML

July 8, 2020 | Ajay Singh

Do you use Slack During Incident Management? This Slack app speeds up resolution by using ML to pull in the likely root cause. This can be used no matter which incident detection tool you use.

If your team (like many others) uses Slack to collaborate during incident management and triage, this new Slack app will be a big-time saver.

Read More

You've Nailed Incident detection, what about Incident Resolution?

June 3, 2020 | Rod Bagg

You probably care a lot about Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). You're also no doubt familiar with monitoring, incident response, war rooms and the like. Imagine if you could automatically augment an incident detected by any monitoring, APM or tracing tool with details of root cause?

If you're reading this, you probably care a lot about Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). You're also no doubt familiar with monitoring, incident response, war rooms and the like. Who of us hasn't been ripped out of bed or torn away from family or friends at the most inopportune times? I know firsthand from running world-class support and SRE organizations that it all boils down to two simple things: 1) all software systems have bugs and 2) It's all about how you respond. While some customers may be sympathetic to #1, without exception, all of them still expect early detection, acknowledgement of the issue and near-immediate resolution. Oh, and it better not ever happen again!

Read More

Is Log Management Still the Best Approach?

May 29, 2020 | Gavin Cohen

Part of our product does what most log managers do: aggregates logs, makes them searchable, allows filtering, provides easy navigation and lets you build alert rules. So why write this blog? Because in today’s cloud native world, while useful, log managers can be a time sink when it comes to detecting and tracking down the root cause of software incidents.

Disclosure – I work for Zebrium. Part of our product does what most log managers do: aggregates logs, makes them searchable, allows filtering, provides easy navigation and lets you build alert rules. So why write this blog? Because in today’s cloud native world (microservices, Kubernetes, distributed apps, rapid deployment, testing in production, etc.) while useful, log managers can be a time sink when it comes to detecting and tracking down the root cause of software incidents.

Read More

Beyond Anomaly Detection – How Incident Recognition Drives down MTTR

May 19, 2020 | Ajay Singh

Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and spotting outliers in dashboards. While this traditional approach can typically catch failure modes obvious service impacting symptoms, it has several limitations.

The State of Monitoring

Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and spotting outliers in dashboards. While this traditional approach can typically catch failure modes with obvious service impacting symptoms, it has two limitations:

Read More

Busting the Browser's Cache

A new release of your web service has just rolled out with some awesome new features and countless bug fixes. A few days later and you get a call: Why am I not seeing my what-ch-ma-call-it on my thing-a-ma-gig? After setting up that zoom call it is clear that the browser has cached old code, so you ask the person to hard reload the page with Ctrl-F5. Unless its a Mac in which case you need Command-Shift-R. And with IE you have to click on Refresh with Shift. You need to do this on the other page as well. Meet the browser cache, the bane of web service developers!

 

Read More

Is Autonomous monitoring the anomaly detection you actually wanted?

April 15, 2020 | Larry Lancaster

Automatically Spot Critical Incidents and Show Me Root Cause

That's what I wanted from a tool when I first heard of anomaly detection. I wanted it to do this based only on the logs and metrics it ingests, and alert me right away, with all this context baked in...

Automatically Spot Critical Incidents and Show Me Root Cause

Read More

Anomaly Detection as a foundation of Autonomous Monitoring

April 6, 2020 | Ajay Singh

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. 

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.

Read More

A Prometheus fork for cloud scale anomaly detection across metrics & logs

March 23, 2020 | Anil Nanduri

At Zebrium, we provide an Autonomous Monitoring service that automatically detects anomalies within logs and metrics. We started by correlating anomalies across log streams to automatically raise incidents that require our user's attention. Now, we have taken the step to augment our incident detection by detecting anomalies within a group of related metrics and correlate those with anomalies found in logs.

Introduction

At Zebrium, we provide an Autonomous Monitoring service that automatically detects anomalies within logs and metrics. We started by correlating anomalies across log streams to automatically raise incidents that require our user's attention. Now, we have taken the step to augment our incident detection by detecting anomalies within a group of related metrics and correlate those with anomalies found in logs.

Read More

What Is an ML Detected Software Incident?

March 10, 2020 | Ajay Singh

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks.

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks. You can also try it yourself by signing up for a free account.

Read More

Using Autonomous Monitoring with Litmus Chaos Engine on Kubernetes

March 6, 2020 | David Gildeh

A few months ago, our friends at Maya Data joined our private Beta to give our Autonomous Log Monitoring platform a test run. During the test, they used their newly created Litmus Chaos Engine to generate issues in their Kubernetes cluster, and we managed to detect all of them successfully using our Machine Learning completely unsupervised. Needless to say, they were impressed!

Read More

Single Sign-On with OAuth

March 6, 2020 | Alan Jones

There is a lot of material on OAuth (OAuth2, OpenID Connect, and friends) but not much focused on the application of OAuth to the problem of Single Sign-On (SSO). This post seeks to do so.

Read More

The Future of Monitoring is Autonomous

March 1, 2020 | David Gildeh

Monitoring today is extremely human driven. The only thing we’ve automated with monitoring to date is the ability to watch for metrics and events that send us alerts when something goes wrong. Everything else: deploying collectors, building parsing rules, configuring dashboards and alerts, and troubleshooting and resolving incidents, requires a lot of manual effort from expert operators that intuitively know and understand the system being monitored.

TL;DR

Monitoring today puts far too much burden on DevOps and developers. These teams spend countless hours staring at dashboards, hunting through logs, and maintaining fragile alert rules. Fortunately, unsupervised machine learning can be applied to logs and metrics to autonomously detect and find the root cause of critical incidents. Read more below, or start using our Autonomous Monitoring Platform for free - it takes less than 2 minutes to get started.  

Introduction

Monitoring today is extremely human driven. The only thing we’ve automated with monitoring to date is the ability to alert on rules that watch for specific metrics and events that occur when something known goes wrong. Everything else - building parsing rules, configuring and maintaining dashboards and alerts, and troubleshooting incidents - requires a lot of manual effort from expert operators that intuitively know and understand the system being monitored.

Read More

Designing a RESTful API Framework

February 6, 2020 | Alan Jones

As the principal responsible for the design of middleware software at Zebrium, I’m writing to share some of the choices we made and how they have held up. Middleware in this context means the business-logic that sits between persistent storage and a web-based user interface.

As the principal responsible for the design of middleware software at Zebrium, I’m writing to share some of the choices we made and how they have held up. Middleware in this context means the business-logic that sits between persistent storage and a web-based user interface.

Read More

How Fluentd collects Kubernetes metadata

January 30, 2020 | Brady Zuo

As part of my job, I recently had to modify Fluentd to be able to stream logs to our (Zebrium) Autonomous Log Monitoring platform. In order to do this, I needed to first understand how Fluentd collected Kubernetes metadata. I thought that what I learned might be useful/interesting to others and so decided to write this blog.

As part of my job, I recently had to modify Fluentd to be able to stream logs to our (Zebrium) Autonomous Log Monitoring platform. In order to do this, I needed to first understand how Fluentd collected Kubernetes metadata. I thought that what I learned might be useful/interesting to others and so decided to write this blog.

Read More

FREE SIGN-UP