Zebrium Blog

Goodbye, Anomaly Detection - Hello, Incident Detection

The eyeball has become the bottleneck in monitoring. New kinds of incidents ("Unknown Unknowns") require a person to dig in and identify them. With application complexity continuing to grow, we fight a losing battle for resolution time. Autonomous Monitoring is the only way out: software that autonomously detects and then summarizes an incident, with root cause indicators, and presents that information to a human.

Read More

Anomaly Detection as a foundation of Autonomous Monitoring

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.

Read More

A Prometheus fork for cloud scale anomaly detection across metrics & logs

At Zebrium, we provide an Autonomous Monitoring service that automatically detects anomalies within logs and metrics. We started by correlating anomalies across log streams to automatically raise incidents that require our user's attention. Now, we have taken the step to augment our incident detection by detecting anomalies within a group of related metrics and correlate those with anomalies found in logs.
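
As a rough sketch of the correlation idea (not our actual algorithm), a candidate incident can be flagged whenever anomalies from a metrics detector and a log detector land within the same short time window; the timestamps, window size and function below are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical anomaly timestamps produced by two separate detectors:
# one watching a group of related metrics, one watching log streams.
metric_anomalies = [datetime(2020, 5, 1, 10, 2, 5), datetime(2020, 5, 1, 14, 30, 0)]
log_anomalies = [datetime(2020, 5, 1, 10, 2, 40), datetime(2020, 5, 1, 18, 0, 0)]

def correlated_incidents(metric_times, log_times, window=timedelta(minutes=2)):
    """Pair up metric and log anomalies that fall within the same time window."""
    return [(m, l) for m in metric_times for l in log_times if abs(m - l) <= window]

# Only the 10:02 anomalies line up, so only one candidate incident is raised.
print(correlated_incidents(metric_anomalies, log_anomalies))
```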

Read More

What Is an ML Detected Software Incident?

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks. You can also try it yourself by signing up for a free account.

Read More

Using Autonomous Monitoring with Litmus Chaos Engine on Kubernetes

A few months ago, our friends at MayaData joined our private beta to give our Autonomous Log Monitoring platform a test run. During the test, they used their newly created Litmus Chaos Engine to inject failures into their Kubernetes cluster, and our machine learning detected all of them, completely unsupervised. Needless to say, they were impressed!

Read More

The Future of Monitoring is Autonomous

TL;DR

Monitoring today puts far too much burden on DevOps, requiring too much time and skill to build and maintain complex dashboards and fragile alert rules. Fortunately, unassisted machine learning can autonomously detect critical incidents and find their root cause. Please read about it below, or try our free Autonomous Monitoring Platform now and start auto-detecting incidents in less than 5 minutes.

The Longer Version

Monitoring today is extremely human driven. The only thing we’ve really automated so far is watching metrics and events and sending alerts when something goes wrong. Everything else - deploying collectors, building parsing rules, configuring dashboards and alerts, and troubleshooting and resolving incidents - requires a lot of manual effort from expert operators who intuitively know and understand the system being monitored.

Read More

Designing a RESTful API Framework

As the person principally responsible for the design of middleware software at Zebrium, I’m writing to share some of the choices we made and how they have held up. Middleware in this context means the business logic that sits between persistent storage and a web-based user interface.

Read More

How Fluentd collects Kubernetes metadata

As part of my job, I recently had to modify Fluentd to be able to stream logs to our (Zebrium) Autonomous Log Monitoring platform. In order to do this, I needed to first understand how Fluentd collected Kubernetes metadata. I thought that what I learned might be useful/interesting to others and so decided to write this blog.
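
For a flavor of what that metadata collection involves: on a Kubernetes node, container logs end up under /var/log/containers/ with the pod name, namespace and container name encoded in the file name, and the Fluentd kubernetes_metadata filter recovers those fields before enriching records further via the Kubernetes API. The toy parser below only illustrates that naming convention, not the plugin's actual code:

```python
import os
import re

# Kubernetes exposes container logs as
#   /var/log/containers/<pod>_<namespace>_<container>-<container_id>.log
# which is where pod, namespace and container name come from before labels
# and annotations are looked up through the Kubernetes API.
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)

def parse_container_log_path(path):
    """Extract basic Kubernetes metadata from a container log file path."""
    match = LOG_NAME.match(os.path.basename(path))
    return match.groupdict() if match else None

example = "/var/log/containers/web-5d9c7b6f4-abcde_default_nginx-" + "a" * 64 + ".log"
print(parse_container_log_path(example))
# {'pod': 'web-5d9c7b6f4-abcde', 'namespace': 'default', 'container': 'nginx', 'container_id': 'aaa...'}
```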

Read More

Part 1 - Machine learning for logs

In our last blog we discussed the need for Autonomous Monitoring solutions covering the three pillars of observability (metrics, traces and logs) to help developers and operations teams keep increasingly large and complex distributed applications up and running. At Zebrium we have started with logs (but stay tuned for more), because logs generally represent the most comprehensive source of truth during incidents and are widely used to search for the root cause.

Read More

Getting anomaly detection right by structuring logs automatically

Observability means being able to infer the internal state of your system through knowledge of external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). As to the last one, a growing chorus of voices strongly advocates for structuring log events upfront. Why? Well, to pick a few reasons - without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard to maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
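
As a small illustration of that last point (the field names and values are made up), once log events carry a structured transaction ID they can be joined directly with metrics instead of being hunted down with free-text searches:

```python
# Structured events and metrics keyed by a shared transaction ID
# (all names and values here are invented for illustration).
log_events = [
    {"ts": "10:15:32", "event": "charge_completed", "txn_id": "8f3a21"},
    {"ts": "10:15:40", "event": "charge_failed", "txn_id": "9b77c0"},
]
metrics = [
    {"txn_id": "8f3a21", "latency_ms": 412},
    {"txn_id": "9b77c0", "latency_ms": 9031},
]

# With structure, correlation is a simple join on txn_id rather than a
# free-text search through raw log lines.
latency_by_txn = {m["txn_id"]: m["latency_ms"] for m in metrics}
for event in log_events:
    print(event["event"], event["txn_id"], latency_by_txn.get(event["txn_id"]))
```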

Read More

Do your logs feel like a magic 8 ball?

Logs are the source of truth when trying to uncover latent problems in a software system. But they are usually too messy and voluminous to analyze proactively, so they are mostly used for reactive troubleshooting once a problem is already known to have occurred. Is searching logs to find the root cause really the right approach?

Read More

Using machine learning to detect anomalies in logs

At Zebrium, we have a saying: “Structure First”. We talk a lot about structuring because it allows us to do amazing things with log data. But most people don’t know what we mean when we say the word “structure”, or why it allows for amazing things like anomaly detection. This is a gentle and intuitive introduction to “structure” as we mean it.
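
A toy example of what we mean (deliberately simplified, and not our actual algorithm): the fixed words of a message form an event type, and the values that vary become parameters of that event type:

```python
lines = [
    "connection from 10.0.0.7 closed after 320 ms",
    "connection from 10.0.0.9 closed after 18 ms",
    "connection from 10.0.1.4 closed after 2750 ms",
]

def to_structure(line):
    """Split a message into a fixed template (the event type) and its variables."""
    template, params = [], []
    for token in line.split():
        # crude heuristic for the example: numeric-looking tokens are variables
        if token.replace(".", "").isdigit():
            template.append("*")
            params.append(token)
        else:
            template.append(token)
    return " ".join(template), params

for line in lines:
    print(to_structure(line))
# all three lines share one event type, 'connection from * closed after * ms',
# with the IP address and the duration as its two variable columns
```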

Read More

Autonomous log monitoring for Kubernetes

Kubernetes makes it easy to deploy, manage and scale large distributed applications. But what happens when something goes wrong with an app? And how do you even know? We hear variations on this all the time: “It was only when customers started complaining that we realized our service had degraded”, “A simple authentication problem stopped customers logging. It took six hours to resolve.”, and so on.

Read More

Using machine learning to shine a light inside the monitoring black box

A widely prevalent application monitoring strategy today is sometimes described as “black box” monitoring: it focuses only on externally visible symptoms, including those that approximate the user experience. That makes it a good way to know when things are broken.

Read More

The hidden complexity of hiding complexity

Kubernetes and other orchestration tools use abstraction to hide complexity. Deploying, managing and scaling a distributed application are made easy. But what happens when something goes wrong? And, when it does, do you even know?

Read More

Using ML and logs to catch problems in a distributed Kubernetes deployment

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in k8s environments. There’s usually a mix of home grown, 3rd party and OSS components – taking more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools. Centralization also adds confusion by breaking up connected events (such as multi-line stack traces) in interleaved outputs from multiple sources.
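
To make the stack-trace problem concrete, here is a simplified sketch (the sources and messages are invented) of the stitching a centralizer has to do: continuation lines carry no timestamp, so they must be re-attached to the right event from the right source:

```python
import re
from collections import defaultdict

# Interleaved lines from two pods; the stack-trace continuation lines carry
# no timestamp, so naive line-by-line ingestion scatters them.
stream = [
    ("pod-a", "2020-05-01T10:00:01Z ERROR unhandled exception"),
    ("pod-b", "2020-05-01T10:00:01Z INFO request served"),
    ("pod-a", '  File "app.py", line 42, in handle'),
    ("pod-a", "  KeyError: 'user_id'"),
]

STARTS_NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2}T")  # lines beginning with a timestamp

events = defaultdict(list)  # source -> list of multi-line events
for source, line in stream:
    if STARTS_NEW_EVENT.match(line) or not events[source]:
        events[source].append([line])     # a new event starts here
    else:
        events[source][-1].append(line)   # continuation of the previous event

for source, evs in events.items():
    for ev in evs:
        print(source, "->", " | ".join(ev))
```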

Read More

Catching Faults Missed by APM and Monitoring tools

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems – albeit with pre-defined instrumentation. But it’s hard to catch new (unknown) software problems. 

A quick, free and easy way to find anomalies in your logs

Read More

Deploying into Production: The need for a Red Light

As scale and complexity grow, there are diminishing returns from pre-deployment testing. A test writer cannot envision the combinatoric explosion of coincidences that yield calamity. We must accept that deploying into production is the only definitive test.

Read More

Using ML to auto-learn changing log structures

Software log messages are potential goldmines of information, but their lack of explicit structure makes them difficult to programmatically analyze. Tasks as common as accessing (or creating an alert on) a metric in a log message require carefully crafted regexes that can easily capture the wrong data by accident (or break silently because of changing log formats across software versions). But there’s an even bigger prize buried within logs – the possibility of using event patterns to learn what’s normal and what’s anomalous. 
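
A small example of the silent breakage described above (the log lines and the rule are invented for illustration): a hand-crafted regex extracts a latency metric until a new release changes the message format, at which point the metric simply vanishes without an error being raised anywhere:

```python
import re

# A hand-maintained extraction rule of the sort described above.
rule = re.compile(r"request completed in (\d+) ms")

old_release = "2020-05-01 10:00:01 INFO request completed in 412 ms"
new_release = "2020-05-01 10:00:01 INFO request completed in 0.412s"  # format changed

for line in (old_release, new_release):
    match = rule.search(line)
    # On the new format the rule matches nothing: the metric, and any alert
    # built on top of it, silently disappears.
    print(match.group(1) if match else "no match")
```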

Why understand log structure at all?

Read More

Please don't make me structure logs!

As either a developer or a member of a DevOps team, you have undoubtedly dealt with logs; probably lots and lots of messy logs. It's one of the first things we all look to when trying to get to the bottom of an issue and determine root cause.

Read More

Reliable signatures to detect known software faults

Have you ever spent time tracking down a bug or failure, only to find you’ve seen it before? Or a variation of this problem: at the end of an automated test run, you have to spend time triaging each failure, even though many are caused by the same bug. All of this hurts productivity, especially in continuous integration and continuous deployment (CI/CD) environments, where things change rapidly.
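
One simple way to picture a fault signature (an illustration only, not how Zebrium computes them): reduce each failure to the set of event types seen around it, so the same bug yields the same fingerprint in every run and only needs to be triaged once:

```python
import hashlib

def signature(event_types):
    """Fingerprint a failure by the (order-independent) set of event types seen."""
    canonical = "\n".join(sorted(set(event_types)))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Two test runs hitting the same underlying bug produce the same event types,
# in a different order and with different repetition counts.
run_1 = ["db_conn_refused", "retry_exhausted", "request_500"]
run_2 = ["request_500", "db_conn_refused", "retry_exhausted", "db_conn_refused"]

print(signature(run_1) == signature(run_2))  # True: one bug, one signature
```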

Read More

Perfectly structuring logs without parsing

Developers and testers constantly use log files and metrics to find and troubleshoot failures. But the lack of structure in that data makes it hard to extract useful information without data wrangling, regexes and parsing scripts.

Read More

Troubleshooting the easy way

It takes great skill, tenacity and sometimes blind luck to find the root cause of a technical issue. For complex problems, more often than not, it involves leveraging log files, metrics and traces. Whether you’re a tester triaging problems found during automated test, or a developer assisting with a critical escalation, dealing with that data is painful. Zebrium has created a better way!

Read More

Product analytics at your fingertips

According to Gartner, product analytics “help manufacturers evaluate product defects, identify opportunities for product improvements, detect patterns in usage or capacity of products, and link all these factors to customers". The benefits are clear. But there are barriers – product analytics are expensive in terms of time, people and expertise.

Read More

Structure is Strategic

We structure machine data at scale

Zebrium helps dev and test engineers find hidden issues in tests that “pass”, find root cause faster than ever, and validate builds with self-maintaining problem signatures. We ingest, structure, and auto-analyze machine data - logs, stats, and config - collected from test runs.

Read More