Zebrium Blog

Zebrium + Grafana = Awesome

June 19, 2020 | Larry Lancaster

Lots of people like to construct dashboards in Grafana for monitoring and alerting – it’s fast, sleek and practical. Zebrium is awesome for analytics in part because we lay everything down into tables in a scale-out MPP relational column store at ingest. Each event type gets its own table with typed columns for the parameters; metrics data is also tabled; ditto for anomalies and incidents.


You've Nailed Incident Detection, What About Incident Resolution?

June 3, 2020 | Rod Bagg

If you're reading this, you probably care a lot about Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). You're also no doubt familiar with monitoring, incident response, war rooms and the like. Which of us hasn't been ripped out of bed or torn away from family or friends at the most inopportune times? I know firsthand from running world-class support and SRE organizations that it all boils down to two simple things: 1) all software systems have bugs, and 2) it's all about how you respond. While some customers may be sympathetic to #1, without exception, all of them still expect early detection, acknowledgement of the issue and near-immediate resolution. Oh, and it better not ever happen again!


Is Log Management Still the Best Approach?

May 29, 2020 | Gavin Cohen

Disclosure – I work for Zebrium. Part of our product does what most log managers do: aggregates logs, makes them searchable, allows filtering, provides easy navigation and lets you build alert rules. So why write this blog? Because in today’s cloud native world (microservices, Kubernetes, distributed apps, rapid deployment, testing in production, etc.), log managers, while useful, can be a time sink when it comes to detecting and tracking down the root cause of software incidents.


Beyond Anomaly Detection – How Incident Recognition Drives down MTTR

May 19, 2020 | Ajay Singh

The State of Monitoring

Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and spotting outliers in dashboards. While this traditional approach can typically catch failure modes with obvious service-impacting symptoms, it has two limitations:


Busting the Browser's Cache

A new release of your web service has just rolled out with some awesome new features and countless bug fixes. A few days later, you get a call: why am I not seeing my what-ch-ma-call-it on my thing-a-ma-gig? After setting up that Zoom call, it is clear that the browser has cached old code, so you ask the person to hard reload the page with Ctrl-F5. Unless it's a Mac, in which case you need Command-Shift-R. And with IE you have to click Refresh while holding Shift. You need to do this on the other page as well. Meet the browser cache, the bane of web service developers!
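A common way out is cache busting: embed a content hash in each asset's URL so a new release naturally forces a fresh fetch. Here is a minimal sketch of the build-time half of that idea in Python (the file names and helper are hypothetical, not part of any particular build tool):

```python
import hashlib
from pathlib import Path

def hashed_name(path: Path) -> str:
    """Return a versioned file name like app.3f2a1b9c.js based on content."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:8]
    return f"{path.stem}.{digest}{path.suffix}"

# At build time, rename each asset and rewrite references in the HTML,
# e.g. <script src="app.js"> becomes <script src="app.3f2a1b9c.js">.
# Because the URL changes whenever the content changes, browsers can
# never serve a stale copy -- no Ctrl-F5 required.
```

Because unchanged files keep the same hash, their cached copies stay valid; only assets that actually changed in the release get re-downloaded.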


Is Autonomous monitoring the anomaly detection you actually wanted?

April 15, 2020 | Larry Lancaster

Automatically Spot Critical Incidents and Show Me Root Cause

That's what I wanted from a tool when I first heard of anomaly detection. I wanted it to do this based only on the logs and metrics it ingests, and alert me right away, with all this context baked in...


Anomaly Detection as a foundation of Autonomous Monitoring

April 6, 2020 | Ajay Singh

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.


A Prometheus fork for cloud scale anomaly detection across metrics & logs

March 23, 2020 | Anil Nanduri

Introduction

At Zebrium, we provide an Autonomous Monitoring service that automatically detects anomalies within logs and metrics. We started by correlating anomalies across log streams to automatically raise incidents that require our users' attention. Now, we have taken the next step, augmenting our incident detection by detecting anomalies within a group of related metrics and correlating those with anomalies found in logs.


What Is an ML Detected Software Incident?

March 10, 2020 | Ajay Singh

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks. You can also try it yourself by signing up for a free account.


Using Autonomous Monitoring with Litmus Chaos Engine on Kubernetes

March 6, 2020 | David Gildeh

A few months ago, our friends at MayaData joined our private beta to give our Autonomous Log Monitoring platform a test run. During the test, they used their newly created Litmus Chaos Engine to generate issues in their Kubernetes cluster, and our machine learning detected all of them, completely unsupervised. Needless to say, they were impressed!


The Future of Monitoring is Autonomous

March 1, 2020 | David Gildeh

TL;DR

Monitoring today puts far too much burden on DevOps and developers. These teams spend countless hours staring at dashboards, hunting through logs, and maintaining fragile alert rules. Fortunately, unsupervised machine learning can be applied to logs and metrics to autonomously detect and find the root cause of critical incidents. Read more below, or start using our Autonomous Monitoring Platform for free - it takes less than 2 minutes to get started.  

Introduction

Monitoring today is extremely human driven. The only thing we’ve automated with monitoring to date is the ability to alert on rules that watch for specific metrics and events that occur when something known goes wrong. Everything else - building parsing rules, configuring and maintaining dashboards and alerts, and troubleshooting incidents - requires a lot of manual effort from expert operators that intuitively know and understand the system being monitored.


Designing a RESTful API Framework

February 6, 2020 | Alan Jones

As the principal responsible for the design of middleware software at Zebrium, I’m writing to share some of the choices we made and how they have held up. Middleware in this context means the business-logic that sits between persistent storage and a web-based user interface.


How Fluentd collects Kubernetes metadata

January 30, 2020 | Brady Zuo

As part of my job, I recently had to modify Fluentd to be able to stream logs to our (Zebrium) Autonomous Log Monitoring platform. In order to do this, I needed to first understand how Fluentd collected Kubernetes metadata. I thought that what I learned might be useful/interesting to others and so decided to write this blog.


Part 1 - Machine learning for logs

January 24, 2020 | David Gildeh

In our last blog we discussed the need for Autonomous Monitoring solutions covering the three pillars of observability (metrics, traces and logs). At Zebrium we have started with logs (but stay tuned for more). This is because logs generally represent the most comprehensive source of truth during incidents, and are widely used to search for the root cause.


Getting anomaly detection right by structuring logs automatically

January 3, 2020 | Ajay Singh

Observability means being able to infer the internal state of your system through knowledge of external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). As to the last one, a growing chorus of voices strongly advocates for structuring log events upfront. Why? Well, to pick a few reasons - without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard to maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
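To make the "structure upfront" argument concrete, compare a free-text log line with a structured one. A minimal sketch in Python (the event and field names here are invented for illustration):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

txn_id, latency_ms = "txn-4821", 352

# Unstructured: extracting the latency later means writing a regex
# that breaks the moment the wording of the message changes.
log.info(f"transaction {txn_id} completed in {latency_ms} ms")

# Structured: every field is addressable by name, so correlating
# events with metrics (e.g. by transaction ID) is a simple key lookup.
log.info(json.dumps({"event": "txn_completed",
                     "txn_id": txn_id,
                     "latency_ms": latency_ms}))
```

The structured form also survives multi-line payloads (a stack trace can simply be another field) instead of being split apart by the line-oriented text pipeline.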


Do your logs feel like a magic 8 ball?

December 17, 2019 | Ajay Singh

Logs are the source of truth when trying to uncover latent problems in a software system. They are usually too messy and voluminous to analyze proactively, so they are used mostly for reactive troubleshooting once a problem is known to have occurred.


Using machine learning to detect anomalies in logs

November 25, 2019 | Larry Lancaster

At Zebrium, we have a saying: “Structure First”. We talk a lot about structuring because it allows us to do amazing things with log data. But most people don’t know what we mean when we say the word “structure”, or why it allows for amazing things like anomaly detection. This is a gentle and intuitive introduction to “structure” as we mean it.


Autonomous log monitoring for Kubernetes

November 18, 2019 | Gavin Cohen

Kubernetes makes it easy to deploy, manage and scale large distributed applications. But what happens when something goes wrong with an app? And how do you even know? We hear variations on this all the time: “It was only when customers started complaining that we realized our service had degraded”, “A simple authentication problem stopped customers from logging in. It took six hours to resolve.”, and so on.


Using machine learning to shine a light inside the monitoring black box

October 24, 2019 | Ajay Singh

A widely prevalent application monitoring strategy today is sometimes described as “black box” monitoring. Black box monitoring focuses just on externally visible symptoms, including those that approximate the user experience. Black box monitoring is a good way to know when things are broken.


The hidden complexity of hiding complexity

October 22, 2019 | Gavin Cohen

Kubernetes and other orchestration tools use abstraction to hide complexity. Deploying, managing and scaling a distributed application are made easy. But what happens when something goes wrong? And, when it does, do you even know?


Using ML and logs to catch problems in a distributed Kubernetes deployment

October 3, 2019 | Ajay Singh

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in k8s environments. There’s usually a mix of home-grown, third-party and OSS components – taking more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools. Centralization also adds confusion by breaking up connected events (such as multi-line stack traces) in the interleaved output from multiple sources.


Catching Faults Missed by APM and Monitoring tools

August 19, 2019 | Ajay Singh

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems – albeit with pre-defined instrumentation. But it’s hard to catch new (unknown) software problems. 

A quick, free and easy way to find anomalies in your logs


Deploying into Production: The need for a Red Light

July 23, 2019 | Larry Lancaster

As scale and complexity grow, there are diminishing returns from pre-deployment testing. A test writer cannot envision the combinatorial explosion of coincidences that yield calamity. We must accept that deploying into production is the only definitive test.


Using ML to auto-learn changing log structures

July 14, 2019 | David Adamson

Software log messages are potential goldmines of information, but their lack of explicit structure makes them difficult to programmatically analyze. Tasks as common as accessing (or creating an alert on) a metric in a log message require carefully crafted regexes that can easily capture the wrong data by accident (or break silently because of changing log formats across software versions). But there’s an even bigger prize buried within logs – the possibility of using event patterns to learn what’s normal and what’s anomalous. 
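To make the fragility concrete, here is a small sketch (the log messages are invented for illustration) of a regex-based metric extractor that breaks silently when the wording of a log line changes between software versions:

```python
import re

# The extractor was written against the v1 log format...
v1 = "cache flush completed in 152 ms"
# ...but v2 reworded the message, and the regex quietly matches nothing.
v2 = "cache flush done after 152ms"

pattern = re.compile(r"completed in (\d+) ms")

def extract_latency(line: str):
    """Return the latency in ms, or None if the pattern does not match."""
    m = pattern.search(line)
    return int(m.group(1)) if m else None

print(extract_latency(v1))  # 152
print(extract_latency(v2))  # None -- any alert built on this metric goes dark
```

Nothing errors out in the second case; the metric simply stops arriving, which is exactly the kind of silent breakage that makes hand-maintained parsing rules so costly.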

Why understand log structure at all?


Please don't make me structure logs!

June 27, 2019 | Rod Bagg

As either a developer or a member of a DevOps team, you have undoubtedly dealt with logs; probably lots and lots of messy logs. It's one of the first things we all look to when trying to get to the bottom of an issue and determine root cause.


Reliable signatures to detect known software faults

May 22, 2019 | Gavin Cohen

Have you ever spent time tracking down a bug or failure, only to find you’ve seen it before? Or a variation of this problem: at the end of an automated test run, you have to spend time triaging each failure, even though many are caused by the same bug. All this can impact productivity, especially in continuous integration and continuous deployment (CI/CD) environments, where things change rapidly.


Perfectly structuring logs without parsing

May 16, 2019 | Gavin Cohen

Developers and testers constantly use log files and metrics to find and troubleshoot failures. But their lack of structure makes extracting useful information without data wrangling, regexes and parsing scripts a challenge.


Troubleshooting the easy way

February 9, 2019 | Gavin Cohen

It takes great skill, tenacity and sometimes blind luck to find the root cause of a technical issue. And for complex problems, more often than not, it involves leveraging log files, metrics and traces. Whether you’re a tester triaging problems found during automated testing, or a developer assisting with a critical escalation, dealing with data is painful.


Product analytics at your fingertips

December 11, 2018 | Gavin Cohen

According to Gartner, product analytics “help manufacturers evaluate product defects, identify opportunities for product improvements, detect patterns in usage or capacity of products, and link all these factors to customers". The benefits are clear. But there are barriers – product analytics are expensive in terms of time, people and expertise.


Structure is Strategic

October 31, 2018 | Larry Lancaster

We structure machine data at scale

Zebrium helps dev and test engineers find hidden issues in tests that “pass”, find root-cause faster than ever, and validate builds with self-maintaining problem signatures. We ingest, structure, and auto-analyze machine data - logs, stats, and config - collected from test runs.

