Zebrium Blog

Anomaly Detection as a foundation of Autonomous Monitoring

April 6, 2020 | Ajay Singh

We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.

Read More

A Prometheus fork for cloud scale anomaly detection across metrics & logs

March 23, 2020 | Anil Nanduri

Introduction

At Zebrium, we provide an Autonomous Monitoring service that automatically detects anomalies within logs and metrics. We started by correlating anomalies across log streams to automatically raise incidents that require our users' attention. Now we have taken a further step, augmenting our incident detection by finding anomalies within groups of related metrics and correlating those with anomalies found in logs.
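
As a rough illustration of the metrics side of this idea - a generic rolling z-score check, not Zebrium's actual algorithm - a single metric series can be flagged when a sample deviates sharply from its recent history. The window size and threshold below are arbitrary.

```python
from collections import deque
from statistics import mean, stdev

def rolling_zscore_anomalies(samples, window=60, threshold=3.0):
    """Flag samples that deviate sharply from a trailing window.

    A generic illustration of metric anomaly detection, not the
    algorithm used by Zebrium.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                if value != mu:
                    anomalies.append((i, value))
            elif abs(value - mu) / sigma > threshold:
                anomalies.append((i, value))
        history.append(value)
    return anomalies

# A flat series with a single spike: only the spike is flagged.
series = [10.0] * 100 + [95.0] + [10.0] * 20
print(rolling_zscore_anomalies(series))   # [(100, 95.0)]
```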

Read More

What Is an ML Detected Software Incident?

March 10, 2020 | Ajay Singh

Based on our experience with hundreds of incidents across nearly a hundred unique application stacks, we have developed deep insights into the specific ways modern software breaks. This led us to thoughtfully design a multi-layer machine learning stack that can reliably detect these patterns, and identify the collection of events that describes each incident. In simple terms, here is what we have learned about real-world incidents when software breaks. You can also try it yourself by signing up for a free account.

Read More

Using Autonomous Monitoring with Litmus Chaos Engine on Kubernetes

March 6, 2020 | David Gildeh

A few months ago, our friends at MayaData joined our private beta to give our Autonomous Log Monitoring platform a test run. During the test, they used their newly created Litmus Chaos Engine to generate issues in their Kubernetes cluster, and our machine learning detected all of them, completely unsupervised. Needless to say, they were impressed!

Read More

Single Sign-On with OAuth

March 6, 2020 | Alan Jones

There is a lot of material on OAuth (OAuth2, OpenID Connect, and friends), but not much of it focuses on applying OAuth to the problem of Single Sign-On (SSO). This post aims to fill that gap.
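
For readers who want the shape of the flow before diving in: most OAuth-based SSO builds on the authorization-code grant, where the application redirects the browser to the identity provider and later exchanges the returned code for tokens. Below is a minimal sketch of constructing that first redirect; the endpoint, client ID and redirect URI are placeholders, not values from this post.

```python
import secrets
from urllib.parse import urlencode

# Placeholder values - substitute your identity provider and app registration.
AUTHORIZE_ENDPOINT = "https://idp.example.com/oauth2/authorize"
CLIENT_ID = "my-client-id"
REDIRECT_URI = "https://app.example.com/callback"

def build_authorization_url():
    """Build the redirect that starts an OAuth2 authorization-code flow.

    The state value must be stored (e.g. in the session) and compared on
    the callback to prevent CSRF.
    """
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": "openid profile email",   # OpenID Connect scopes typical for SSO
        "state": state,
    }
    return f"{AUTHORIZE_ENDPOINT}?{urlencode(params)}", state

url, state = build_authorization_url()
print(url)
# On the callback, the app exchanges the returned `code` for tokens at the
# provider's token endpoint and establishes its own session.
```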

Read More

The Future of Monitoring is Autonomous

March 1, 2020 | David Gildeh

TL;DR

Monitoring today puts far too much burden on DevOps and developers. These teams spend countless hours staring at dashboards, hunting through logs, and maintaining fragile alert rules. Fortunately, unsupervised machine learning can be applied to logs and metrics to autonomously detect and find the root cause of critical incidents. Read more below, or start using our Autonomous Monitoring Platform for free - it takes less than 2 minutes to get started.  

Introduction

Monitoring today is extremely human-driven. The only thing we’ve automated with monitoring to date is the ability to alert on rules that watch for specific metrics and events that occur when something known goes wrong. Everything else - building parsing rules, configuring and maintaining dashboards and alerts, and troubleshooting incidents - requires a lot of manual effort from expert operators who intuitively know and understand the system being monitored.

Read More

Designing a RESTful API Framework

February 6, 2020 | Alan Jones

As the principal responsible for the design of middleware software at Zebrium, I’m writing to share some of the choices we made and how they have held up. Middleware in this context means the business-logic that sits between persistent storage and a web-based user interface.

Read More

How Fluentd collects Kubernetes metadata

January 30, 2020 | Brady Zuo

As part of my job, I recently had to modify Fluentd to be able to stream logs to our (Zebrium) Autonomous Log Monitoring platform. In order to do this, I needed to first understand how Fluentd collected Kubernetes metadata. I thought that what I learned might be useful/interesting to others and so decided to write this blog.
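
For context, container logs on a Kubernetes node are written under /var/log/containers/ with the pod, namespace and container encoded in the file name, and that name is the starting point for metadata enrichment before the Kubernetes API is consulted. The sketch below only illustrates that naming convention; the regex and field names are mine, not taken from Fluentd's source, and the exact pattern can vary by container runtime.

```python
import re
from pathlib import Path
from typing import Optional

# Kubelet symlinks container logs as:
#   /var/log/containers/<pod>_<namespace>_<container>-<container_id>.log
# The field names here are illustrative, not Fluentd's exact record schema.
LOG_NAME_RE = re.compile(
    r"^(?P<pod_name>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container_name>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)

def kubernetes_metadata_from_path(path: str) -> Optional[dict]:
    """Recover basic Kubernetes metadata from a container log file name."""
    match = LOG_NAME_RE.match(Path(path).name)
    return match.groupdict() if match else None

example = "/var/log/containers/nginx-7c5ddbdf54-x2k4q_default_nginx-" + "a" * 64 + ".log"
print(kubernetes_metadata_from_path(example))
# {'pod_name': 'nginx-7c5ddbdf54-x2k4q', 'namespace': 'default',
#  'container_name': 'nginx', 'container_id': 'aaaa...'}
```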

Read More

Getting anomaly detection right by structuring logs automatically

January 3, 2020 | Ajay Singh

Observability means being able to infer the internal state of your system through knowledge of external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). As to the last one, a growing chorus of voices strongly advocates for structuring log events upfront. Why? Well, to pick a few reasons - without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard to maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
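
Here is a small, hypothetical example (the event format and field names are invented for illustration) of what that difference feels like in practice: pulling a latency value out of a free-text line with a regex versus reading it from an already-structured event.

```python
import json
import re

raw_line = "Jan  3 10:15:02 api-7f9 svc[312]: request id=abc123 completed in 542 ms"

# Unstructured: a hand-written regex that breaks silently if the wording changes.
m = re.search(r"request id=(\w+) completed in (\d+) ms", raw_line)
txn_id, latency_ms = (m.group(1), int(m.group(2))) if m else (None, None)

# Structured: the same event carried as key/value fields.
structured = '{"ts": "2020-01-03T10:15:02Z", "txn_id": "abc123", "latency_ms": 542}'
event = json.loads(structured)
latency_ms = event["latency_ms"]   # no regex, and easy to correlate by txn_id
print(txn_id, latency_ms)
```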

Read More

Do your logs feel like a magic 8 ball?

December 17, 2019 | Ajay Singh

Logs are the source of truth when trying to uncover latent problems in a software system. But they are usually too messy and voluminous to analyze proactively, so they are used mostly for reactive troubleshooting once a problem is known to have occurred. Is searching through logs for the root cause really the right approach?

Read More

Autonomous log monitoring for Kubernetes

November 18, 2019 | Gavin Cohen

Kubernetes makes it easy to deploy, manage and scale large distributed applications. But what happens when something goes wrong with an app? And how do you even know? We hear variations on this all the time: “It was only when customers started complaining that we realized our service had degraded”, “A simple authentication problem stopped customers logging in. It took six hours to resolve.”, and so on.

Read More

Using machine learning to shine a light inside the monitoring black box

October 24, 2019 | Ajay Singh

A widely prevalent application monitoring strategy today is sometimes described as “black box” monitoring, which focuses just on externally visible symptoms, including those that approximate the user experience. It is a good way to know when things are broken.

Read More

The hidden complexity of hiding complexity

October 22, 2019 | Gavin Cohen

Kubernetes and other orchestration tools use abstraction to hide complexity. Deploying, managing and scaling a distributed application are made easy. But what happens when something goes wrong? And, when it does, do you even know?

Read More

Using ML and logs to catch problems in a distributed Kubernetes deployment

October 3, 2019 | Ajay Singh

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in k8s environments. There’s usually a mix of home-grown, 3rd party and OSS components, so it takes more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools, and centralization adds confusion by breaking up connected events (such as multi-line stack traces) and interleaving output from multiple sources.
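
To make the multi-line point concrete, here is a toy illustration (the timestamp heuristic and log lines are invented, not how any particular collector is configured): a new event is assumed to start with a timestamp, and everything else - such as traceback frames - is folded back into the preceding event before streams from many containers get interleaved.

```python
import re

# Heuristic used by many multi-line log configurations: a new event starts
# with a timestamp; any other line is a continuation of the previous event.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def group_multiline_events(lines):
    """Reassemble multi-line events (e.g. stack traces). Illustration only."""
    events = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events

log = [
    "2019-10-03 14:07:21 ERROR worker-3 unhandled exception in handler",
    "Traceback (most recent call last):",
    '  File "app.py", line 42, in handle',
    "    result = charge(card)",
    "ValueError: card expired",
    "2019-10-03 14:07:22 INFO worker-1 request completed in 18 ms",
]
assert len(group_multiline_events(log)) == 2   # the traceback stays with its event
```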

Read More

Catching Faults Missed by APM and Monitoring tools

August 19, 2019 | Ajay Singh

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems – albeit with pre-defined instrumentation. But it’s hard to catch new (unknown) software problems. 

A quick, free and easy way to find anomalies in your logs

Read More

Deploying into Production: The need for a Red Light

July 23, 2019 | Larry Lancaster

As scale and complexity grow, there are diminishing returns from pre-deployment testing. A test writer cannot envision the combinatorial explosion of coincidences that yield calamity. We must accept that deploying into production is the only definitive test.

Read More

Using ML to auto-learn changing log structures

July 14, 2019 | David Adamson

Software log messages are potential goldmines of information, but their lack of explicit structure makes them difficult to programmatically analyze. Tasks as common as accessing (or creating an alert on) a metric in a log message require carefully crafted regexes that can easily capture the wrong data by accident (or break silently because of changing log formats across software versions). But there’s an even bigger prize buried within logs – the possibility of using event patterns to learn what’s normal and what’s anomalous. 
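
A toy way to see the structure hiding in log events (purely an illustration of the idea, not Zebrium's approach): mask the tokens that look like variables, and what remains is the fixed wording that identifies the event type.

```python
import re
from collections import Counter

# Toy heuristic: treat numbers and hex-like IDs as variable parameters.
VARIABLE_TOKEN = re.compile(r"\b(?:0x[0-9a-fA-F]+|[0-9a-fA-F]{8,}|\d+)\b")

def template_of(line: str) -> str:
    """Collapse a log line to its fixed wording (the 'event type')."""
    return VARIABLE_TOKEN.sub("<*>", line)

lines = [
    "reconnecting to db-0 after 31 ms",
    "reconnecting to db-1 after 907 ms",
    "checkpoint 7f3a9c2e completed in 12 ms",
]
print(Counter(template_of(l) for l in lines))
# Counter({'reconnecting to db-<*> after <*> ms': 2,
#          'checkpoint <*> completed in <*> ms': 1})
```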

Why understand log structure at all?

Read More

Please don't make me structure logs!

June 27, 2019 | Rod Bagg

As either a developer or a member of a DevOps team, you have undoubtedly dealt with logs; probably lots and lots of messy logs. It's one of the first things we all look to when trying to get to the bottom of an issue and determine root cause.

Read More

Reliable signatures to detect known software faults

May 22, 2019 | Gavin Cohen

Have you ever spent time tracking down a bug or failure, only to find you’ve seen it before? Or a variation of this problem: at the completion of an automated test run, you have to spend time triaging each failure, even though many are caused by the same bug. All this can impact productivity, especially in continuous integration and continuous deployment (CI/CD) environments, where things change rapidly.
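
One way to picture such a signature (the scheme below is invented for illustration, not the mechanism described in the post): fingerprint a failure by the set of event types it produced, so that a repeat occurrence hashes to the same value and can be matched to a known bug instead of being re-triaged.

```python
import hashlib

def failure_signature(event_templates):
    """Fingerprint a failure by the set of event types it produced.

    Order-insensitive and independent of variable values, so two runs
    hitting the same bug produce the same signature. Illustrative only.
    """
    canonical = "\n".join(sorted(set(event_templates)))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

run_a = ["db connection reset by peer", "retry budget exhausted", "request failed"]
run_b = ["request failed", "db connection reset by peer", "retry budget exhausted"]
assert failure_signature(run_a) == failure_signature(run_b)

known_bugs = {failure_signature(run_a): "BUG-1234: flaky db failover"}
print(known_bugs.get(failure_signature(run_b), "new failure - needs triage"))
```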

Read More

Perfectly structuring logs without parsing

May 16, 2019 | Gavin Cohen

Developers and testers constantly use log files and metrics to find and troubleshoot failures. But their lack of structure makes extracting useful information without data wrangling, regexes and parsing scripts a challenge.

Read More

Troubleshooting the easy way

February 9, 2019 | Gavin Cohen

It takes great skill, tenacity and sometimes blind luck to find the root cause of a technical issue. For complex problems, more often than not, it involves leveraging log files, metrics and traces. Whether you’re a tester triaging problems found during automated test, or a developer assisting with a critical escalation, dealing with data is painful. Zebrium has created a better way!

Read More

Product analytics at your fingertips

December 11, 2018 | Gavin Cohen

According to Gartner, product analytics “help manufacturers evaluate product defects, identify opportunities for product improvements, detect patterns in usage or capacity of products, and link all these factors to customers”. The benefits are clear. But there are barriers – product analytics are expensive in terms of time, people and expertise.

Read More

Structure is Strategic

October 31, 2018 | Larry Lancaster

We structure machine data at scale

Zebrium helps dev and test engineers find hidden issues in tests that “pass”, find root cause faster than ever, and validate builds with self-maintaining problem signatures. We ingest, structure, and auto-analyze machine data - logs, stats, and config - collected from test runs.

Read More
