Zebrium Blog

Using ML and logs to catch problems in a distributed Kubernetes deployment

October 3, 2019 | Ajay Singh

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in Kubernetes (k8s) environments. There’s usually a mix of home-grown, third-party and OSS components, which takes more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools. Centralization also adds confusion by breaking up connected events (such as multi-line stack traces) across interleaved output from multiple sources.
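
To make that interleaving problem concrete, here is a minimal Python sketch, offered as an illustration rather than Zebrium's approach: it regroups centralized log lines by their source container and stitches indented continuation lines (such as stack trace frames) back into single events. The record layout, with "container" and "message" fields, is assumed for the example.

```python
# Minimal sketch: regroup interleaved, centralized log lines by their source
# container and stitch multi-line events (e.g. stack traces) back together.
# Assumes each record is a dict with "container" and "message" keys.
from collections import defaultdict

def is_continuation(message):
    # Heuristic: indented lines (stack frames, wrapped output) continue the
    # previous event from the same source.
    return message.startswith((" ", "\t"))

def reassemble(records):
    open_events = defaultdict(list)  # container -> lines of the event being built
    for rec in records:
        src, msg = rec["container"], rec["message"]
        if is_continuation(msg) and open_events[src]:
            open_events[src].append(msg)
        else:
            if open_events[src]:
                yield src, "\n".join(open_events[src])
            open_events[src] = [msg]
    for src, lines in open_events.items():
        if lines:
            yield src, "\n".join(lines)

# Two pods interleaved in one centralized stream:
stream = [
    {"container": "api-7f9c", "message": "Traceback (most recent call last):"},
    {"container": "worker-12", "message": "job 42 started"},
    {"container": "api-7f9c", "message": '  File "app.py", line 10, in handle'},
    {"container": "api-7f9c", "message": "    user = users[user_id]"},
]
for src, event in reassemble(stream):
    print(src, repr(event))
```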

Read More

Catching Faults Missed by APM and Monitoring tools

August 19, 2019 | Ajay Singh

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems – albeit with pre-defined instrumentation. But it’s hard to catch new (unknown) software problems. 

A quick, free and easy way to find anomalies in your logs

Read More

Deploying into Production: The need for a Red Light

July 23, 2019 | Larry Lancaster

As scale and complexity grow, there are diminishing returns from pre-deployment testing. A test writer cannot envision the combinatorial explosion of coincidences that yield calamity. We must accept that deploying into production is the only definitive test.

Read More

Using ML to auto-learn changing log structures

July 14, 2019 | David Adamson

Software log messages are potential goldmines of information, but their lack of explicit structure makes them difficult to programmatically analyze. Tasks as common as accessing (or creating an alert on) a metric in a log message require carefully crafted regexes that can easily capture the wrong data by accident (or break silently because of changing log formats across software versions). But there’s an even bigger prize buried within logs – the possibility of using event patterns to learn what’s normal and what’s anomalous. 
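
To make that fragility concrete, here is a small hypothetical example; the log lines and the regex are invented for illustration. A pattern written against one release's format can stop matching after a minor change, and it fails silently rather than raising an error.

```python
# Hypothetical example of a fragile, hand-crafted metric extraction.
import re

LATENCY_RE = re.compile(r"served in (\d+) ms")

def extract_latency(line):
    match = LATENCY_RE.search(line)
    return int(match.group(1)) if match else None

v1_line = "GET /api/items served in 45 ms"
v2_line = "GET /api/items served in 45ms (cache=miss)"  # format changed in a newer release

print(extract_latency(v1_line))  # 45
print(extract_latency(v2_line))  # None -- no error is raised, the metric just silently disappears
```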

Why understand log structure at all?

Read More

Please don't make me structure logs!

June 27, 2019 | Rod Bagg

As either a developer or a member of a DevOps team, you have undoubtedly dealt with logs; probably lots and lots of messy logs. It's one of the first things we all look to when trying to get to the bottom of an issue and determine root cause.

Read More

Getting anomaly detection right by structuring logs automatically

June 20, 2019 | Ajay Singh

Observability means being able to infer the internal state of your system through knowledge of external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). As to the last one, a growing chorus of voices strongly advocates for structuring log events upfront. Why? Well, to pick a few reasons: without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard-to-maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
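
For a concrete picture of what structuring upfront can look like, here is a minimal sketch using only the Python standard library; the logger name and the field names (transaction_id, latency_ms) are invented for the example and not part of any particular tool.

```python
# Minimal sketch: emit JSON log events so fields such as a transaction ID
# can later be used to join log events with metrics or traces.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    FIELDS = ("transaction_id", "latency_ms")  # structured fields passed via extra=

    def format(self, record):
        event = {"ts": time.time(), "level": record.levelname, "msg": record.getMessage()}
        for key in self.FIELDS:
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment captured", extra={"transaction_id": "txn-481", "latency_ms": 112})
# -> {"ts": ..., "level": "INFO", "msg": "payment captured", "transaction_id": "txn-481", "latency_ms": 112}
```

The catch is that every developer, in every component, has to apply this discipline consistently and upfront, which is the gap the post's title points at: structuring logs automatically rather than by hand.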

Read More

Using machine learning to detect anomalies in logs

June 5, 2019 | Larry Lancaster

At Zebrium, we have a saying: “Structure First”. We talk a lot about structuring because it allows us to do amazing things with log data. But most people don’t know what we mean when we say the word “structure”, or why it allows for amazing things like anomaly detection. This is a gentle and intuitive introduction to “structure” as we mean it.
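
As a rough intuition for what that kind of structure can look like, here is a deliberately simplified sketch (not Zebrium's actual algorithm): the fixed wording of a log line becomes an event type, and the values that change from one occurrence to the next become its parameters.

```python
# Simplified sketch: split a log line into an event type (the fixed wording)
# and its variable parameters, by masking numeric tokens.
import re

VARIABLE = re.compile(r"\b\d+(?:\.\d+)*\b")  # numbers, versions, IPs

def structure(line):
    params = VARIABLE.findall(line)
    event_type = VARIABLE.sub("<*>", line)
    return event_type, params

print(structure("user 1042 logged in from 10.0.0.7"))
# ('user <*> logged in from <*>', ['1042', '10.0.0.7'])
print(structure("user 2211 logged in from 10.0.0.9"))
# Same event type, different parameter values -- which is what makes
# counting, comparing and anomaly detection on events possible.
```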

Read More

Reliable signatures to detect known software faults

May 22, 2019 | Gavin Cohen

Have you ever spent time tracking down a bug or failure, only to find you’ve seen it before? Or a variation of this problem: at the completion of an automated test run, you have to spend time triaging each failure, even though many are caused by the same bug. All of this can impact productivity, especially in continuous integration and continuous deployment (CI/CD) environments, where things change rapidly.
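
One way to picture a signature, as a hypothetical sketch rather than a description of how Zebrium builds them: reduce the log events around a failure to their event types and fingerprint that combination, so two occurrences of the same underlying bug resolve to the same value and only need to be triaged once.

```python
# Hypothetical sketch: fingerprint a failure by the set of event types in
# its logs, so repeated hits of the same bug can be de-duplicated in triage.
import hashlib
import re

NUMBERS = re.compile(r"\d+")

def event_type(line):
    # Mask the variable numeric parts so two occurrences of the same
    # message collapse to one event type.
    return NUMBERS.sub("<*>", line)

def failure_signature(log_lines):
    types = sorted({event_type(line) for line in log_lines})
    return hashlib.sha256("\n".join(types).encode()).hexdigest()[:12]

run_1 = ["worker 7 crashed", "retry 1 failed", "retry 2 failed"]
run_2 = ["worker 9 crashed", "retry 1 failed"]

# Both runs reduce to the same two event types, so they share a signature
# and can be filed against the same known bug.
print(failure_signature(run_1) == failure_signature(run_2))  # True
```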

Read More

Perfectly structuring logs without parsing

May 16, 2019 | Gavin Cohen

Developers and testers constantly use log files and metrics to find and troubleshoot failures. But the lack of structure in that data makes it a challenge to extract useful information without data wrangling, regexes and parsing scripts.

Read More

Troubleshooting the easy way

February 9, 2019 | Gavin Cohen

It takes great skill, tenacity and sometimes blind luck to find the root cause of a technical issue. And for complex problems, more often than not, it involves leveraging log files, metrics and traces. Whether you’re a tester triaging problems found during automated testing, or a developer assisting with a critical escalation, dealing with all of that data is painful. Zebrium has created a better way!

Read More