Using ML and logs to catch problems in a distributed Kubernetes deployment

October 3, 2019 | Ajay Singh

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in Kubernetes environments. There is usually a mix of home-grown, third-party, and open-source components, which takes more effort to normalize, parse, and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn’t work in a Kubernetes app with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools, and centralization adds confusion of its own: interleaving the output of multiple sources breaks up connected events such as multi-line stack traces.
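To make the stack-trace problem concrete, here is a minimal sketch (not any particular tool's implementation) of regrouping a single container's log stream, using the heuristic that a new record starts with a timestamp and continuation lines do not. The timestamp format and sample lines are assumptions for illustration:

```python
import re

# Heuristic: a new log record starts with an ISO-style timestamp;
# continuation lines (e.g. stack-trace frames) do not. The format
# here is an assumption -- real apps vary.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

def group_records(lines):
    """Reassemble multi-line events from one container's log stream."""
    record = []
    for line in lines:
        if RECORD_START.match(line) and record:
            yield "\n".join(record)
            record = []
        record.append(line)
    if record:
        yield "\n".join(record)

stream = [
    "2019-10-03T10:00:00 INFO starting worker",
    "2019-10-03T10:00:01 ERROR request failed",
    "java.lang.NullPointerException",
    "    at com.example.Handler.handle(Handler.java:42)",
    "2019-10-03T10:00:02 INFO retrying",
]
records = list(group_records(stream))
# records[1] is the error line together with its two stack-trace lines
```

Note this only works per source stream; once output from many containers is interleaved in a central store, the heuristic breaks down, which is exactly the confusion described above.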

The biggest problem is that logs are still treated as a passive repository of semi-structured text. The typical workflow is to rely on a separate monitoring tool to detect a symptom, and then on human intuition to search the logs until the problem is finally identified. This is unfortunate, because logs contain a rich and broad trail of events and embedded metrics, including the leading indicators of many problems before they become severe enough to impact users (and trip monitoring alerts). The challenge is that their large volume and lack of consistent structure make it hard to proactively extract insights, particularly in a fast-changing Kubernetes environment.
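As a small illustration of the "embedded metrics" point, key=value pairs can often be pulled straight out of log lines and trended as leading indicators. The line format and metric names (`latency_ms`, `queue_depth`) below are invented for this sketch; production parsers are far more robust:

```python
import re

# Matches simple key=value pairs; the metric names in the sample
# line are hypothetical, not from any particular application.
METRIC = re.compile(r"(?P<name>[a-z_]+)=(?P<value>\d+(?:\.\d+)?)")

def embedded_metrics(line):
    """Extract numeric key=value metrics embedded in a log line."""
    return {m.group("name"): float(m.group("value"))
            for m in METRIC.finditer(line)}

line = "2019-10-03T10:00:01 WARN slow request latency_ms=950 queue_depth=12"
print(embedded_metrics(line))  # {'latency_ms': 950.0, 'queue_depth': 12.0}
```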

What you need is a tool that automatically tells you when something anomalous (or known to be bad) is happening in the logs. When a serious problem ripples through multiple services, it should take you to the leading edge of the ripple: the root cause of the entire problem. And since you will always encounter new problems, all of this should work without having to manually instrument everything or pre-build queries.

This is where machine learning helps. First, machine learning can build the underlying event dictionary of all unique event types generated by a distributed app. Even for a large app that generates hundreds of millions of events a day, this foundational dictionary typically consists of only a few thousand unique event types. That makes it practical for a second layer of machine learning to learn the normal patterns and behavior of each event type: whether it has ever been seen before, and if so, its normal frequency, periodicity, severity, correlations, and the typical values of metrics embedded in the events.
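A rough way to picture the event dictionary: mask the variable tokens (timestamps, IDs, numbers) in each line so that lines differing only in those details collapse to one event type. The toy templating function below is a simplified stand-in for the learned dictionary, not the actual algorithm:

```python
import re

def event_type(line):
    """Reduce a raw log line to a template by masking variable tokens.
    A crude stand-in for a learned event dictionary."""
    line = re.sub(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*", "<TS>", line)
    line = re.sub(r"\b0x[0-9a-f]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+(?:\.\d+)?\b", "<NUM>", line)
    return line

lines = [
    "2019-10-03T10:00:01 GET /orders/123 took 95 ms",
    "2019-10-03T10:07:44 GET /orders/456 took 102 ms",
]
templates = {event_type(l) for l in lines}
# Both lines collapse to the single event type
# "<TS> GET /orders/<NUM> took <NUM> ms"
```

This collapsing is why hundreds of millions of raw events can reduce to a few thousand types: the variable parts carry the metric values, while the fixed skeleton identifies the event.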

This means that when something goes wrong, anomalous patterns based on event types are detected automatically, with a much higher signal-to-noise ratio than brute-force approaches that rely on detecting spikes in keywords or errors. The patterns detected include a normally occurring event that has stopped (as in the example mentioned here), an extremely rare event being seen, or a sudden change in the frequency, periodicity, or severity of some event types.
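As a sketch of the idea (simplified far beyond what the second ML layer actually does), per-window counts of each event type can be compared against their history: a steady event dropping to zero, a never-seen event appearing, or a count deviating by several standard deviations all get flagged. The event names, counts, and threshold below are illustrative:

```python
from collections import Counter
from statistics import mean, pstdev

def anomalies(history, current, z_thresh=3.0):
    """Flag event types whose count in the current window deviates
    strongly from historical per-window counts (toy model)."""
    flagged = {}
    types = set(current) | set().union(*history)
    for t in types:
        counts = [c.get(t, 0) for c in history]
        mu, sigma = mean(counts), pstdev(counts)
        x = current.get(t, 0)
        if sigma == 0:
            if x != mu:  # a perfectly steady event stopped, or a new one appeared
                flagged[t] = x
        elif abs(x - mu) / sigma > z_thresh:
            flagged[t] = x
    return flagged

# 20 historical windows: a steady heartbeat plus mildly varying request counts
history = [Counter({"heartbeat": 60, "request": 100 + (i % 5)}) for i in range(20)]
current = Counter({"request": 98})  # the heartbeat has stopped
print(anomalies(history, current))  # {'heartbeat': 0}
```

Note that the slightly low `request` count is correctly left alone, while the missing heartbeat is flagged; that asymmetry is the signal-to-noise advantage over keyword-spike counting.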

What if something bad happens to a foundational service, and the problem ripples through many dependent services, generating a plethora of rare errors or events? A simplistic approach would overwhelm a human with a swarm of alerts. This is where a third layer of machine learning comes into play: understanding the correlation between anomalies, including across multiple services, with the goal of catching the leading edge (the service that first went south and is most likely the culprit).
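A heavily simplified sketch of the leading-edge idea: once anomalies are correlated into a single burst, the earliest anomalous service is surfaced as the likely root cause instead of alerting on every downstream symptom. The services, timestamps, and correlation window below are made up for illustration:

```python
from datetime import datetime, timedelta

# Toy anomaly records: (timestamp, service, event). In this made-up
# scenario the database fails first and dependents error shortly after.
t0 = datetime(2019, 10, 3, 10, 0, 0)
anoms = [
    (t0 + timedelta(seconds=0), "postgres", "connection pool exhausted"),
    (t0 + timedelta(seconds=4), "orders-api", "db timeout"),
    (t0 + timedelta(seconds=5), "cart-api", "db timeout"),
    (t0 + timedelta(seconds=9), "frontend", "upstream 502"),
]

def leading_edge(anomalies, window=timedelta(seconds=30)):
    """Group anomalies that fall within one correlation window and
    surface the earliest as the likely root cause (toy heuristic)."""
    anomalies = sorted(anomalies)
    first = anomalies[0]
    burst = [a for a in anomalies if a[0] - first[0] <= window]
    return first, burst

root, burst = leading_edge(anoms)
print(root[1])  # postgres
```

A real correlation layer weighs how unusual each anomaly is, not just its timestamp, but the output shape is the same: one root-cause candidate instead of four alerts.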

The graphic below shows the inner workings in a visual heatmap:

[Figure: anomaly heatmap, with services along the horizontal axis and time along the vertical axis]

The horizontal dimension shows events happening in multiple services, while the vertical dimension is time. The brightness of the dots represents the level of surprise: events occurring at a normal cadence show up as greys, surprising events are blue, and the most surprising of all are white. A vertical grey stripe that stops means an event that typically occurs in a healthy system has stopped happening. A horizontal stripe across services means a systemic problem cascading across the system; in that case, the leading edge of the problem is the service that first exhibited highly unusual anomalies.

In the (real) picture above, this leading edge is the two events near the left edge of the picture, which were related to an outage in the underlying database service of a multi-service application. This approach is remarkably successful in catching problems that a human missed (and that have not yet manifested in monitoring/APM tools). It also does an extremely good job of suppressing noise, typically alerting an operator about only one in several million events as a problem deserving attention.
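One way to think about the "level of surprise" driving the brightness (an interpretation for intuition, not necessarily the product's actual scoring) is Shannon surprisal, -log2 p: common event types score near zero, while rare or never-before-seen types score highest. The event names and counts are invented:

```python
import math
from collections import Counter

def surprisal(event, counts, total):
    """Shannon surprisal -log2 p(event), with add-one smoothing so a
    never-seen event gets the highest (but finite) score."""
    p = (counts.get(event, 0) + 1) / (total + len(counts) + 1)
    return -math.log2(p)

# Hypothetical historical counts of event types for one service
counts = Counter({"heartbeat": 10_000, "request": 5_000, "warn_retry": 3})
total = sum(counts.values())

print(surprisal("heartbeat", counts, total))     # common  -> low score (grey)
print(surprisal("warn_retry", counts, total))    # rare    -> high score (blue)
print(surprisal("disk_failure", counts, total))  # unseen  -> highest score (white)
```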

A free, fully functional version of this service is now in beta for Kubernetes users. It takes two kubectl commands and less than 5 minutes to get going.

Please click here to try it out.

Tags: machine learning, logs, k8s, kubernetes