Kubernetes and other orchestration tools use abstraction to hide complexity, making it easy to deploy, manage, and scale a distributed application. But what happens when something goes wrong? And when it does, do you even know?
We hear this all the time: “We had an outage last month, but what was really bad is we didn’t know until our customers told us”, “a simple authentication problem stopped customers from logging in for six hours”, and so on. Recent issues at Stripe and Cloudflare further demonstrate this.
What’s surprising is that these occurred at companies with world-class teams using world-class orchestration and monitoring tools. So, what went wrong? Simply put, these are all highly complex distributed applications in which an almost infinite number of possible interactions can take place. This makes detecting and understanding problems extremely challenging.
So how do you detect and solve problems in such environments? If you had an infinite amount of time and resources, a complete analysis (including correlations) across all logs and metrics would likely give you the answers. But the sheer data volume from each instance of each component (and microservice) makes this impractical. Instead, the best that can be done is to build logic ahead of time to detect known problems and symptoms. This is where monitoring tools come in.
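That “logic built ahead of time” typically amounts to predefined checks against chosen thresholds. A minimal sketch of such a rule (the function, threshold, and sample values here are purely illustrative, not taken from any particular monitoring tool):

```python
# A hypothetical static monitoring rule: alert when 95th-percentile
# latency breaches a threshold that was chosen ahead of time.
def check_latency(samples_ms, threshold_ms=500):
    """Return True (alert) if p95 latency exceeds the threshold."""
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 > threshold_ms

# A single outlier does not trip the rule...
check_latency([120, 130, 140, 150, 900], 500)   # no alert
# ...but a sustained breach does.
check_latency([600] * 20, 500)                  # alert
```

The limitation is baked in: the rule only fires on the symptom someone anticipated, at the threshold someone picked.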
Unfortunately, monitoring can miss some problems entirely, since it’s impossible to know ahead of time all the possible things that could go wrong. Also, it often catches the measurable symptom of the problem (latency spike, transaction rate change, etc.), but not its cause. This is why many problems take a long time to resolve. And by the time a symptom is noticed, it might be too late. This was the case at Stripe, where there were two underlying problems, one occurring three months and the other four days before the incident.
What if you could programmatically analyze logs and metrics at scale and use this to uncover critical software problems – even completely new ones? We have been working on this challenge for several years.
A first layer of machine learning discerns the structure of each event, extracts its metrics and variables, and builds a dictionary of unique “event types” (there are typically a few thousand in a complex application). The event type dictionary makes it possible for a second layer of machine learning to learn the normal patterns of each event type – and to uncover anomalous patterns. Examples of learnt patterns include event periodicity, frequency, correlations, severity, first occurrence, expected values and ranges of embedded metrics, etc.
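To make the first two layers concrete, here is a deliberately simplified illustration, not the actual system: variable parts of each log line (IPs, hex ids, numbers) are masked with regexes to recover its fixed template, i.e. its event type, and the embedded metric values can then be tracked per event type to learn an expected range. The masks, log lines, and min/max “learning” below are all hypothetical.

```python
import re
from collections import Counter

# Layer 1 (sketch): mask variable tokens to recover each line's template.
MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def event_type(line):
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

logs = [
    "connection from 10.0.0.1 took 42 ms",
    "connection from 10.0.0.7 took 9 ms",
    "cache miss for key 0xdeadbeef",
]

# Three distinct lines collapse into a dictionary of two event types:
# "connection from <IP> took <NUM> ms" and "cache miss for key <HEX>".
dictionary = Counter(event_type(line) for line in logs)

# Layer 2 (sketch): learn the observed range of an embedded metric,
# so a future value far outside [lo, hi] can be flagged as anomalous.
values = [int(m) for line in logs for m in re.findall(r"took (\d+) ms", line)]
lo, hi = min(values), max(values)
```

A real second layer would of course learn much richer patterns (periodicity, correlations, first occurrences), but the principle is the same: the event-type dictionary turns an unbounded stream of raw lines into a small set of trackable behaviors.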
The goal is to be able to reliably detect critical problems without any human intervention. And, just as importantly, to do so without generating “alarm fatigue”. To accomplish this, we implemented a third layer of machine learning. It turns out that just finding anomalous patterns is not enough, because some pathological problems impact many services and therefore generate a lot of anomalies. The third layer learns the patterns across anomalies and finds the “leading event(s)” that caused the condition.
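As a simplified illustration of that third layer (the data structure, time window, and anomaly storm below are hypothetical, not our production logic): one can group anomalies that burst together in time and surface only the earliest in each burst as the candidate leading event, rather than alerting on every anomaly separately.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    timestamp: float  # seconds since start of incident
    service: str
    event: str

def leading_events(anomalies, window=60.0):
    """Group anomalies separated by less than `window` seconds and
    return the first anomaly of each group as the candidate cause."""
    leaders, prev_t = [], None
    for a in sorted(anomalies, key=lambda a: a.timestamp):
        if prev_t is None or a.timestamp - prev_t > window:
            leaders.append(a)
        prev_t = a.timestamp
    return leaders

# A cascade across three services: many anomalies, one burst.
storm = [
    Anomaly(12.0, "web", "error rate spike"),
    Anomaly(0.0, "db", "first occurrence: connection pool exhausted"),
    Anomaly(5.0, "api", "latency outside learned range"),
]
# Only the earliest anomaly (the db event) is reported as the leading event.
```

Collapsing the storm to a single leading event is what keeps the alert count low enough to avoid fatigue while still pointing at the cause rather than the symptoms.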
Real-life testing across almost 50 different applications shows that this technique is able to reliably uncover critical problems. In some cases, the system was able to find the leading edge of a problem long before its symptoms would otherwise have been detected.
The system has been in test for several months and is now in beta for Kubernetes users. Getting started takes less than five minutes and it's free for up to 1GB a day.