Kubernetes makes it easy to deploy, manage and scale large distributed applications. But what happens when something goes wrong with an app? And how do you even know? We hear variations on this all the time: “It was only when customers started complaining that we realized our service had degraded”, “A simple authentication problem stopped customers logging in. It took six hours to resolve.”, and so on.
The underlying challenge is the sheer complexity of a distributed environment, where the number of possible interactions is almost unbounded. Most DevOps teams struggle with two questions:
How do you know if something critical is broken?
Monitoring tools can generate alerts when critical thresholds are reached. This can be an effective way of catching problem symptoms (like increased latency or a rising number of dropped sessions). But thresholds are often reached well after the root cause of a problem has occurred. And knowing a symptom is present does not mean you know why it occurred or how to resolve it.
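To make the limitation concrete, here is a minimal sketch of threshold-based alerting. All names and the threshold value are illustrative, not taken from any particular monitoring tool:

```python
# Minimal sketch of threshold-based alerting: an alert fires only once a
# metric crosses a fixed limit. The threshold and function names are
# illustrative assumptions, not from a real monitoring product.
from statistics import mean

LATENCY_THRESHOLD_MS = 500  # alert when average latency exceeds this


def check_latency(samples_ms: list) -> bool:
    """Return True if the average of a window of samples breaches the threshold."""
    return mean(samples_ms) > LATENCY_THRESHOLD_MS


# By the time the second call fires, the root cause may be long past.
print(check_latency([120, 140, 135]))  # healthy window -> False
print(check_latency([480, 650, 720]))  # symptom already visible -> True
```

The alert tells you latency is high, but nothing in it explains *why*, which is exactly the gap described above.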
How do you know what caused a problem?
Logs, together with metrics, contain the most complete record of what happened. But traditional techniques of “search and drill-down” break down due to the sheer volume of messy log data from multiple containers, pods, nodes and clusters. This is why it often takes an “all-hands-on-deck” approach – where DevOps and engineering spend countless hours hunting through all the data to piece together what happened.
The Zebrium approach – powered by machine learning
It turns out that, when it comes to finding and determining the cause of critical software problems in Kubernetes deployments, machine learning can do a better job than humans and traditional logging/monitoring tools. And it can do so without manual effort. The technology works by monitoring logs and metrics in real time, learning their normal patterns and correlations, and then uncovering anomalous patterns to reliably find critical software problems. Details of the underlying technology can be found here, here and here.
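The core idea – learn what log patterns are normal, then surface what deviates – can be sketched in a few lines. This toy example is an illustration of the general technique only, not Zebrium's actual algorithm; the masking rule and sample log lines are assumptions:

```python
# Toy sketch of log anomaly detection: learn which log "templates" occur
# during a normal baseline window, then flag lines whose template was
# never seen. This illustrates the general idea, not Zebrium's algorithm.
import re
from collections import Counter


def template(line: str) -> str:
    """Reduce a log line to a rough pattern by masking numbers/hex values."""
    return re.sub(r"0x[0-9a-f]+|\d+", "<*>", line.lower())


def learn(baseline: list) -> Counter:
    """Count how often each template appears in the baseline window."""
    return Counter(template(line) for line in baseline)


def anomalies(normal: Counter, new_lines: list) -> list:
    """Return lines whose template never appeared during the baseline."""
    return [line for line in new_lines if normal[template(line)] == 0]


baseline = [
    "GET /api/items 200 in 12ms",
    "GET /api/items 200 in 9ms",
    "worker 3 heartbeat ok",
]
normal = learn(baseline)
print(anomalies(normal, [
    "GET /api/items 200 in 11ms",        # matches a learned pattern
    "panic: connection pool exhausted",  # never seen before -> flagged
]))
# -> ['panic: connection pool exhausted']
```

A production system would of course need far more: robust template mining, frequency and timing models, and correlation across services – but the learn-then-flag structure is the same.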
The benefits of this approach are:
- It automatically finds critical software problems without any human supervision
- It detects problems earlier by catching the “leading edge” rather than symptoms
- It speeds up time to resolution by taking you straight to the specific events that relate to root cause
- It reduces noise by detecting related problems (such as cascading errors in multiple microservices) and only alerting on root cause
Real-life results
During testing with early users, we collected real-world data for 100+ incidents from 30+ unique application stacks. In 56% of cases we were able to detect the actual problem and pinpoint its leading edge – completely unaided. We also expect to see continual improvement as our models learn from more data.
Try it now