We believe the future of monitoring, especially for platforms like Kubernetes, is truly autonomous. Cloud native applications are increasingly distributed, evolving faster and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and searches through logs are reactive and time intensive, hurting productivity, the user experience and MTTR. We believe machine learning can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution.
What do you imagine when you see "Anomaly Detection"?
When you think about anomaly detection, you probably visualize it for metrics: detection of outlier values like peaks, dropouts, or other deviations from normal. In the realm of application monitoring, metrics are scheduled measurements of a large set (thousands) of system health attributes – such as CPU, memory, latency and throughput. So metrics anomaly detection can be a useful tool to detect application health incidents, with the metric anomalies serving as symptoms of the incident. There are limitations, one of which is that metric anomaly detection can be noisy – requiring some curation, adjustments for seasonality and trendlines, plus thoughtful algorithm selection and tuning.
Anomaly detection applied to logs is very different
Log events are generated synchronously with the execution of specific software paths. This makes them incredibly granular (microsecond or even finer resolution). They are also rich: since log events are only generated when specific conditions are encountered, they can selectively output data with almost unbounded cardinality (such as labels and IDs). Most significantly, events provide the best indication of causality. Where metrics measure aggregate symptoms of the application, log events are closely linked to specific code paths or error conditions in the software. For all of the above reasons, logs are an invaluable trove of information, so troubleshooting invariably involves digging through logs to find the root cause of an incident.
What if you didn’t have to do this reactively? Why couldn’t anomaly detection also be applied to events, with the goal of detecting highly unusual patterns of code execution, or rare errors or conditions? In other words, patterns that are diagnostic of a software problem, an infrastructure issue that impacts the application, or even a security incident. This is a bit harder to conceptualize than anomaly detection for metrics, but here is how it works.
Learn what to track
Metrics are explicitly tagged with labels and IDs – so it is clear what is being measured. Unfortunately, the link between a specific log event and the corresponding line of code is not explicit – most log events don’t contain references to source code, and they are typically unstructured, free form outputs coded by developers to help them troubleshoot. As a result, many of them look similar to the human eye because they contain similar keywords or strings.
Luckily machine learning can do far better than a human in this regard – it only needs to see a few variants of each message type to fully extract the fixed and variable parts of each message type – rapidly learning all the unique message types. This essentially constitutes the “dictionary” of all unique event types generated by the application stack – all that’s missing is the corresponding line # from the source code.
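To make the idea of fixed and variable parts concrete, here is a minimal sketch in Python. It is not the machine learning described above: it swaps in a simple token-similarity heuristic, and the function names, the `<*>` placeholder and the 0.6 threshold are all assumptions chosen for illustration. Messages with the same number of tokens and enough tokens in common are merged into one event type, and the positions where they differ become wildcards.

```python
WILDCARD = "<*>"  # hypothetical placeholder for the variable part of a message


def similarity(template, tokens):
    """Fraction of positions where the template matches (wildcards always match)."""
    same = sum(1 for t, tok in zip(template, tokens) if t == tok or t == WILDCARD)
    return same / len(tokens)


def merge(template, tokens):
    """Keep tokens that agree; turn positions that differ into wildcards."""
    return [t if t == tok else WILDCARD for t, tok in zip(template, tokens)]


def learn_event_types(messages, threshold=0.6):
    """Build a small 'dictionary' of event types from raw log messages.

    threshold is an assumed tuning knob, not a value from the article.
    """
    templates = []  # each template is a list of tokens
    for message in messages:
        tokens = message.split()
        if not tokens:
            continue
        best_index, best_score = None, 0.0
        for i, template in enumerate(templates):
            if len(template) != len(tokens):
                continue  # only compare messages with the same token count
            score = similarity(template, tokens)
            if score > best_score:
                best_index, best_score = i, score
        if best_index is not None and best_score >= threshold:
            templates[best_index] = merge(templates[best_index], tokens)
        else:
            templates.append(tokens)  # first sighting of a new event type
    return [" ".join(t) for t in templates]


logs = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 87 ms",
    "Cache miss for key 4f3a9c",
    "Cache miss for key 99ab01",
    "Heartbeat ok",
]
event_types = learn_event_types(logs)
for event_type in event_types:
    print(event_type)
```

Even this crude version collapses the five messages into three event types after seeing just two variants of each. A production system would need to handle messages of varying length and far messier formats.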
Note that this event type dictionary is not as big as you might think – an entire Atlassian suite has fewer than 1,000 unique event types.
Learn the normal, detect the abnormal
Once we’ve assembled this foundational dictionary of event types, another layer of ML learns the normal patterns of each event type. This includes things such as its frequency, periodicity, severity, and even the values of metrics embedded within each event type. Now when a log event breaks pattern significantly, it is anomalous.
Particularly important variants include the first occurrence of a very rare event, and the sudden stoppage of a normal event (e.g. a system heartbeat).
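As a rough illustration of learning the normal and flagging the abnormal, the sketch below tracks only one of the properties mentioned above, frequency, using counts per time window. The function names, the z-score test, the 3.0 threshold and the floor of 1.0 on the standard deviation are assumptions made for this example, not a description of any particular product. It flags the two important variants called out above (a never-before-seen event type and the stoppage of an event that was always present) plus large changes in rate.

```python
from collections import Counter
from statistics import mean, pstdev


def learn_baseline(windows):
    """windows: list of Counters mapping event type -> count in one time window."""
    event_types = set().union(*windows)
    baseline = {}
    for event_type in event_types:
        counts = [w.get(event_type, 0) for w in windows]
        baseline[event_type] = {
            "mean": mean(counts),
            "std": pstdev(counts),
            "always_present": all(c > 0 for c in counts),
        }
    return baseline


def detect(baseline, window, z_threshold=3.0):
    """Return a sorted list of (event type, reason) pairs for one new window."""
    anomalies = []
    for event_type in window:
        if event_type not in baseline:
            anomalies.append((event_type, "first occurrence"))
    for event_type, stats in baseline.items():
        count = window.get(event_type, 0)
        if count == 0 and stats["always_present"]:
            anomalies.append((event_type, "stopped"))
            continue
        # Floor the spread at 1.0 (an assumption) so perfectly steady events
        # do not trigger on a change of a single message.
        z = abs(count - stats["mean"]) / max(stats["std"], 1.0)
        if z > z_threshold:
            anomalies.append((event_type, "rate change"))
    return sorted(anomalies)


history = [
    Counter({"heartbeat": 6, "cache miss": 10}),
    Counter({"heartbeat": 6, "cache miss": 12}),
    Counter({"heartbeat": 6, "cache miss": 11}),
]
baseline = learn_baseline(history)

quiet_failure = Counter({"cache miss": 11, "disk failure": 1})
print(detect(baseline, quiet_failure))

burst = Counter({"heartbeat": 6, "cache miss": 40})
print(detect(baseline, burst))
```

Periodicity, severity and the values embedded in each event could be modeled in the same spirit, each with its own learned baseline.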
Increase the contrast between signal and noise
In practice you can’t stop there – most enterprise applications have dozens of services, with hundreds if not thousands of instances (many of them ephemeral), scaling operations and frequent updates. Good anomaly detection can be very selective – picking out the one in a million event that is truly anomalous. However, one in a million would still be too many things to focus on in an environment that generates billions of messages a day. Once again, machine learning to the rescue – it takes advantage of the fact that a single anomaly in one event type is rarely alert worthy – but when a tightly clustered group of anomalies pops up across multiple event streams – that IS almost always alert worthy. What constitutes an “unusually tight cluster” depends on the specific deployment of an application, so it needs to be learned on the fly.
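The clustering idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: anomalies arrive as hypothetical (timestamp, stream) pairs, and the `max_gap` and `min_streams` parameters are fixed by hand here, whereas the point above is that a real system has to learn what counts as unusually tight for each deployment.

```python
def cluster_anomalies(anomalies, max_gap=30.0, min_streams=2):
    """Group anomalies that occur close together in time.

    anomalies: iterable of (timestamp_seconds, stream_name) tuples.
    A group becomes an incident only if it spans at least min_streams
    distinct streams. Both parameters are assumed, hand-picked values.
    """
    incidents = []
    current = []

    def close_group():
        # Keep the group only if several different streams misbehaved together.
        if current and len({stream for _, stream in current}) >= min_streams:
            incidents.append(list(current))

    for ts, stream in sorted(anomalies):
        if current and ts - current[-1][0] > max_gap:
            close_group()
            current.clear()
        current.append((ts, stream))
    close_group()
    return incidents


anomalies = [
    (100.0, "api"),
    (105.0, "db"),
    (112.0, "ingress"),
    (900.0, "api"),    # isolated anomaly: ignored
    (2000.0, "db"),
    (2010.0, "db"),    # tight, but only one stream: ignored
]
incidents = cluster_anomalies(anomalies)
for incident in incidents:
    print(incident)
```

Of the six anomalies, only the three that land within seconds of each other across different streams survive as an incident, which is the kind of noise reduction described above.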
See a complete narrative
This type of anomaly clustering doesn’t just improve signal to noise by several orders of magnitude. It also constructs an automatic summary of the incident – picking out the sequence of events that fully describes it.
For instance, the following incident was autonomously created in response to a chaos test (pod-delete) – notice how the ML picked out the full sequence of events, from the beginning of the actual chaos test.
Side note: one might wonder if an even more complete narrative couldn’t include metric anomalies as well – correlating causes and symptoms, and potentially further improving signal to noise? Stay tuned for more on this…
Done right, log anomaly detection can enable autonomous incident creation, making it an incredibly powerful pillar of a monitoring strategy. It complements metrics-based monitoring, detects latent problems before they impact metrics, and reduces MTTR by automatically surfacing the event sequence that describes an incident. But doing it right means understanding unique event types, learning their patterns and correlations, and detecting anomaly clusters with good signal to noise. And for this to be practical, all of this has to work without extensive configuration, manual tuning, or impractical training windows.