Monitoring today puts far too much burden on DevOps and developers. These teams spend countless hours staring at dashboards, hunting through logs, and maintaining fragile alert rules. Fortunately, unsupervised machine learning can be applied to logs and metrics to autonomously detect and find the root cause of critical incidents.
If you don't have time to read the full whitepaper now, please read our blog: The Future of Monitoring is Autonomous.
Modern applications are evolving faster, becoming increasingly distributed and failing in new ways, making it harder to monitor, troubleshoot and resolve incidents. Traditional approaches such as dashboards, carefully tuned alert rules and log searches are reactive and time intensive, hurting productivity and Mean-Time-To-Resolution (MTTR).
Machine learning (ML) can do much better – detecting anomalous patterns automatically, creating highly diagnostic incident alerts and shortening time to resolution – without requiring human supervision or configuration. But simple approaches to anomaly detection will not get us there. This white paper describes some of the limitations of traditional approaches and discusses newer approaches that are able to achieve significantly better results.
Logs and metrics are the two most common sources of data for detecting and troubleshooting application problems. Important metrics are typically tracked via dashboards, with alerts used selectively to generate incidents when certain “symptom” metrics deviate from their healthy range. Anomaly detection can improve on static metric thresholds, for example by using forecast or outlier models that take seasonality and other variations into account.
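As an illustrative sketch of the seasonality idea (not any vendor's actual implementation), an outlier check can compare each sample against the same phase of earlier cycles rather than against one static threshold. The function name and the tolerance floor below are assumptions for the example:

```python
from statistics import mean, stdev

def seasonal_outliers(values, period, z=3.0):
    """Flag points that deviate strongly from the mean of the same
    phase (e.g. same hour of day) in earlier seasonal periods."""
    anomalies = []
    for i, v in enumerate(values):
        history = values[i % period : i : period]  # same phase, earlier periods
        if len(history) < 2:
            continue  # not enough seasonal history yet
        mu, sigma = mean(history), stdev(history)
        # Tolerance floor handles perfectly flat history (sigma == 0).
        if abs(v - mu) > max(z * sigma, 1.0):
            anomalies.append(i)
    return anomalies

# A metric with a repeating pattern (period = 4 samples) and one spike.
series = [10, 50, 80, 30] * 5 + [10, 50, 300, 30]
print(seasonal_outliers(series, period=4))  # → [22] (the 300 spike)
```

A static threshold above the series maximum would miss smaller deviations at normally quiet phases; the seasonal baseline catches them because it compares like with like.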
However, traditional implementations still require a human to curate which metrics will be tracked via anomaly detection and pick the best models to avoid false positives. And when an incident is detected, DevOps, SREs or engineers still need to manually drill down into other metrics and eventually into logs to determine root cause.
Historically, logs have been collected and used mostly for reactive troubleshooting of issues found by other monitoring and alerting tools. This is because even though logs typically record the source of truth during an incident, they’re too vast and noisy to easily be used for incident alerting.
This is even more problematic since the sources and number of logs being generated are going up exponentially as we move towards cloud, container and microservices architectures, making it even more difficult to search logs and find the root cause during an incident.
We are already spending significant effort and resources to send and collect all our logs in a central log management solution that simply stores and indexes them for search. However, beyond that, today’s log management solutions don’t empower users to easily detect unknown incidents before they affect customers or minimize downtime by helping users find the root cause quickly.
Fortunately, new approaches that use machine learning are being developed to solve this problem. They automatically detect log and metrics patterns to catch incidents and correlate them with the root cause. These approaches are built to work at scale and can finally turn our logs and metrics into a more proactive monitoring solution that not only automatically detects incidents we were not looking for, but can also take us directly to the root cause when an incident occurs.
Because many vendors are jumping on the AI bandwagon, it can be hard to cut through the noise and understand which solutions are viable and genuinely work in the real world. This paper explains the approach Zebrium has taken to develop machine learning for logs and metrics, and why it presents a superior approach over existing methods, resulting in higher accuracy of unsupervised incident detection and also the ability to correlate incidents to root cause.
As mentioned above, metrics anomaly detection can be a useful tool to detect application health incidents, with the metrics anomalies serving as symptoms of the incident.
Traditional time series anomaly detection is designed to track handpicked metrics, using carefully curated approaches such as closest neighbor, local outlier factor (LOF), or moving average (ARMA) based outliers. This can help catch problems in critical user-facing metrics, but it has downsides: the metrics must be handpicked, the models require careful tuning to avoid false positives, and failure modes no one anticipated go undetected.
The Zebrium approach is very different. Instead of requiring a user to handpick the right metrics and the right outlier algorithms, it takes advantage of the fact that when software incidents occur, they almost never impact just one metric. For example, memory contention on a node will impact multiple containers. Similarly, network bottlenecks can impact latency for many operations which show up in multiple metrics.
So the Zebrium approach works by detecting unusually correlated pockets of metric anomalies. This achieves superior results to other approaches that focus on catching just one anomaly in one metric.
The Zebrium approach has several advantages: it can act on all metrics, not just a handpicked few. And it no longer requires extensive curation or tuning of the algorithms – just detecting new max, min, plateaus and sharp changes is enough. It also catches completely new failure modes and does a great job at filtering out the noise by picking out the hotspots of correlated anomalies that are indicators of real problems.
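The hotspot idea can be sketched as follows (a simplified illustration, not Zebrium's actual model): bucket per-metric anomalies into time windows and flag only the windows in which several distinct metrics are anomalous together. The function and parameter names are hypothetical:

```python
def anomaly_hotspots(anomalies, window, min_metrics=3):
    """anomalies: list of (metric_name, timestamp) pairs.
    Returns window-start times where at least `min_metrics`
    distinct metrics were anomalous in the same window."""
    buckets = {}
    for metric, ts in anomalies:
        buckets.setdefault(ts // window, set()).add(metric)
    return sorted(b * window for b, metrics in buckets.items()
                  if len(metrics) >= min_metrics)

# Isolated anomalies are treated as noise; three different metrics
# misbehaving around t=100 form a hotspot worth alerting on.
events = [("cpu", 12), ("latency", 101), ("cpu", 103),
          ("mem", 105), ("errors", 400)]
print(anomaly_hotspots(events, window=10))  # → [100]
```

The design choice is that no single anomaly is trusted on its own: only the coincidence of anomalies across metrics is treated as a signal.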
Finally, and most importantly, this approach can be made even more effective by correlating these metric anomaly hotspots with anomalies detected in log events. This has the added benefit of also highlighting details of root cause to assist with incident resolution.
ML in general uses statistical models to make predictions. For monitoring logs, a useful prediction would be the ability to classify whether a particular log event, or set of events, are causing a real incident that requires action to resolve. Another useful prediction would be to correlate an incident to the root cause so users can easily rectify the issue.
In ML, the more data available, the more accurate the model will usually be at making predictions, which is why models tend to become more accurate over time. However, this creates two challenges. First, it leads to a long lead time to value: the system requires several days or weeks of data before it can serve accurate predictions without raising false alerts (also referred to as “false positives”).
Second, and worse, slow-learning ML is not very useful when the behavior of the application itself keeps changing, for example because frequent updates are being deployed for each of its microservices. If accuracy is poor, we will eventually start ignoring the model as it generates too many spammy alerts.
There are also two main approaches for training ML models on existing data: supervised and unsupervised. Supervised training requires a labelled data set, usually produced manually by humans, to help the model understand the cause and effect of the data. For example, we may label all log events that relate to a real incident so the model will recognize that incident again if it sees the same log events or pattern.
As you can imagine, this can take a lot of effort, especially considering the millions of potential failure modes a complex software service can exhibit. The alternative is unsupervised training, in which the model tries to figure out patterns and correlations in the data set by itself; these can then be used to serve predictions.
The challenge with applying ML to logs, however, is that every environment is different. Although some common third-party services may be shared between environments (e.g. open source components like MySQL, NGINX, Kubernetes, etc.), there will likely also be custom applications that are unique to a particular environment and generate a unique stream of logs and patterns.
This means that any approach that needs to be trained on an environment’s specific data will not work unless the other environments run the same components. In addition, unless we want to invest a lot of resources and time for humans to accurately label the data, the models must be able to train unsupervised.
Another challenge is that any ML approach needs to become accurate at predictions quickly and with limited data, so the user isn’t waiting days or weeks for accurate alerts to be generated.
With these challenges in mind, we need an ML solution that can train quickly on a relatively small dataset and do this unsupervised, to ultimately generate accurate incident predictions across unique environments, and keep learning as an application continually evolves.
While there have been a lot of academic papers on the subject, the approaches typically fall into two categories which are explained below:
This category refers to algorithms designed to detect anomalous patterns in string-based data. Two popular models in this category are linear Support Vector Machines (SVM) and Random Forest.
Using SVM as an example, it classifies the probability that certain words in a log line are correlated with an incident. Some words such as “error” or “unsuccessful” may correlate with an incident and receive a higher probability score than other words such as “successful” or “connected”. The combined score of the message is used to detect an issue.
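A toy illustration of that scoring idea follows, with hand-picked weights standing in for what a trained linear SVM would actually learn from labelled data; the weights, threshold and function name are all assumptions for the example:

```python
# Hypothetical per-word weights: positive values correlate with incidents.
# A real linear SVM would learn these from a labelled training set.
WEIGHTS = {"error": 2.0, "unsuccessful": 1.5, "timeout": 1.8,
           "successful": -1.0, "connected": -0.8}

def score_line(line, threshold=1.5):
    """Linear score of a log line: sum of per-word weights, as a
    stand-in for a trained linear model's decision function."""
    score = sum(WEIGHTS.get(w.strip(".,:").lower(), 0.0)
                for w in line.split())
    return score, score > threshold

print(score_line("ERROR: connection unsuccessful, timeout after 30s"))
print(score_line("connection successful"))
```

The first line scores well above the threshold and would be flagged; the second scores negative and would not.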
Both SVM and Random Forest models use supervised learning and require a lot of data to serve accurate predictions. As discussed earlier, unless we are only running common third-party software, for which we can collect and label plenty of common log samples for training, this approach will not work well: in a new environment running bespoke custom software, the models would need to be trained on a large labelled data set drawn from the log samples generated by that specific environment.
These approaches also attempt anomaly detection on the raw log event messages. This may work for individual log events but is far too noisy to detect only real incidents. When incidents occur, we need to detect pattern changes across the entire log set, not hunt for issues in individual log events.
Deep learning is a very powerful form of ML, and is often what people mean by Artificial Intelligence (AI). By training neural networks on large volumes of data, deep learning can find patterns in data, but it is generally used with supervised training on labeled datasets. It has been applied to hard problems such as image and speech recognition with great results.
One of the best academic papers on this approach is the DeepLog paper from the University of Utah, which uses deep learning to detect anomalies in logs. Interestingly, the authors also applied ML to parse logs into event types, which is similar to Zebrium’s approach discussed later and significantly improves the accuracy of anomaly detection.
The challenge with this approach, again, is that it requires a large volume of data to become accurate. This means new environments will take longer before they can serve accurate predictions, and smaller environments may never produce enough data for the model to become accurate enough.
Deep learning also has a problem the statistical algorithms above do not: it is very compute intensive to train. Many data scientists run expensive GPU instances to train models faster, but at significant cost. If we needed to train the model on every unique environment individually, and continuously over time, this would be an extremely expensive way to detect incidents autonomously, so this approach is not recommended for monitoring logs in environments running custom software.
While there are many ways to find interesting anomalies in logs, if we want to detect real incidents accurately, we need to understand the structure of an incident. Generally, an incident creates a change of pattern in several log events over a short window of time. This in turn means we need to reliably detect changes in event patterns, which means we first need to be able to precisely distinguish unique event types. Therefore, we need a multi-layer approach for detecting incidents, with each layer operating completely unsupervised:
The foundational step is to automatically parse the structure of the raw logs and turn them into a dictionary of distinct event types. This uses machine learning and can start creating the dictionary of event types after only seeing a few hundred lines from a log. It’s also smart enough to adapt to changing event structures – e.g. when a developer adds a parameter to an event or changes the spelling of a word. It automatically identifies any event changes and updates the dictionary to reflect the changes over time.
This step essentially normalizes billions of raw log events into only a few thousand unique event types. This means we can now see the patterns of particular event types in the logs. Does an event type have a heartbeat and repeat regularly? Does an event correlate in a similar pattern every time another event is triggered? Etc.
It also automatically recognizes and parses fields in a log message, so we can also perform anomaly detection on field values. This has the added benefit of allowing us to run SQL analytics queries across all logs to build reports on specific fields without having to write a single parsing rule.
A deeper explanation of how Zebrium’s log parsing works can be found here. Essentially, because the frequency of different log event messages can vary widely, a range of algorithms is applied to automatically recognize and parse log messages into unique event types. Some algorithms work well for low-frequency messages, and some work best for high-frequency messages; Zebrium automatically selects the best algorithm and adapts to the frequency of the messages in the logs.
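As a rough, heavily simplified stand-in for this kind of structure learning (not Zebrium's algorithms), the core idea of collapsing raw lines into event types can be shown by masking variable-looking tokens such as numbers and hex identifiers:

```python
import re

# Tokens that look like variables: integers, dotted numbers (IPs,
# versions), or hex identifiers. Real structure learning is far
# more sophisticated than a single regex.
VARIABLE = re.compile(r"\b(?:\d+(?:\.\d+)*|0x[0-9a-fA-F]+)\b")

def event_type(line):
    """Collapse a raw log line into a template by masking variables."""
    return VARIABLE.sub("<*>", line)

lines = [
    "connection from 10.0.0.1 port 51234 closed",
    "connection from 10.0.0.7 port 9443 closed",
    "disk /dev/sda1 usage at 91 percent",
]
templates = {event_type(l) for l in lines}
print(sorted(templates))
# → ['connection from <*> port <*> closed',
#    'disk /dev/sda1 usage at <*> percent']
```

Three raw lines collapse to two templates; at scale, this is how billions of raw events normalize down to a few thousand event types whose rates and patterns can then be modeled.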
Only once the logs are normalized, can we then start applying anomaly detection algorithms to each event type.
For each event type, a statistical method called a point process is used to model characteristics such as the rate, periodicity and frequency with which the event type appears in the logs.
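A heavily simplified sketch of rate-based detection in this spirit follows, assuming event counts behave roughly like a Poisson process (so the variance of a window's count approximates its mean). This is an illustration under those assumptions, not Zebrium's model:

```python
from math import sqrt

def rate_anomaly(timestamps, window, now, z=3.0):
    """Compare the event count in the latest window against the
    long-run rate, treating counts as roughly Poisson (variance
    ~= mean). Flags both bursts and a stopped heartbeat (a count
    far below the expected rate)."""
    history = [t for t in timestamps if t < now - window]
    recent = sum(1 for t in timestamps if now - window <= t < now)
    if not history:
        return False  # no baseline rate yet
    span = (now - window) - min(history)
    expected = len(history) * window / span if span > 0 else 0
    return abs(recent - expected) > z * sqrt(max(expected, 1.0))

# A heartbeat event every 10s that stops at t=290: zero events in
# the last 100s window is anomalous against an expected count of 10.
beats = list(range(0, 300, 10))
print(rate_anomaly(beats, window=100, now=400))  # → True (heartbeat stopped)
print(rate_anomaly(beats, window=100, now=300))  # → False (rate is normal)
```

Note that modeling the rate per event type is what makes the "heartbeat stopped" case detectable at all: the absence of a log line carries no signal unless something has learned that the line normally recurs.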
Because it recognizes patterns for specific event types, and not the raw log messages themselves, it also allows Zebrium to detect anomalous patterns like a regular event (heartbeat) stopping. For example, in this Stripe incident two database nodes stopped reporting replication metrics four days prior to an outage, but the problem went undetected until a failover event triggered an outage.
This step will produce a stream of anomalies for each event type which can then be used by the next step to detect patterns across the anomalies which would indicate an incident.
Logs are vast and noisy. There will always be anomalies in logs, and alerting on individual anomalies would create a lot of spammy alerts for users. Therefore, once we detect anomalies for each event type, we need another layer of machine learning to detect correlated patterns of anomalies that indicate an actual incident.
Usually in an incident, a change of log pattern occurs in multiple places (e.g. across containers or across different parts of an application), and Zebrium uses another model to detect these correlated patterns of anomalies and determine whether they indicate a real incident. A lot of work has also gone into suppressing false positives to ensure false incidents are not raised to the user.
Once an incident is detected, the correlated log and metric anomalies are linked together into an incident which then alerts the end user to the issue.
Finally, Zebrium packages the incident into an alert that links to a specific incident page containing all the details of the incident – the sequence of anomalous log events, including the likely root cause indicator and highlighting of the “worst” symptom. In addition, the incident summary contains any anomalous metrics that correlated with log anomalies. The page also provides feedback mechanisms for a user to tell Zebrium whether the incident was real or a false positive, enabling the models to improve over time.
Zebrium’s approach has now been tested across over 1,000 real world incidents, in dozens of diverse custom, open source and commercial applications. It has proven to be highly accurate at not only incident detection, but also identification of the likely root cause of each incident – often saving hours of time. It starts catching incidents within the first hour, and accuracy continues to improve over the following days. It typically generates only a handful of incidents a day for a production environment (more for very dynamic dev/staging environments). The false positive rate is good even out of the gate (typically under one third), and quickly improves with user feedback (if an incident type is marked as unimportant, a similar incident will never generate an alert).
Also, because of the multi-faceted approach of applying different models to different events based on their occurrence, the models can generally start detecting incidents within a couple of hours of Zebrium receiving logs from a new environment.
Finally, because the parsing model automatically parses and extracts fields for every log event it receives, users gain the added benefit of being able to apply analytics across all of their logs without any additional configuration.
As the complexity of software systems and the volume of logs and metrics substantially increase, it is inevitable that a machine learning solution will be required to detect incident patterns and find root causes. Without ML, users will be forced to manually set up brittle alert rules that overwhelm them with false positives or miss incidents entirely as an environment changes over time. Without ML, users will have to spend countless hours manually scanning charts and searching large volumes of logs for root causes during an incident. This will inevitably lead to more downtime, slower MTTR, more customer churn and ultimately revenue loss as increasingly complex software services become unreliable.
This paper reviewed why applying ML to logs and metrics is hard, and why existing approaches have failed to detect real-life incidents well and at scale. It also demonstrated how Zebrium’s novel approach to the problem enables it to achieve a much higher accuracy rate without human supervision.
The Zebrium platform is available to try for free today. Our users are seeing outstanding results. And the incident detection accuracy of our ML models is continually improving as we receive more data to enable our vision of Autonomous Monitoring. “Autonomous” means being completely zero touch in configuration and ongoing usage. We believe users should simply be able to send their logs and metrics to Zebrium, and without any other configuration, have our system automatically detect real incidents and correlate them with root cause.
If you also believe that the only sustainable future is for monitoring to be Autonomous, please try our platform for free here. Getting started takes less than 2 minutes.