In our last blog we discussed the need for Autonomous Monitoring solutions to help developers and operations users keep increasingly large and complex distributed applications up and running.
Although Autonomous Monitoring includes all three pillars of observability (metrics, traces and logs), at Zebrium we have started with logs (but stay tuned for more). This is because logs generally represent the most comprehensive source of truth during incidents, and are widely used to search for the root cause. Log management and log monitoring is also an area we feel hasn’t evolved much in the past 20 years. Most log solutions are still designed around “aggregate, index and search”. And they are mostly used reactively by skilled users who manually search for the root cause.
The main reason logging tools haven’t evolved much in the past two decades is that using Machine Learning (ML) with logs is hard. Logs are incredibly vast, noisy and mostly unstructured. To date, ML work in the log space has been either purely academic or limited to detecting basic anomalies that are noisy and don’t easily roll up into real incidents that users need to know about.
This blog series will go into detail on how Zebrium has taken a unique approach to applying machine learning to logs. To understand why that approach is superior, our story starts with the approaches that have been tried previously.
Machine learning for logs
Machine Learning (ML) uses statistical models to make predictions. For monitoring logs, a useful prediction would be the ability to classify whether a particular log event, or set of events, is causing a real incident that requires action to resolve. Another useful prediction would be to correlate an incident with its root cause so users can easily rectify the issue.
In ML, the more data available, the more accurate the model usually is at making predictions, which is why models tend to become more accurate over time. However, this poses two challenges. First, it leads to a long lead time to value: the system requires several days or weeks of data before it can serve accurate predictions without raising false alerts (also referred to as “false positives”).
Second, and worse, slow-learning ML is not very useful when the behavior of the application itself keeps changing, for example because frequent updates are being deployed for each of its microservices. If accuracy is poor, users will eventually start ignoring the model because it generates too many spammy alerts.
There are also two main approaches for training ML models on existing data: supervised and unsupervised. Supervised training requires a labelled data set, usually produced manually by humans, to help the model understand the cause and effect of the data. For example, we may label all log events that relate to a real incident so the model will recognize that incident again if it sees the same log events or pattern.
As you can imagine, this can take a lot of effort, especially considering the millions of potential failure modes complex software services can generate. Therefore, another approach used to train ML models is unsupervised training. In this approach, the model tries to figure out patterns and correlations in the data set by itself, which can then be used to serve predictions.
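The difference between the two training styles can be illustrated with a deliberately tiny sketch. It is not a real ML model: the supervised side simply remembers messages a human labelled as incidents, and the unsupervised side only counts how often each message occurs. The log lines, function names and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical labelled log lines, invented purely for illustration.
LABELLED = [
    ("connection to db established", False),
    ("connection to db refused", True),
    ("request served", False),
]

def train_supervised(examples):
    """Remember which exact messages a human labelled as incidents."""
    return {message for message, is_incident in examples if is_incident}

def predict_supervised(known_incidents, message):
    """Flag a message only if it matches a previously labelled incident."""
    return message in known_incidents

def train_unsupervised(messages):
    """No labels needed: just learn how often each message occurs."""
    return Counter(messages)

def predict_unsupervised(counts, message, min_count=2):
    """Flag a message as unusual if it was seen fewer than min_count times."""
    return counts[message] < min_count
```

The supervised version recognizes "connection to db refused" again but stays silent on a failure nobody has labelled yet, such as "disk quota exceeded". The unsupervised version flags that same unseen message as unusual without any labelling effort, at the price of not knowing whether it is actually an incident.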
The challenge with using ML with logs, however, is that every environment is different. Although there may be some common third-party services shared between environments (e.g. open source components like MySQL, NGINX, Kubernetes, etc.), there will likely also be custom applications that are unique to a particular environment and generate a unique stream of logs and patterns.
This means that a model trained on one environment’s specific data will not work in other environments unless they run the same components. In addition, unless we want to invest a lot of resources and time for humans to accurately label the data, the models must be able to train unsupervised.
Another challenge is that any ML approach needs to become accurate at predictions quickly and with limited data, so that the user isn’t waiting days or weeks for accurate alerts to be generated.
With these challenges in mind, we need an ML solution that can train quickly on a relatively small dataset and do this unsupervised, to ultimately generate accurate incident predictions across unique environments, and keep learning as an application continually evolves.
Existing Approaches & Challenges
While there have been a lot of academic papers on the subject, the approaches typically fall into two categories which are explained below:
Generalized Algorithms
This category refers to algorithms that have been designed to detect anomalous patterns in string-based data. Two popular models in this category are Linear Support Vector Machines (SVM) and Random Forest.
Taking SVM as an example, the model scores how strongly certain words in a log line correlate with an incident. Some words such as “error” or “unsuccessful” may correlate with an incident and receive a higher score than other words such as “successful” or “connected”. The combined score of the message is then used to detect an issue.
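The word-scoring idea can be sketched as a simple linear scorer. This is only an illustration: the weights and bias below are hand-picked, hypothetical values standing in for what a trained linear model (such as a linear SVM) might learn from labelled log lines, and no training is performed here.

```python
# Hand-picked, hypothetical weights standing in for what a trained linear
# model might learn; positive values push towards "incident".
WORD_WEIGHTS = {
    "error": 1.5,
    "unsuccessful": 1.2,
    "successful": -1.0,
    "connected": -0.8,
}
BIAS = -0.5  # hypothetical offset; a total score above 0 means "likely incident"

def score(message):
    """Sum the weights of the known words in a log message, plus the bias."""
    words = (word.strip(".,:;!") for word in message.lower().split())
    return BIAS + sum(WORD_WEIGHTS.get(word, 0.0) for word in words)

def is_incident(message):
    """Classify a single log message from its combined word score."""
    return score(message) > 0

print(is_incident("Error: login unsuccessful"))  # True
print(is_incident("client connected"))           # False
print(is_incident("error count is 0"))           # True, although the line is benign
```

The last call shows how a per-message word score can flag a harmless line simply because it contains a heavily weighted word.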
Both SVM and Random Forest models use supervised learning for training and require a lot of data to serve accurate predictions. As discussed earlier, this only works well when we are running common third-party software, where a lot of common log samples can be collected and labelled for training. It will not work well in new environments running bespoke custom software, because the models would need to be trained on a large labelled data set built from the log samples generated by that specific environment.
These approaches also try to do anomaly detection on the raw log event messages. This may work well for individual log events, but it is far too noisy to surface only real incidents. When incidents occur, we need to detect pattern changes across the entire log set rather than look for issues in individual log events.
Deep Learning
Deep learning is a very powerful form of ML, and is often referred to as Artificial Intelligence (AI). By training neural networks on large volumes of data, deep learning can find patterns in that data, but it is generally used with supervised training on labelled datasets. AI has been applied to hard problems such as image and speech recognition with great results.
One of the best academic papers for this approach is the DeepLog paper from the University of Utah, which uses deep learning to detect anomalies in logs. Interestingly, the authors have also applied ML to parse logs into event types, which is similar to Zebrium’s approach discussed later, as this significantly improves the accuracy of the anomaly detection.
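To make the idea of parsing logs into event types and then looking for anomalies over those types more concrete, here is a deliberately simple sketch. It is not the method from the paper: a regular expression stands in for the log parser, and a table of previously observed event-to-event transitions stands in for the neural network. All function names and sample lines are hypothetical.

```python
import re
from collections import defaultdict

def event_type(line):
    """Reduce a raw log line to an event type by masking numeric fields."""
    return re.sub(r"\b\d+(\.\d+)*\b", "<NUM>", line).strip()

def learn_transitions(lines):
    """Record which event types were seen to follow each event type."""
    types = [event_type(line) for line in lines]
    seen = defaultdict(set)
    for current, following in zip(types, types[1:]):
        seen[current].add(following)
    return seen

def unexpected_transitions(seen, lines):
    """Return (previous, current) event-type pairs never seen in training."""
    types = [event_type(line) for line in lines]
    return [
        (prev, cur)
        for prev, cur in zip(types, types[1:])
        if cur not in seen.get(prev, set())
    ]

training = ["job 1 started", "job 1 finished", "job 2 started", "job 2 finished"]
model = learn_transitions(training)
print(unexpected_transitions(model, ["job 9 started", "job 9 finished"]))  # []
print(unexpected_transitions(model, ["job 3 started", "job 4 started"]))
```

Because the numeric fields are masked, "job 9 started" maps to the same event type as the training lines, so the normal sequence raises nothing. Two "started" events in a row form a transition that was never observed, so the second call reports it.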
Again, the challenge with this approach is that it requires a large volume of data to become accurate. This means new environments will take longer before they can serve accurate predictions, and smaller environments may never produce enough data for the model to become accurate.
Unlike the statistical algorithms discussed previously, Deep Learning is also very compute intensive to train. Many data scientists run expensive GPU instances to train models more quickly, but at significant cost. If we need to train the model on every unique environment individually, and continuously over time, this would be an extremely expensive way to detect incidents autonomously. This approach is therefore not recommended for monitoring logs in environments running custom software.
Some vendors have trained deep learning algorithms on common third-party services (e.g. MySQL, NGINX, etc.). This approach can work because they can use a large volume of publicly available datasets and error modes to train the model, and the trained model can then be deployed to all their users. However, no environment runs only these third-party services; each also has custom software that runs nowhere else. This approach is therefore limited to discovering incidents in third-party services, and not in the custom software running in the environment itself.
Taking A Different Approach
As discussed above, generalized algorithms and deep learning, as they have been applied up until now, have too many limitations to provide a truly autonomous, unsupervised log monitoring solution.
Instead of applying a single approach, Zebrium has taken a multi-layer approach: first parsing the logs into normalized events, then running anomaly detection across every single event, and finally detecting changes in patterns that may indicate an incident is occurring.
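As a rough illustration of how such layers can fit together, the sketch below chains three toy steps. It is not Zebrium's implementation, which later posts describe: the regex normalization, the rarity rule, and the `window`, `min_rare` and `max_seen` parameters are all assumptions made for this example.

```python
import re
from collections import Counter

def normalize(line):
    """Layer 1: mask numbers so lines with the same shape share one event type."""
    return re.sub(r"\d+", "<NUM>", line).strip()

def rare_events(baseline_lines, new_lines, max_seen=0):
    """Layer 2: flag each new line whose event type appeared at most
    max_seen times in the baseline."""
    counts = Counter(normalize(line) for line in baseline_lines)
    return [counts[normalize(line)] <= max_seen for line in new_lines]

def incident_windows(flags, window=5, min_rare=3):
    """Layer 3: return start indexes of windows holding at least min_rare
    rare events, i.e. a cluster of anomalies rather than a single one."""
    last_start = max(len(flags) - window + 1, 1)
    return [
        start
        for start in range(last_start)
        if sum(flags[start:start + window]) >= min_rare
    ]

baseline = ["request 1 ok"] * 20
quiet = ["request 21 ok", "disk 3 failing", "request 22 ok",
         "request 23 ok", "request 24 ok", "request 25 ok"]
noisy = ["request 30 ok", "disk 3 failing", "raid degraded",
         "write to vol 2 failed", "request 31 ok"]
print(incident_windows(rare_events(baseline, quiet)))  # []
print(incident_windows(rare_events(baseline, noisy)))  # [0]
```

A single rare line in the `quiet` stream is flagged at the event level but does not amount to an incident, while three rare events close together in the `noisy` stream do.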
The next blogs will go into detail on each step of this process and discuss how it helps to achieve accurate autonomous incident detection and root cause identification.