Zebrium’s technology finds the root cause of software problems by using machine learning (ML) to analyze logs. The majority of our customers stream their application and infrastructure logs to our platform for near real-time analysis. However, a new use case has emerged: using our ML to analyze a collection of static logs. This is particularly relevant for technical support teams who collect “bundles” of logs from their customers after a problem has occurred. The results we’re seeing are nothing short of spectacular!
Zebrium’s ML learns the normal patterns of all event types within the logs, and generates root cause reports when clusters of correlated anomalies and errors are detected. This is a great help for SREs, DevOps engineers and developers who are often under pressure to root cause problems as they happen.
However, there are many scenarios where you may not have a continuous stream of logs:
- Technical support engineers often need to troubleshoot applications or devices installed in end-user environments, and only receive log file bundles when a problem occurs.
- Some logs might be located on endpoints (e.g. device logs or logs from your collaboration client or security client) and are only collected when troubleshooting a problem.
- Security policies sometimes prohibit streaming logs.
- Many users put Zebrium to the test by uploading a collection of logs from a particularly complex historical incident. In fact, some of our largest sales have occurred after a customer saw just how quickly Zebrium can uncover details of the root cause.
Challenges With Finding the Root Cause in Log Files
Some of the challenges of log troubleshooting apply to log files as much as to log streams. Simple approaches like manually searching for errors are quickly overwhelmed by the sheer volume of errors and warnings generated by typical software. Moreover, errors tend to be symptoms, so such an approach is rarely sufficient to identify the root cause. Some teams try to build a library of “health checks” based on experience with prior problems, but this approach does not scale as new issues are encountered weekly (and old ones mutate because of changes in software behavior or log formats).
More sophisticated approaches use a tool to try to identify anomalies in the logs. However, most attempts at anomaly detection fall short because the results are too noisy or they miss the most important “rare” events entirely. Anomaly detection is challenging because logs are text-centric and unstructured (or at best loosely structured), so it is very hard to classify them well enough that rare events can be picked up reliably. Finally, even finding anomalies and errors still does not identify the correlations (cause and effect) between them, which is an important aspect of constructing a root cause timeline.
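To make the classification problem concrete, here is a minimal Python sketch of one naive way to group log lines into event types and surface the rare ones. It is an illustration only and not a description of Zebrium's ML: the masking patterns, the function names `event_type` and `rare_event_types`, and the sample log lines are all assumptions made up for this example.

```python
import re
from collections import Counter

# Mask variable-looking tokens so lines that differ only in their values
# collapse into the same event type. These patterns are illustrative only.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\d+"), "<num>"),
]


def event_type(line: str) -> str:
    """Return a crude template for a log line by masking variable parts."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def rare_event_types(lines, max_count=1):
    """Return event types seen at most max_count times in the given lines."""
    counts = Counter(event_type(line) for line in lines)
    return [etype for etype, n in counts.items() if n <= max_count]


# Hypothetical sample lines, invented for this sketch.
logs = [
    "conn from 10.0.0.1 port 5001 ok",
    "conn from 10.0.0.2 port 5002 ok",
    "conn from 10.0.0.3 port 5003 ok",
    "disk /dev/sda1 remount read-only after 3 errors",
]

print(rare_event_types(logs))
```

Even this toy version shows why the problem is hard: hand-written masks break on formats they were not written for, and a plain frequency threshold becomes noisy when the logs cover only a short history.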
Log files have the added challenge of containing limited history. Typically a set of logs might contain a few hours (or at best a few days) of history before the problem occurred – making it even harder to know what is normal and what is anomalous.
How Zebrium’s ML Works with Log Files
With experience from thousands of incidents and a wide spectrum of software environments under our belt, we have refined our ML to the point where it can learn from a short history, pick out anomalies and correlate them with related events. Ideally the logs cover 24 hours or more, but even a few hours is enough for the ML to find correlated clusters of anomalies and errors and produce useful root cause analysis (RCA) reports.
To effectively support uploading log file bundles, we enhanced our file upload APIs to allow users to annotate log files with arbitrary metadata such as process/host name, incident ID or software version. This allows the machine learning to find correlations across different components and makes it easy for users to navigate through large volumes of logs. In addition, we added support for logs with historical dates, so root cause reports are no longer limited to recent timestamps. We can also support exports from other log managers (such as Elastic or Splunk).
How Effective is Using our ML on Log Files?
Support engineers from multiple large companies have validated the ML against log file bundles from a wide array of products – including security software, networking software, networking devices and collaboration clients. The testing involved uploading a set of log files and verifying that Zebrium’s root cause reports contained the exact log events a skilled human would consider to be the best explanation of root cause for that particular incident. Across all products tested, Zebrium’s ML achieved accuracy rates well north of 90%.
Killer Feature for Support: Slashing Resolution Time for Repeat Problems
When Zebrium creates a new root cause report, it builds a fingerprint for that type of problem (we call this an incident type). A user can then associate additional metadata (such as a title, URL or description) with an incident type. Now, when the same issue occurs again (e.g. a customer sends a support bundle for a known problem), it will be automatically detected and any user annotations/links will be immediately visible. But even better, using our webhooks and alerting, a repeat problem can trigger an automated remediation workflow.
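The idea of matching a repeat problem to stored annotations can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Zebrium's fingerprinting algorithm, which this post does not describe: the sketch assumes a fingerprint is simply a hash of the set of event types in a report, and the names `fingerprint`, `annotate`, `lookup` and `known_incidents`, as well as the sample events and URL, are hypothetical.

```python
import hashlib


def fingerprint(event_types):
    """Hash the de-duplicated, sorted event types into a stable identifier."""
    canonical = "\n".join(sorted(set(event_types)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical in-memory store: fingerprint -> user-supplied annotations.
known_incidents = {}


def annotate(event_types, title, url):
    """Attach a title and link to the incident type behind these events."""
    known_incidents[fingerprint(event_types)] = {"title": title, "url": url}


def lookup(event_types):
    """Return stored annotations for a matching incident type, or None."""
    return known_incidents.get(fingerprint(event_types))


# First occurrence: a user annotates the incident type.
first = ["db connection pool exhausted", "request timeout after <num> ms"]
annotate(first, title="DB pool exhaustion", url="https://example.com/kb/123")

# A later bundle with the same event types, in a different order and with
# repeats, resolves to the same annotations.
repeat = [
    "request timeout after <num> ms",
    "db connection pool exhausted",
    "db connection pool exhausted",
]
match = lookup(repeat)
print(match["title"] if match else "new incident type")
```

An exact hash like this only matches when the set of event types is identical, so a practical system would need a more tolerant notion of similarity.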
The machine learning will also recognize if a developer ever changes the log format of an event that is relevant to a “fingerprint”, and will auto-update these fingerprints so all the existing annotations/rules carry forward. Contrast this with a traditional regex-based health-check rule, which would simply stop working without any warning.
Zebrium’s ML lets you automatically find the root cause of software problems from your logs. With our latest release, we also cater to support engineers and anyone else who needs to analyze a set of static log files. Try it for yourself!
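The failure mode of a regex-based health check is easy to reproduce. The Python sketch below uses an invented rule and invented log lines (both are assumptions for illustration) to show how a format change makes the rule return nothing, with no error to indicate that it has stopped matching.

```python
import re

# Hypothetical health-check rule written against the original log format.
HEALTH_CHECK = re.compile(r"ERROR: disk (\S+) is full")


def check(lines):
    """Return the devices reported full by lines matching the rule."""
    devices = []
    for line in lines:
        found = HEALTH_CHECK.search(line)
        if found:
            devices.append(found.group(1))
    return devices


# The same condition logged before and after a developer changed the format.
old_format = ["2023-01-05 ERROR: disk /dev/sda1 is full"]
new_format = ['2023-06-11 level=error msg="disk full" device=/dev/sda1']

print(check(old_format))  # ['/dev/sda1']
print(check(new_format))  # [] (the rule silently stops matching)
```

Nothing in the second result distinguishes "no problem found" from "rule no longer applies", which is why such rules tend to decay unnoticed.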