Logs are the source of truth when trying to uncover latent problems in a software system. They are usually too messy and voluminous to analyze proactively, so they are used mostly for reactive troubleshooting once a problem is known to have occurred.
There are two problems with this:
- You miss a chance to detect problems early. Logs often capture latent problems before they get bad enough to impact application or system-wide metrics – e.g. rare errors, retries and restarts, memory leaks and OOM messages.
- Once you do know something is wrong, the root cause is often not clear because many problems have similar symptoms that are visible in the logs. So how do you know what to fix to resolve the problem?
When trying to find the root cause, the most common analysis involves searching logs for specific errors, messages or keywords based on human intuition and experience. As a result, a lot of attention is given to improving this search experience: better indexing of the logs, richer query languages, more structured logs, bigger scale and faster queries.
But the fundamental problem remains – how do you know what to look for under time pressure? Sometimes it can feel like asking a Magic 8 Ball.
What if there were a better way? Experienced engineers know that when things go wrong with software, signs show up in the logs. Signs can include rare messages suddenly showing up. Or message patterns deviating from their normal frequency, periodicity or severity. A special case of this might be the sudden disappearance of events that are usually seen when things are healthy. Couldn’t there be an automated way to discover the types of log anomalies that an experienced engineer will usually spot with careful inspection?
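To make those signs concrete, here is a minimal sketch of how the three of them (new message types, unusual frequency, and missing events) could be flagged by comparing one time window against a healthy baseline. It assumes each log line has already been reduced to an event-type label. The function name `find_anomalies`, the `rate_factor` threshold and the sample labels are all hypothetical, and a simple count comparison like this is only an illustration, not how any particular product works.

```python
from collections import Counter

def find_anomalies(baseline_windows, current_window, rate_factor=3.0):
    """Compare one window of event types against healthy baseline windows.

    baseline_windows: list of lists of event-type labels, one list per window.
    current_window:   list of event-type labels for the window under test.
    rate_factor:      hypothetical threshold; a count above rate_factor times
                      the baseline mean is reported as a spike.
    """
    n = len(baseline_windows)
    totals = Counter()
    for window in baseline_windows:
        totals.update(window)
    # Mean occurrences per window for every event type seen in the baseline.
    mean = {etype: count / n for etype, count in totals.items()}

    current = Counter(current_window)
    anomalies = []
    for etype, count in current.items():
        if etype not in mean:
            anomalies.append((etype, "new"))      # never seen when healthy
        elif count > rate_factor * mean[etype]:
            anomalies.append((etype, "spike"))    # far more frequent than usual
    for etype, avg in mean.items():
        # Events that normally appear at least once per window but are absent.
        if avg >= 1.0 and current[etype] == 0:
            anomalies.append((etype, "missing"))
    return sorted(anomalies)

baseline = [
    ["heartbeat", "request", "request"],
    ["heartbeat", "request", "request", "retry"],
    ["heartbeat", "request", "request"],
]
current = ["request", "request", "retry", "retry", "retry", "oom_killed"]
print(find_anomalies(baseline, current))
```

In this made-up example the heartbeat has disappeared, an OOM message has shown up for the first time, and retries are well above their usual rate. Real logs would also need periodicity and severity checks, which this sketch leaves out.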
It turns out that if done right, machine learning can accomplish this. It can not only discover anomalous patterns in logs, but also learn correlations between them, figure out which anomalous patterns belong to a single incident, and even identify the likely root cause (as opposed to symptoms). And since doing it right involves learning unique message types and their structure, this improves your search experience in two ways – it gives you a very clear starting point to look around, and a far more powerful toolset for drilling down than text search.
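As a rough illustration of what "learning message types and their structure" can mean, the sketch below groups raw log lines by a template in which the variable parts are masked out. This is a deliberately naive stand-in that uses a hand-written regular expression rather than machine learning, and the names `to_template` and `group_by_template`, along with the sample lines, are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical rule: treat hex literals, integers and dotted numbers
# (such as IPv4 addresses) as the variable parts of a message.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")

def to_template(line):
    """Replace variable-looking tokens with a placeholder to get a message type."""
    return _VARIABLE.sub("<*>", line)

def group_by_template(lines):
    """Map each derived template to the raw lines that produced it."""
    groups = defaultdict(list)
    for line in lines:
        groups[to_template(line)].append(line)
    return dict(groups)

lines = [
    "Connection from 10.0.0.5 closed after 30 ms",
    "Connection from 10.0.0.9 closed after 1200 ms",
    "Worker 7 restarted",
]
for template, members in group_by_template(lines).items():
    print(len(members), template)
```

Once lines are grouped this way, each template can be counted and tracked over time as a single event type, which is what makes frequency and rarity checks practical, and the masked fields become values you can filter on instead of free text.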
The Zebrium approach does just this – it can deploy and learn message types and structures from the entire application stack within a few minutes, learn normal patterns quickly, and on day zero correctly start to identify the root cause of over two-thirds of software incidents – with zero configuration, training or supervision. And this rate continues to improve thanks to user feedback. Read more about how it works in this white paper.