Observability means being able to infer the internal state of your system from knowledge of its external outputs. For all but the simplest applications, it’s widely accepted that software observability requires a combination of metrics, traces and events (e.g. logs). On that last point, a growing chorus of voices strongly advocates structuring log events up front. Why? To pick a few reasons: without structure you find yourself dealing with the pain of unwieldy text indexes, fragile and hard-to-maintain regexes, and reactive searches. You’re also impaired in your ability to understand patterns like multi-line events (e.g. stack traces), or to correlate events with metrics (e.g. by transaction ID).
But perhaps the most important reason to understand event structure is that it enables anomaly detection that really works. And accurate anomaly detection surfaces “unknown unknowns”, a key goal of observability. As applications get more distributed and deployments more frequent, it gets harder for humans to monitor using dashboards and carefully tuned alert rules. And even once you become aware of an incident that impacts your application, it takes time and hunting to get to its root cause. This is because there is always a large and growing set of failure modes you don’t know in advance. Good anomaly detection (i.e. high signal to noise, with few false positives) can not only help you catch these, but also give you an indication of root cause.
There are approaches that attempt anomaly detection on unstructured events. But in practice this is very hard to do – or at least very hard to do well enough to rely on day to day. There are two basic problems without structure: lack of context, and an explosion in cardinality.
Here’s an example of lack of context. Say your anomaly detection tracks counts of keywords such as “FAIL” and “SUCCESS”. The former is bad, the latter good. But it’s not unusual to see messages of the type: “XYZ task did not complete successfully”, quite the opposite of the expected meaning. So just matching keywords would prove unreliable.
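To make this concrete, here is a minimal sketch of the problem. The classifier and log lines are invented for illustration; the point is simply that a bare keyword match ignores negation and context:

```python
def classify_by_keyword(line: str) -> str:
    """Naive classifier: looks only for keywords, ignoring context."""
    upper = line.upper()
    if "FAIL" in upper:
        return "bad"
    if "SUCCESS" in upper:
        return "good"
    return "unknown"

lines = [
    "Backup job finished successfully",
    "XYZ task did not complete successfully",  # actually a failure!
]
for line in lines:
    print(classify_by_keyword(line), "->", line)
# The second line is labelled "good" even though it describes a failure.
```

The negated failure message sails straight through the keyword check, which is exactly why counts of “SUCCESS” and “FAIL” make an unreliable signal.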
It’s even trickier with common words such as “at”. In the right context this keyword is extremely diagnostic – for example, as the first word of lines 2 through N of a multi-line stack trace:

Exception in thread "main" java.lang.NullPointerException
    at package.SomeClass.someMethod(SomeFile.java:lineNumber)
    at package.OtherClass.otherMethod(OtherFile.java:lineNumber)

An approach that understands event structure would recognize this pattern uniquely, and the location of “at” in the multi-line log sequence would be diagnostically useful – for example, identifying a Java exception. But without structure, the keyword “at” is far too common, and a simplistic attempt based on keyword matches alone would generate so much noise as to be completely useless. So, lack of context is one reason keyword matches make a weak foundation for useful anomaly detection.
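A structure-aware check can be sketched in a few lines. This is not our actual parser – the regex and the class/file names are hypothetical – but it shows how “at” becomes diagnostic only when it opens a line shaped like a Java stack frame that follows an exception header:

```python
import re

# Matches a Java stack-frame line such as
#   "    at com.example.Widget.render(Widget.java:42)"
FRAME = re.compile(r"^\s*at [\w$.]+\([\w$.]+(?::\d+)?\)$")

def is_java_stack_trace(lines):
    """True if lines look like an exception header plus >= 1 frame lines."""
    if not lines or "Exception" not in lines[0]:
        return False
    return len(lines) > 1 and all(FRAME.match(l) for l in lines[1:])

event = [
    'Exception in thread "main" java.lang.NullPointerException',
    "    at com.example.Widget.render(Widget.java:42)",
    "    at com.example.Main.main(Main.java:7)",
]
print(is_java_stack_trace(event))                                  # True
print(is_java_stack_trace(["Logged in at 10:30 at the gateway"]))  # False
```

The second call shows the flip side: “at” appearing twice in an ordinary message contributes nothing, because it is the position and shape of the line, not the keyword itself, that carries the signal.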
The other issue is cardinality. As discussed in an earlier blog, our software uses machine learning to automatically distil tens of billions of unstructured log lines down to a much smaller set of perfectly structured event types (with typed variables tracked in associated columns). For example, our entire Atlassian suite has an “event dictionary” of just over 800 unique event types. As a result, it’s easy for us to learn the normal frequency, periodicity and severity of every single event type, and highlight anomalous patterns in any one of them. A very effective way of finding the unknown unknowns. If on the other hand we were trying to detect anomalies in arbitrary keyword combinations or clusters, the cardinality is orders of magnitude higher, making it impractical to do this reliably, at least within practical time and resource constraints.
Does this work in practice? You bet! In real-world testing across dozens of applications, our approach has proven very accurate at detecting anomalies. For one thing, it generates very little noise: on the order of one anomaly per million events in typical environments. And that’s just the start – a solid foundation of anomaly detection on structured events also enables reliable incident detection by spotting correlated anomalies across microservices. This further sharpens signal-to-noise by several orders of magnitude.
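The correlation idea can be illustrated with a toy sketch. The services, timestamps, and windowing scheme here are all hypothetical – the point is only that a burst of anomalies from several services inside one time window is far less likely to be noise than a lone anomaly:

```python
from collections import defaultdict

# Hypothetical per-service anomalies as (epoch_minute, service) pairs.
anomalies = [
    (100, "auth"), (101, "payments"), (101, "checkout"),  # clustered burst
    (245, "auth"),                                        # isolated blip
]

def incidents(events, window=5, min_services=2):
    """Bucket anomalies into time windows; keep windows spanning
    at least `min_services` distinct services."""
    buckets = defaultdict(set)
    for minute, service in events:
        buckets[minute // window].add(service)
    return {w: s for w, s in buckets.items() if len(s) >= min_services}

print(incidents(anomalies))  # only the three-service burst survives
```

The isolated blip is filtered out, while the multi-service burst is promoted to an incident candidate – which is how correlation multiplies the signal-to-noise of the underlying per-event anomaly detection.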
Finally, we’ve proven that the diagnostic “signal” value of such anomaly detection is high – over 2/3 of the time, our anomaly detection accurately picks out the root cause of an incident, saving time and reducing MTTR.
[Update Feb 2020: Subsequent to writing this blog, Zebrium now uses the anomaly “signal” described above and leverages an additional layer of machine learning to uncover clustered hotspots of correlated anomalies across both logs and metrics. This provides extremely accurate software incident detection (see: Is Autonomous Monitoring the Anomaly Detection You Actually Wanted?).]
Our Autonomous Monitoring platform is available to try for free – please click here to get started.