A widely prevalent application monitoring strategy today is often described as “black box” monitoring: watching only externally visible symptoms, including those that approximate the user experience. Black box monitoring is a good way to know when things are broken.
An alternative approach, “white box” monitoring, also looks at the internal workings of the system to understand what is actually going on (and to catch failure modes). It is similar in concept to the more recent term “observability”: inferring the internal state of a system by observing a sufficiently complete set of signals.
Black box monitoring has one advantage that accounts for its prevalence: it keeps things simple. Because most complex applications have a vast number of possible failure modes, focusing on a well-chosen set of failure symptoms is far easier than trying to detect and distinguish between the myriad failure modes themselves.
On the other hand, many teams want to incorporate some white box monitoring into their strategy (see, e.g., the Google SRE book and “My Philosophy on Alerting”). This is because white box monitoring has two advantages of its own:
- It can pick up latent or imminent failures before things actually break: degraded services, memory leaks, buffer overflows, escalating retries, and so on. For most applications, knowing about problems earlier helps avoid or minimize user impact.
- And it can actually tell you what broke and what action to take. Many failure modes share similar external symptoms, so knowing the symptom alone doesn’t help you fix things; that requires further drill-down and troubleshooting.
When the end goal is to keep the application healthy (or at the very least to fix problems quickly), white box monitoring deserves serious consideration. The problem of internal complexity can be tackled with a well-designed set of machine learning tools.
Let’s start with the data. The most under-utilized signals describing the internals of an application are in the logs. Logs contain the richest record of application behavior, which is why most troubleshooting eventually involves spelunking through them. Unfortunately, they are usually messy, free-form text messages. That is fine for simple tasks like searching for keywords such as ERROR, but logs are generally not structured enough for reliable multi-event alerts, effective pattern learning, or anomaly detection that doesn’t generate a flood of spurious alerts. The good news is that the kind of brute-force, repetitive scanning and parsing a data engineer would do to structure logs is a perfect application for machine learning: automatically turning raw text into cleanly structured event tables, with parameters and metrics extracted, typed, and stored as columns.
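To make the idea of structuring concrete, here is a minimal sketch of template extraction: masking the variable parts of each line so that lines generated by the same code path collapse into one event type. The masking rules (IPs, hex ids, numbers) and sample logs are illustrative assumptions, not the actual product’s parser.

```python
import re
from collections import defaultdict

def template_of(line):
    """Reduce a raw log line to a template by masking variable parts.
    The masks here are illustrative: IP addresses, hex ids, numbers."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

logs = [
    "ERROR retry 3 for request 0x1f4a from 10.0.0.12",
    "ERROR retry 4 for request 0x2b91 from 10.0.0.17",
    "INFO cache refresh took 250 ms",
]

# Group raw lines by their extracted template (the "event dictionary").
events = defaultdict(list)
for raw in logs:
    events[template_of(raw)].append(raw)

# The two retry errors collapse into a single event type.
for tmpl, rows in events.items():
    print(len(rows), "x", tmpl)
```

In a real system the extracted `<NUM>` and `<HEX>` values would also be kept, typed, and stored as columns of the event table rather than discarded.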
The attraction of this type of structuring is that it turns a potentially unbounded pattern learning problem into the tractable problem of learning the behavior of a bounded number of unique log events. Even a complex application that generates billions of logs a day eventually distills down to a dictionary containing a few thousand unique message types. This makes it possible to reliably learn the “normal” pattern of each message type, so that when things break, anomalous patterns are readily detected via changes in frequency, periodicity, severity, and so on. (An interesting special case is when a normally occurring message stops occurring.)
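Once events are bounded, “learning the normal pattern” of each message type can be as simple as a per-template frequency model. The sketch below scores how far a current count deviates from history in standard deviations; the counts, window, and 3-sigma threshold are illustrative assumptions, and note how a count of zero (the message stopped occurring) scores as highly anomalous.

```python
def anomaly_score(history, current):
    """Crude per-template frequency check: how many standard deviations
    does `current` sit from the historical mean of this message type?"""
    n = len(history)
    mean = sum(history) / n
    var = sum((c - mean) ** 2 for c in history) / n
    std = var ** 0.5 or 1.0  # avoid division by zero on flat history
    return abs(current - mean) / std

# Hourly counts of one message type during "normal" operation (made up).
history = [100, 95, 102, 98, 101, 97, 103]

print(anomaly_score(history, 99))  # small: within the normal pattern
print(anomaly_score(history, 0))   # large: the message STOPPED occurring
```

A real detector would also model periodicity (daily/weekly cycles) and severity, but the same bounded-dictionary idea applies to each.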
This is a big step up in “learning” the inner workings of a complex machine, but it is not enough. A distributed application sees a lot of routine churn across dozens of services and hundreds of instances, with containers coming and going and services upgrading and restarting, none of which are actually problems. A refinement is to distinguish between a slightly unusual event and a truly severe error or highly unusual break in pattern. But even that could easily overwhelm a team with accurate yet practically useless alerts if it flagged every one of these “micro” disruptions.
The next step is to improve the signal-to-noise ratio by auto-detecting correlations between these anomalous events. As the Wilde quote goes: “To lose one parent may be regarded as a misfortune; to lose both looks like carelessness.” In that vein, a disruption in one micro-service might be planned or otherwise benign, but a pattern of highly anomalous events rippling across multiple services is far more likely to be a real problem than a harmless coincidence. Generating a single alert for such a scenario (anchored on the leading edge of the problem) does a good job of “opening the black box”, raising early awareness of a serious failure that may not yet be visible in the symptoms. It also reduces noise by recognizing and coalescing related issues into one failure warning.
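The correlation step above can be sketched as grouping anomalies that cluster in time, suppressing groups confined to a single service, and emitting one alert anchored on the earliest event. The two-minute window, the two-service threshold, and the sample anomalies are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    ts: float      # seconds since epoch
    service: str
    message: str

def coalesce(anomalies, window=120.0, min_services=2):
    """Group anomalies separated by gaps of at most `window` seconds;
    emit one alert per group, anchored on its earliest event, and only
    when the group spans at least `min_services` services."""
    def flush(group, alerts):
        if group and len({g.service for g in group}) >= min_services:
            alerts.append(group[0])  # leading edge of the incident
    alerts, group = [], []
    for a in sorted(anomalies, key=lambda a: a.ts):
        if group and a.ts - group[-1].ts > window:
            flush(group, alerts)
            group = []
        group.append(a)
    flush(group, alerts)
    return alerts

anomalies = [
    Anomaly(0, "auth", "token validation failures spiking"),
    Anomaly(30, "api", "5xx rate anomaly"),
    Anomaly(70, "billing", "retry storm"),
    Anomaly(5000, "cache", "planned restart"),  # isolated: suppressed
]
for a in coalesce(anomalies):
    print("ALERT:", a.service, "-", a.message)
```

The first three anomalies ripple across three services and coalesce into a single alert on the `auth` event; the isolated cache restart never fires.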
The above description isn’t hypothetical. It’s built into a SaaS service that, in testing across a wide range of application stacks, caught 56% of serious failure modes (and improving) while generating very little spurious noise. Some examples:
- Detected a critical authentication issue six hours before the APM and monitoring tools caught the problem.
- Detected a database restart in a stack trace that would have led to service disruptions.
- Caught an early indicator of an outage by detecting a crash of a critical service; a quick drill-down exposed a memory buffer overflow as the root cause.
- Detected a rare error while saving encryption keys that, left undetected, would have led to data loss.
- Auto-detected a spike in cache refresh time that impacted user experience (missed by the monitoring tool because the metric had not been instrumented).
It works in real time without big training data sets or human supervision (it usually learns within a couple of minutes of seeing data). It is now in beta and is free for up to 1 GB a day.