Monitoring is about catching unexpected changes in application behavior. Traditional monitoring tools achieve this through alert rules and by spotting outliers in dashboards. While this approach can typically catch failure modes with obvious, service-impacting symptoms, it has significant limitations.
It doesn't scale as application complexity grows and failure modes multiply. There are simply too many unknown failure modes, too much data to scan, and too many variants of log messages to search for. Some new entrants to the space focus on improving the speed, cost, or scalability of search. While welcome, this does nothing to address the real bottleneck: the human brain's ability to know what to search for in a given situation.
Many tools have started to offer add-on machine learning features to augment human effort. While these are a welcome addition, they still leave too much work for the human. Users are required to choose specific time series to track, pick the right anomaly detection technique for each one, and then analyze the resulting visualizations of outliers themselves.
This can be useful in simple cases, where the signals that matter are known in advance.
But anomaly detection alone is not enough to detect and root cause incidents that might impact a modern application, particularly the unknown unknowns or new failure modes that are encountered on an ongoing basis. This is because the nature of modern, dynamic applications means that there are always anomalies - sometimes hundreds or thousands each day. In reality, only a tiny fraction of these really matter - the ones that describe real software incidents. But traditional anomaly detection feature sets do not detect incidents or automatically characterize root cause.
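To see why "always anomalies" is the norm, consider a minimal sketch of classic time-series anomaly detection (not any vendor's actual algorithm): flag points whose rolling z-score exceeds a threshold. Even this simple 3-sigma rule fires on any sharp deviation, and across thousands of metrics in a dynamic application, such rules fire constantly.

```python
# Minimal sketch (illustrative only): flag points whose rolling z-score
# exceeds a threshold. Across many metrics, rules like this fire often,
# which is why raw anomaly counts overwhelm human operators.
import statistics

def zscore_anomalies(series, window=20, threshold=3.0):
    """Return indices where a value deviates more than `threshold`
    standard deviations from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        prior = series[i - window:i]
        mean = statistics.fmean(prior)
        stdev = statistics.pstdev(prior)
        if stdev > 0 and abs(series[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged

# A gently oscillating series with one genuine spike at index 25:
data = [10 + 0.1 * ((i % 5) - 2) for i in range(30)]
data[25] = 50.0
print(zscore_anomalies(data))  # → [25]
```

The catch is that every choice here (window size, threshold, which series to apply it to) is left to the user, and none of it says whether a flagged point is a real incident.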
There are other limitations. Most anomaly detection features only offer this toolset for time-series data (e.g. metrics). It is rare to see a meaningful anomaly detection solution for logs, let alone the ability to correlate log and metric anomalies. This makes it more time-consuming to root cause issues, and requires further drill-down such as log searches.
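For intuition, log anomaly detection in its simplest form can be sketched as flagging event types that have rarely or never been seen before (real systems must first structure raw messages into event types; that step is assumed here):

```python
# Minimal sketch (illustrative only): flag log event types that appear
# at most `max_count` times in the observed history. Assumes raw log
# lines have already been structured into event types.
from collections import Counter

def rare_event_types(event_types, max_count=1):
    """Return event types that appear at most `max_count` times."""
    counts = Counter(event_types)
    return sorted(t for t, n in counts.items() if n <= max_count)

history = ["conn_open"] * 500 + ["conn_close"] * 499 + ["oom_killer_invoked"]
print(rare_event_types(history))  # → ['oom_killer_invoked']
```

Even this toy version hints at the value: the rare event, not the millions of routine ones, is the thing worth a human's attention.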
By contrast, Zebrium took a very different approach, building machine learning (ML) based anomaly detection in as the foundation. Our software automatically detects anomalies in ALL log events and ALL metrics. We do all the work for the user: there is no need to handpick which log events or metrics to track, nor to select algorithms, corrections, and other controls. We automatically apply the best technique for each type of data and adapt the behavior as the application changes.
But this is not the special part. What's unique about this approach is that it automatically correlates related log and metric anomalies and groups them into incidents, each characterized with a root cause.
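The grouping idea can be sketched in a few lines (a hypothetical illustration, not Zebrium's actual method): cluster log and metric anomalies that occur close together in time into candidate incidents, so an operator sees one correlated event instead of scattered outliers.

```python
# Hypothetical sketch (not Zebrium's actual method): group anomalies
# that occur within `gap_seconds` of each other into candidate incidents.
def group_into_incidents(anomalies, gap_seconds=60):
    """anomalies: iterable of (timestamp, source, description) tuples.
    An anomaly within `gap_seconds` of the previous one joins its group."""
    incidents, current = [], []
    for ts, source, desc in sorted(anomalies):
        if current and ts - current[-1][0] > gap_seconds:
            incidents.append(current)
            current = []
        current.append((ts, source, desc))
    if current:
        incidents.append(current)
    return incidents

events = [
    (1000, "log",    "rare event: OOM killer invoked"),
    (1010, "metric", "memory usage spike"),
    (1025, "metric", "request latency spike"),
    (5000, "log",    "rare event: config reloaded"),
]
print([len(g) for g in group_into_incidents(events)])  # → [3, 1]
```

A production system would of course use far richer signals than timestamps alone, but the effect is the same: the first three anomalies surface as one incident whose earliest log event points at the root cause.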
This has two huge benefits: it surfaces only the small set of anomalies that describe real incidents rather than a flood of disconnected outliers, and it characterizes the likely root cause automatically, without manual drill-down or log searches.
This is why we believe Autonomous Monitoring, which includes autonomous Incident Recognition, is the future of monitoring. You can try it for yourself. Getting started is free and takes less than two minutes.