The Elastic Stack (often called the ELK Stack) is one of the most widely deployed observability, log management and log monitoring platforms. It typically comprises Beats, Elasticsearch, Logstash and Kibana (though there are many other variants). Now you can add Zebrium to the mix and automatically find root cause!
What is ZELK Stack?
You can think of ZELK Stack as “AIOps for Elastic Stack”. Without any complex config, manual alert rules or training, it uses machine learning to proactively catch software incidents and show you root cause right inside Kibana.
It’s different to other machine learning anomaly detection technologies because rather than producing dashboards of anomalies (which can be noisy and require significant time to interpret), it detects “incidents” and produces clear “Incident Reports” containing details of root cause. More on this later.
Why ZELK Stack?
Detecting and diagnosing root cause of software problems is a slow and tedious process, particularly for new (previously unseen) failure modes. Think about what a skilled SRE or developer typically does when troubleshooting an unknown or complex problem: First there’s hunting through logs and dashboards for familiar clues and error conditions. Once the familiar ones are exhausted, it’s about looking for new or rare errors, or event patterns that seem different from the norm. And finally, it might involve correlating event sequences across logs from different services, not to mention metrics. The goal is to pinpoint the event sequence that explains the problem from root cause to symptoms. It’s a painful process for even the most skilled operators.
Now imagine that instead, when something bad happens in your environment, you simply look at an Incident Dashboard in Kibana. And with a click you’re shown an “incident report” that contains the key log events and metric charts that explain exactly what happened (root cause and symptoms). This is what ZELK Stack brings to your ELK Stack!
ZELK Stack configuration only takes a few minutes (see docs). Just add a Logstash output to send logs to Zebrium and an optional input for Zebrium to send incident reports back to Logstash. The rest is automatic. And accurate incident detection occurs within the first day.
How Zebrium integrates with the Elastic Stack
Proactive incident detection
When Zebrium’s machine learning detects an incident (you can read more about what a Zebrium incident means here), it will automatically show up in an Elasticsearch incident index. Below, you can see a Kibana canvas that visualizes the auto-detected incidents. It’s important to point out that these incidents were created by our machine learning without any manual rules or training.
The buttons behave as follows:
- Detail – Drills down to a Kibana discover or logs view that shows root cause of the incident as a series of log events that make up the incident. Here’s an example of a view drill-down – note that our ML has picked out just seven correlated events from different log streams (out of millions) that explain what happened:
- Like, Mute and Spam – Provides feedback to Zebrium ML and customizes how future similar events will appear in the incident list.
- Launch in Zebrium – Lets you drill down on the incident inside the Zebrium UI. The result is similar to using Detail, except that the Zebrium interface offers some useful drill-down features that native Kibana does not have.
How does this compare to Elastic ML anomaly detection?
Elastic X-Pack supports ML anomaly detection (included in the Elastic Platinum pricing tier). In particular, it can find anomalous log rates based on overall ingest or based on log event categories. Categorization works by using a string similarity algorithm to group “similar” kinds of log messages. You can then set up machine learning jobs to look for rare or anomalous counts of categories within each user-defined “time bucket”.
The screen shot below illustrates how you can see where particular event categories reach anomalous counts.
This type of anomaly detection can be useful when looking for problems. However, it tends to produce noisy results (false positives) and requires significant human effort to find correlations and to understand details of root cause.
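To make the categorize-then-count idea described above more concrete, here is a minimal Python sketch. It is an illustration only and not Elastic's actual algorithm: it assumes a simple `difflib` similarity ratio for categorization and a z-score over per-bucket counts for the anomaly check. The function names, thresholds and bucket size are all hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from statistics import mean, pstdev


def categorize(messages, threshold=0.8):
    """Give each message the label of the first category whose exemplar is similar enough."""
    exemplars = []  # one representative message per category
    labels = []
    for msg in messages:
        for idx, exemplar in enumerate(exemplars):
            if SequenceMatcher(None, msg, exemplar).ratio() >= threshold:
                labels.append(idx)
                break
        else:
            # No existing category matched, so this message starts a new one.
            exemplars.append(msg)
            labels.append(len(exemplars) - 1)
    return labels, exemplars


def anomalous_buckets(events, bucket_seconds=60, z_threshold=2.0, similarity=0.8):
    """events: list of (timestamp_seconds, message).

    Returns [(bucket_start, category, count)] for buckets whose count for a
    category is at least z_threshold standard deviations above that category's mean.
    """
    labels, _ = categorize([msg for _, msg in events], similarity)
    counts = defaultdict(Counter)  # category -> {bucket_start: count}
    buckets = set()
    for (ts, _), label in zip(events, labels):
        start = int(ts // bucket_seconds) * bucket_seconds
        counts[label][start] += 1
        buckets.add(start)

    ordered = sorted(buckets)
    flagged = []
    for label, per_bucket in counts.items():
        series = [per_bucket.get(b, 0) for b in ordered]
        mu, sigma = mean(series), pstdev(series)
        if sigma == 0:
            continue  # perfectly steady category, nothing stands out
        for b, count in zip(ordered, series):
            if (count - mu) / sigma >= z_threshold:
                flagged.append((b, label, count))
    return flagged
```

Note that the sketch flags each category in isolation. It says nothing about whether two flagged categories are related, which is the kind of correlation work that is left to a human in this approach.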
Finds incidents and root cause, not anomalies
Zebrium uses a multi-layered machine learning approach (see How it Works) that first uses ML to structure, parse and categorize logs and metrics. Next it learns the patterns for each event type and scores each new incoming event based on how “anomalous” it is (it looks at frequency, periodicity, when it started, when it stopped, correlations, values of parameters, severity, etc.). The next layer of ML is the crucial one – it finds hotspots of abnormally correlated patterns across logs and metrics. This allows it to filter out the signal from the noise – to distinguish between “noisy anomalies” and “real incidents” with details of root cause.
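The layered idea above (score each event, then look for clusters of anomalous events across sources) can be illustrated with a toy Python sketch. This is not Zebrium's implementation, which the post does not describe in detail. The sketch assumes events are already categorized into types, uses only frequency-based rarity as the score, and treats a "hotspot" as a short time window containing high-scoring events from at least two streams. All names and thresholds are hypothetical.

```python
import math
from collections import Counter


def rarity_scores(history, new_events):
    """Score each new event by how rare its type was in the history.

    history: list of event-type strings seen previously.
    new_events: list of (timestamp, stream, event_type).
    Returns a list of (timestamp, stream, event_type, score); higher means rarer.
    """
    seen = Counter(history)
    total = sum(seen.values())
    scored = []
    for ts, stream, etype in new_events:
        # Add-one smoothing so never-seen types get a high but finite score.
        p = (seen[etype] + 1) / (total + 1)
        scored.append((ts, stream, etype, -math.log(p)))
    return scored


def find_hotspots(scored, window=30, min_score=3.0, min_streams=2):
    """Group anomalous events that cluster in time across several streams."""
    anomalies = sorted(e for e in scored if e[3] >= min_score)
    hotspots = []
    i = 0
    while i < len(anomalies):
        start = anomalies[i][0]
        group = [e for e in anomalies[i:] if e[0] - start <= window]
        if len({e[1] for e in group}) >= min_streams:
            hotspots.append(group)
            i += len(group)  # consume the whole cluster
        else:
            i += 1  # isolated anomaly: treated as noise
    return hotspots
```

In this sketch a rare event that appears alone in one stream is dropped as noise, while rare events from different streams that occur close together are reported as a single group. That mirrors, in a very simplified way, the distinction between "noisy anomalies" and "real incidents".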
If you're interested, this blog provides examples of incidents that our ML can uncover (note that the screen shots in the blog are from the Zebrium native UI, rather than from Kibana).
And here's a one minute video that shows it in action:
Try it for free – you’ll be amazed at what it catches
Zebrium ZELK Stack uses the same incident and root cause detection technology that has been proven across hundreds of different tech stacks (here are some customer examples). It’s easy to set up and free to try.
NOTES ON TRADEMARK USAGE
- Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
- Kibana is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
- Logstash is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
- Beats is a trademark of Elasticsearch BV.
- Elastic is a trademark of Elasticsearch BV.