
How It Works

ML-Powered Autonomous Root Cause Analysis


Logs and Metrics Go In, Incidents and Root Cause Come Out

Step 1 - Ingest and Categorization

Install our Fluentd log collector and, optionally, our Prometheus metrics collector, or fork a copy of your logs using Logstash. No parsers, code changes, rules or configuration are needed. Then let our Machine Learning (ML) take over!

Within minutes, our ML learns the structure of your logs and categorizes each event into a “dictionary” of unique event types. Categorization is crucial for accurately learning the patterns in your logs and metrics.
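To make the idea of an event-type dictionary concrete, here is a minimal sketch in Python. It is not Zebrium's ML: it assumes a simple technique in which variable-looking fields (IP addresses, hex values, numbers) are masked so that lines differing only in those fields collapse into one template. The masking rules and the names `event_template` and `build_dictionary` are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical masking rules (an assumption, not Zebrium's method):
# replace variable-looking tokens with placeholders. Order matters,
# so IP addresses are masked before bare numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def event_template(line: str) -> str:
    """Reduce a raw log line to a template by masking variable fields."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def build_dictionary(lines):
    """Group raw lines under the template (event type) they map to."""
    dictionary = defaultdict(list)
    for line in lines:
        dictionary[event_template(line)].append(line)
    return dict(dictionary)


logs = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 87 ms",
    "Disk usage at 91 percent",
]
event_types = build_dictionary(logs)
for template, members in event_types.items():
    print(f"{len(members)} x {template}")
```

In this sketch the two connection lines fall under a single event type, which is what allows later steps to track how often each type occurs rather than treating every line as unique.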

Log and metrics collector setup

Zebrium machine learning

Step 2 - Pattern and Anomaly Detection

Within the first hour, our ML learns the patterns of every log event and metric, and the learning continues to improve as more data is seen.

When log or metric patterns change (e.g. a change in periodicity or frequency, or a new or rare message starting to appear), our ML detects these as anomalies - but this alone is not enough. To separate signal from noise, it then looks for hotspots of abnormally correlated anomalies across both metrics and logs.
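The two stages described above, per-stream anomaly detection followed by a search for hotspots where anomalies coincide, can be sketched roughly as follows. This is an illustrative simplification, not Zebrium's algorithm: it assumes a plain deviation-from-baseline test on counts per time window, and it treats a window as a hotspot when at least two independent sources are anomalous in it. The function names, the threshold of 3, the spread floor of 1.0 and the sample data are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def frequency_anomalies(baseline, observed, threshold=3.0):
    """Return the indices of observed windows that deviate strongly
    from the baseline. The spread is floored at 1.0 (an assumption)
    so that a near-constant baseline does not flag tiny changes."""
    center = mean(baseline)
    spread = max(pstdev(baseline), 1.0)
    return {
        i for i, value in enumerate(observed)
        if abs(value - center) > threshold * spread
    }


def hotspots(anomalies_by_source, min_sources=2):
    """Return windows in which at least `min_sources` sources are anomalous."""
    votes = Counter()
    for windows in anomalies_by_source.values():
        votes.update(windows)
    return sorted(w for w, n in votes.items() if n >= min_sources)


# Made-up sample data: counts of one log event type and one latency
# metric, both bucketed into the same time windows.
anomalies = {
    "log:error_rate": frequency_anomalies(
        baseline=[2, 3, 2, 3, 2, 3], observed=[3, 2, 40, 3]
    ),
    "metric:latency_ms": frequency_anomalies(
        baseline=[100, 110, 90, 105, 95, 100], observed=[102, 98, 400, 150]
    ),
}
hot_windows = hotspots(anomalies)
print(hot_windows)
```

Here the latency metric is anomalous in two windows but only one of them coincides with a log anomaly, so only that window is reported. Requiring agreement across sources is what filters out isolated blips.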

Step 3 - Augment (optional)

If you use an Incident Management tool like PagerDuty, Opsgenie or Slack, or an existing log management or monitoring tool, Zebrium can augment any incident with a characterization of root cause.

When an incident occurs, a signal is sent to Zebrium; you can also trigger a signal from the Zebrium UI. Zebrium then finds any ML-incidents or sets of anomalous log/metric patterns that coincide with the signal, and automatically feeds the information back to your incident management tool.
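As a rough illustration of what "coincide with the signal" could mean, the sketch below selects the detected incidents whose first occurrence falls inside a time window around the signal. It does not use any Zebrium or PagerDuty API. The `MLIncident` record, the `incidents_for_signal` helper, and the 30-minute lookback and 5-minute lookahead are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MLIncident:
    """Hypothetical record of an ML-detected incident."""
    incident_id: str
    first_seen: datetime
    summary: str


def incidents_for_signal(
    signal_time,
    incidents,
    lookback=timedelta(minutes=30),  # assumed window before the signal
    lookahead=timedelta(minutes=5),  # assumed window after the signal
):
    """Return incidents first seen within the window around the signal."""
    start, end = signal_time - lookback, signal_time + lookahead
    return [i for i in incidents if start <= i.first_seen <= end]


detected = [
    MLIncident("inc-1", datetime(2021, 3, 1, 9, 0), "Cache eviction storm"),
    MLIncident("inc-2", datetime(2021, 3, 1, 11, 50), "DB connection errors"),
    MLIncident("inc-3", datetime(2021, 3, 1, 14, 0), "Disk latency spike"),
]

# A signal raised by an incident management tool at 12:00.
signal_time = datetime(2021, 3, 1, 12, 0)
matches = incidents_for_signal(signal_time, detected)
for incident in matches:
    print(incident.incident_id, incident.summary)
```

The lookback is longer than the lookahead in this sketch because root cause indicators usually appear before the symptom that raised the alert.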

Read more here: You've Nailed Incident detection, what about Incident Resolution.



PagerDuty with Zebrium to augment incidents with root cause


Zebrium Autonomous Incident

Step 4 - Incident reports

The hotspots detected in the steps above are packaged into human-readable incident reports. Incident reports make it easy for a user to clearly see root cause indicators as a correlated set of anomalous log events and/or metrics.

The entire process is autonomous and requires no manual configuration, user-defined thresholds or alert rules.

Getting started is free and easy

Spend just two minutes of your time and you'll be amazed at what we detect!