Getting started

Our goal is to let you connect a new application in 2 minutes, and start reliably detecting software incidents within the first hour. All with zero configuration (and absolutely no pre-built rules or knowledge of the app)!

   

Within a minute you'll receive an email with your account credentials (if you don't see it, just check your junk folder). When you first log in, a screen will appear with customized instructions for installing our log collector.

In Kubernetes, this involves two Helm or kubectl commands. The collector is lightweight and secure, and runs as a DaemonSet. Then, within a minute or two, your data will start appearing in the UI.

Watch this 1.5-minute video that shows you how.

GETTING STARTED VIDEO

From the moment data arrives, our machine learning will begin structuring your logs and learning patterns. After that, just sit back and wait for it to detect incidents. 

Our goal is to achieve reliable incident detection within the first hour - with absolutely zero configuration required.

Zebrium log collectors are based on the popular Fluentd open source collector. You can find the docs here and the source code on the Zebrium GitHub repository. The log collectors have a lightweight footprint and are designed to stream logs from Kubernetes, Linux and CloudWatch. Details on how Fluentd collects Kubernetes metadata can be found in this blog.

All Zebrium log collectors use an authentication token to securely deliver data to your account (your token is private and can be found in Settings / Log Collector). Please keep this token safe and do not share it with others. Log data is encrypted in transit from your network and also encrypted at rest within the Zebrium service.

In Kubernetes environments, the collectors automatically discover application, operating system and Kubernetes logs. In Linux environments, the collector looks for logs in /var/log/*.log, /var/log/syslog, /var/log/messages and /var/log/secure by default. You can specify other locations by setting the ZE_LOG_PATHS environment variable (see here for details).
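To preview which files a given setting would pick up on a Linux host, a small script like the one below can help. It is only a sketch, not part of the collector: it assumes that ZE_LOG_PATHS holds a comma-separated list of paths or glob patterns and that setting it replaces the defaults. Both points are assumptions, so check the collector documentation for the actual format. The function names are hypothetical.

```python
import glob
import os

# Default locations named in the text above.
DEFAULT_LOG_PATTERNS = [
    "/var/log/*.log",
    "/var/log/syslog",
    "/var/log/messages",
    "/var/log/secure",
]


def resolve_log_patterns(env=None):
    """Return the path patterns that would be watched.

    ASSUMPTION: ZE_LOG_PATHS is a comma-separated list of paths or glob
    patterns, and it replaces the defaults when set. Verify this against
    the collector documentation.
    """
    env = os.environ if env is None else env
    raw = env.get("ZE_LOG_PATHS", "").strip()
    if not raw:
        return list(DEFAULT_LOG_PATTERNS)
    return [p.strip() for p in raw.split(",") if p.strip()]


def expand_patterns(patterns):
    """Expand glob patterns into the sorted list of files that exist now."""
    files = set()
    for pattern in patterns:
        files.update(glob.glob(pattern))
    return sorted(files)


if __name__ == "__main__":
    for path in expand_patterns(resolve_log_patterns()):
        print(path)
```

Running the script with and without ZE_LOG_PATHS set shows the difference between the default locations and a custom list before you change the collector's configuration.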

We adhere to industry best practices for security, including encryption of data in flight, AES-256 encryption of data at rest, optional granular removal of sensitive records or fields, secure isolation of customer data, and the option of a dedicated instance and VPC. All customer data will be deleted upon termination of service or by request. Details of our security policy can be found here. Users can also specify particular events to filter out if they contain sensitive data.

As soon as log streams are received by Zebrium, our machine learning will begin to automatically learn the structure and patterns in your logs and start detecting incidents. When an incident is detected, we will alert you by email and by Slack (you will receive an invitation to join our Slack community the first time you log in). The learning process is fast and should start producing accurate results within an hour of the first data being ingested.

The machine learning works in multiple phases:

• The ML learns how to structure and categorize every event. Although many apps produce millions or billions of log events per hour, they will typically only have a few thousand unique types of log events. All event variables are also extracted into columns which allows for very powerful structured queries and easy charting of any string or metric (as well as many other things). The schema is automatically maintained as event structures change.
• Next, the ML learns the patterns for each event type (frequency, periodicity, when it starts, when it stops, etc.). When an event breaks pattern, it is flagged as an “anomaly” and scored (depending on how anomalous it is).
• Anomaly detection on individual events is too “noisy” to reliably detect incidents on its own, because individual events frequently break pattern. So the next phase of ML looks for correlated sets of anomalous events that occur across containers or log sources. If any of these correlations reach the right thresholds, an incident is created.
• When an incident is created, the “leading edge” is identified - the first anomalous event(s) that triggered the incident. This is useful to indicate the root cause of the incident.
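The phases above can be illustrated with a deliberately simple toy. This is not Zebrium's ML. It swaps in basic stand-ins (regex masking for event typing, rarity in a batch for anomaly detection, a fixed time window for correlation) purely to show how the stages feed each other. All names, thresholds and the 60-second window are assumptions made for the illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Phase 1 (toy): mask variable tokens (hex values, numbers, dotted numbers)
# so that lines with the same shape share one event type.
_VARIABLE = re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+(?:\.\d+)*\b")


def event_type(message):
    return _VARIABLE.sub("<*>", message)


@dataclass(frozen=True)
class Event:
    ts: float     # timestamp in seconds
    source: str   # container or log source name
    message: str


def find_anomalies(events, max_count=1):
    """Phase 2 (toy): flag events whose type is rare within this batch."""
    counts = Counter(event_type(e.message) for e in events)
    return [e for e in events if counts[event_type(e.message)] <= max_count]


def _close(group, incidents, min_sources):
    # A group becomes an incident only if it spans enough distinct sources.
    if len({e.source for e in group}) >= min_sources:
        incidents.append({"leading_edge": group[0], "events": list(group)})


def find_incidents(anomalies, window=60.0, min_sources=2):
    """Phases 3 and 4 (toy): group anomalies that occur close together in
    time; the earliest event in a qualifying group is its leading edge."""
    incidents, group = [], []
    for e in sorted(anomalies, key=lambda e: e.ts):
        if group and e.ts - group[-1].ts > window:
            _close(group, incidents, min_sources)
            group = []
        group.append(e)
    _close(group, incidents, min_sources)
    return incidents
```

In this toy, a single rare event from one source never becomes an incident. Only rare events from two or more sources that cluster in time do, which mirrors the reason given above for correlating anomalies rather than alerting on each one.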

 

Currently, the ML identifies the incident and its root cause for 2/3 of real incidents (this rate is improving as the models learn from more data). To help improve the ML model, we rely on our users to rate the incidents that are created. This is extremely simple. Next to each incident, you will see three buttons:

• Like: Tells Zebrium this is a valid incident. Zebrium uses this to measure and improve the accuracy of the ML model.
• Mute: Tells Zebrium that this is a valid incident but that you don't want to be alerted on this type of incident. Future incidents of this type will not generate alerts and will be automatically put into the Muted list.
• Spam: Tells Zebrium this is an invalid incident. Zebrium uses this to improve the ML model and reduce the number of false positive incidents. Future incidents of this type will be automatically put into the Spam queue.
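The effect of the three buttons on future incidents of the same type can be sketched as a small routing function. This is only an illustration of the behavior described above, not Zebrium's implementation: the "inbox" destination, the incident type keys and the assumption that Spam incidents do not generate alerts are all hypothetical.

```python
from enum import Enum


class Feedback(Enum):
    LIKE = "like"
    MUTE = "mute"
    SPAM = "spam"


def route_incident(incident_type, feedback_by_type):
    """Return (destination, send_alert) for a new incident.

    feedback_by_type maps an incident type to the last rating a user gave
    incidents of that type. Types with no rating are treated like LIKE.
    """
    feedback = feedback_by_type.get(incident_type)
    if feedback is Feedback.MUTE:
        # Valid incident, but the user does not want alerts for this type.
        return ("muted", False)
    if feedback is Feedback.SPAM:
        # Invalid incident; ASSUMPTION: no alert is sent for spam.
        return ("spam", False)
    return ("inbox", True)


ratings = {"disk-full": Feedback.MUTE, "noisy-restart": Feedback.SPAM}
print(route_incident("disk-full", ratings))
print(route_incident("db-timeout", ratings))
```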

By default, alerts are sent by email and are also posted in your private channel on the Zebrium Slack community.

If you would like to change email settings or send alerts to a Slack channel of your choice, please log in to the Zebrium portal and go to the Notifications page in the Settings menu.

Please send questions via your Slack channel or by emailing hello@zebrium.com.