Observability: It's Time to Automate the Observer

June 15, 2022 | Larry Lancaster

Application monitoring is experiencing a sea change. You can feel it as vendors rush to include the phrase "root cause" in their marketing boilerplate. Common solutions enhance telemetry collection and streamline workflows, but that's no longer enough. Autonomous troubleshooting is becoming a critical (but largely absent) capability for meeting SLOs, and at the same time it is becoming practical to attempt. This profound transformation is the inevitable consequence of a few clear trends:

1.) Per-Incident Economics (the Motive) - Production incidents capable of impacting thousands of users in a short period of time are now commonplace. It's no longer enough to automate incident response based on lessons learned from the first occurrence of a new kind of problem, as was common in the shrink-wrap era, because the first occurrence alone can be devastating. These economics provide the motive for automating the troubleshooting step.

2.) Analytic Technologies (the Means) - It used to be cost- and effort-prohibitive to characterize and correlate metric, trace, and log data at scale and in near-real time. Ubiquitous access to fast storage and networks, along with steady advances in OLAP technologies and unsupervised learning algorithms, now gives us the means to close that gap with automation.

3.) The Troubleshooting Bottleneck (the Opportunity) - Runtime complexity (C) and operational data volume (V) continue to grow, and the human eyeball, the bottleneck in troubleshooting, doesn't scale. As C and V each grow linearly, MTTR for new, unknown, or complex issues grows quadratically (~C·V): double both the complexity and the data volume, and the search space for a novel issue roughly quadruples. This burgeoning time sink gives us the opportunity to tangibly improve troubleshooting with automation, with ever-growing benefits into the future.

Root Cause as a Service (RCaaS)

Because of these trends, we believe it's time for a generally useful, generally applicable RCaaS tool, and we believe we have built one. Zebrium delivers RCaaS, and here's what we mean by that: it's proven (we'll explain how below), and it delivers a fast and easy RCA experience, wherever and however you want it.

We believe that an autonomous troubleshooting tool should work in the general case, out of the box, stand-alone or in tandem with any other observability vendor's tools, and without exotic requirements (rules, training, etc.). The solution should be agnostic to observability stack and ingest method, making no assumptions about what you run or how you run it.

We've Started with Logs

In any journey, you have to start somewhere. The founder of a well-known tracing company once said: "metrics give you when; traces give you where; logs give you why (the root cause)". It's not always true, but as a rule of thumb it's not bad. Here's another rule of thumb you'll hear everywhere: digging through logs to root-cause a new, unknown issue is one of the most dreaded experiences in DevOps today.

We believe that if an autonomous troubleshooting tool has to do one thing well, it should find the same root-cause indicators in the logs that you would otherwise have had to dig out yourself. The solution should have first-class support for generic, unstructured logs, and it shouldn't require parsers, alert rules, connectors, training, or any other configuration to work well.

We've Done the Hard Stuff

Supporting generic, unstructured logs by correctly inferring their types and parameters behind the scenes is hard. Learning metadata from scratch, at ingest, custom to a particular deployment, is hard. Correlating anomalies across log streams to formulate root-cause reports is hard. Summarizing those reports is hard. These are all incredibly hard problems, but in our view they had to be solved to deliver generally useful, autonomous troubleshooting.
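To make the first of those problems concrete, here is a minimal sketch (ours, for illustration only; not Zebrium's actual algorithm, which learns structure without hand-written rules) of the general idea behind event-type inference: mask the tokens that vary from one occurrence to the next, so that log lines emitted by the same print statement collapse into a single template, with the masked values recovered as parameters.

    import re

    # Toy event-type inference: mask variable tokens so that log lines
    # produced by the same print statement collapse to one template
    # ("event type"), with the masked fields recovered as parameters.
    # Illustration only -- not Zebrium's implementation.
    MASKS = [
        (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),  # IPv4 before bare numbers
        (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),          # hex identifiers
        (re.compile(r"\b\d+\b"), "<NUM>"),                     # remaining numbers
    ]

    def template_of(line):
        params = []
        for pattern, token in MASKS:
            params.extend(pattern.findall(line))
            line = pattern.sub(token, line)
        return line, params

    for line in [
        "conn from 10.0.0.7 failed after 3 retries",
        "conn from 10.0.0.9 failed after 5 retries",
    ]:
        print(template_of(line))
    # Both lines collapse to 'conn from <IP> failed after <NUM> retries',
    # so anomalies can be scored per event type rather than per raw line.

Doing this reliably for arbitrary, never-before-seen log formats, without the hand-written masks above, is exactly the hard part.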

Why You Should Trust Us

Vendors have muddied the playing field by hyping "AI" and "ML" tools that don't work very well, so why should you trust that our tool can add value? Well, vendors generally don't present large-scale, quantitative, third-party studies of their tools' effectiveness in real-world scenarios across multiple stacks. We believe such studies are an important criterion for buyers selecting tools, and we have results of exactly that kind to share with you.

Cisco Systems wanted to know whether they could trust the Zebrium platform before licensing it, so they ran a multi-month study of 192 customer incidents across four very different product lines (including the Webex client and UCS servers, among others). These incidents were chosen because they were the most difficult to root-cause, because they had been solved by the most senior engineers, and because their root cause was inferable from the logs.

Cisco found that Zebrium created a report at the right time, with the right root-cause indicators from the logs, over 95% of the time. You can read more details about this study here.

Beyond Cisco, we have many satisfied customers, from petascale SaaS companies deploying into multiple GEOs with K8s, to MSPs monitoring Windows farms, to enterprises troubleshooting massive production database applications.

Come on this Journey with Us

We've built something very special here. We're not trying to bamboozle you. We have real evidence from the real world that shows our tech works. We've built the first credible, accurate, third-party-proven tool that autonomously delivers root-cause from logs to the dashboard.

Want to run Zebrium in the cloud? We can do that. Want to run it on-prem? Our stack can be deployed on-prem with a Helm chart. Want to monitor a modern, cloud-native K8s environment? We have a chart for that too, and customers running K8s clusters with hundreds of nodes in multiple GEOs. We also support ingest via Fluentd, Logstash, CloudWatch, syslog, API, and CLI, and we're happy to expand our offerings to support our customers' needs.
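As a concrete illustration of the API path, here is a hypothetical sketch of shipping a log line over HTTP. The endpoint URL, header, and payload fields are placeholders we invented for this example, not Zebrium's actual API; check the documentation for the real interface.

    import json
    import urllib.request

    # Hypothetical API-ingest sketch: endpoint, token, and payload shape
    # are placeholders, not Zebrium's real API.
    ENDPOINT = "https://logs.example.com/api/v1/ingest"  # placeholder URL
    TOKEN = "YOUR_INGEST_TOKEN"                          # placeholder token

    payload = json.dumps({
        "service": "checkout",
        "host": "node-17",
        "message": "conn from 10.0.0.7 failed after 3 retries",
    }).encode("utf-8")

    request = urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + TOKEN,
        },
    )
    with urllib.request.urlopen(request) as response:
        print(response.status)  # expect a 2xx status on success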

With support for dashboards from Datadog, New Relic, Dynatrace, Elastic/Kibana, Grafana, ScienceLogic, and AppDynamics, we'll get you up and running with autonomous RCA feeding right into your existing monitoring workflow.

Sign up for a free trial at https://www.zebrium.com, or send us an email at hello@zebrium.com to arrange a purchase or PoC.

Read More

Log Anomaly Detection Using Machine Learning

June 21, 2021 | Larry Lancaster

At Zebrium, we have a saying: “Structure First”. We talk a lot about structuring because it allows us to do amazing things with log data. But most people don’t know what we mean by “structure”, or why it is a necessity for accurate log anomaly detection.


Using GPT-3 for plain language incident root cause from logs

January 9, 2021 | Larry Lancaster

This project is a favorite of mine and so I wanted to share a glimpse of what we've been up to with OpenAI's amazing GPT-3 language model. Today I'll be sharing a couple of straightforward results. There are more advanced avenues we're exploring for our use of GPT-3, such as fine-tuning (custom pre-training for specific datasets); you'll hear none of that today, but if you're interested in this topic, follow this blog for updates.

You can also see some real-world results from our customer base here.


Virtual tracing: A simpler alternative to distributed tracing for troubleshooting

July 21, 2020 | Larry Lancaster

Distributed tracing is commonly used in Application Performance Monitoring (APM) to monitor and manage application performance, giving a view into what parts of a transaction call chain are slowest. It is a powerful tool for monitoring call completion times and examining particular requests and transactions.


Is Autonomous monitoring the anomaly detection you actually wanted?

April 15, 2020 | Larry Lancaster

Automatically Spot Critical Incidents and Show Me Root Cause

That's what I wanted from a tool when I first heard of anomaly detection. I wanted it to do this based only on the logs and metrics it ingests, and alert me right away, with all this context baked in...


Deploying into Production: The need for a Red Light

July 23, 2019 | Larry Lancaster

As scale and complexity grow, there are diminishing returns from pre-deployment testing. A test writer cannot envision the combinatoric explosion of coincidences that yield calamity. We must accept that deploying into production is the only definitive test.


Structure is Strategic

October 31, 2018 | Larry Lancaster

We structure machine data at scale

Zebrium helps dev and test engineers find hidden issues in tests that “pass”, find root-cause faster than ever, and validate builds with self-maintaining problem signatures. We ingest, structure, and auto-analyze machine data - logs, stats, and config - collected from test runs.
