A few weeks ago, Larry, our CTO, wrote about a new beta feature leveraging the GPT-3 language model: Using GPT-3 for plain language incident root cause from logs. To recap – Zebrium’s unsupervised ML identifies the root cause of incidents and generates concise reports (typically 5-20 log events) that identify the first event in the sequence (typically the root cause), the worst symptom, other associated events, and correlated metric anomalies.
As Larry pointed out, this works well for developers who are familiar with the logs, but it can be hard to digest if an SRE or frontline ops engineer isn’t familiar with the application internals. The GPT-3 integration allows us to take the next step – distilling these root cause reports into concise natural language summaries. Because GPT-3 has been trained on text from across the internet, including descriptions of similar incidents, it can produce a brief “English” description for a user to scan.
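To make the distillation step concrete, here is a minimal sketch of how an RCA report’s log events might be turned into a summarization prompt for GPT-3. This is illustrative only: the function name and prompt wording are our assumptions, not Zebrium’s actual implementation.

```python
def build_summary_prompt(log_events):
    """Turn an RCA report's log events into a plain-language summarization prompt.

    Hypothetical helper: Zebrium's real prompt format is not public.
    """
    numbered = "\n".join(f"{i}. {event}" for i, event in enumerate(log_events, 1))
    return (
        "The following log events describe a software incident, "
        "starting with the likely root cause:\n"
        f"{numbered}\n\n"
        "In one or two plain-English sentences, summarize what went wrong:"
    )

# Example RCA report events (fabricated for illustration)
events = [
    "oom-killer invoked: gfp_mask=0x201da, order=0",
    "Killed process 1234 (mysqld) total-vm:8123456kB",
]
prompt = build_summary_prompt(events)
```

The resulting prompt string would then be sent to a text-completion endpoint, with the model’s completion displayed as the summary.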
After a few weeks of beta testing this feature with a limited group, and examining results from a couple of hundred incidents, we’re now ready to share some exciting results and expand access to ALL Zebrium users, even those on free trials.
In a nutshell – it works so well and in such a wide range of scenarios that we felt most users would benefit from having access to it. These summaries are both accurate and truly useful – distilling log events into a description a frontline or experienced engineer can easily understand.
This is still an early-stage feature for us, and there are cases where GPT-3 veers into guesswork and suggests summaries that seem related to the core RCA report, but aren’t exactly right. To make sure users know this, we tag the summaries with an “EXPERIMENTAL” badge in the UI.
There are also times when the specific RCA report does not yield a particularly illuminating natural language summary beyond recapping the key log event(s). For instance –
There are several possible reasons for these suboptimal outcomes. One possibility is that there simply aren’t enough examples of that type of issue in the public domain, so GPT-3 is responding with the closest details it can find. Another is that we haven’t yet explored all the variants of prompts and options we can use with the GPT-3 model.
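Exploring prompt and option variants can be done systematically. The sketch below (our own illustration, assuming a completion-style API with `prompt`, `temperature`, and `max_tokens` parameters) builds one candidate request per combination of prompt template and sampling temperature, so the resulting summaries can be compared side by side:

```python
from itertools import product

# Candidate prompt templates and sampling temperatures to compare
# (illustrative values, not Zebrium's actual settings)
TEMPLATES = [
    "Summarize the root cause of this incident in plain English:\n{events}",
    "Explain what went wrong, as if to an on-call SRE:\n{events}",
    "What is the most likely root cause?\n{events}\nAnswer:",
]
TEMPERATURES = [0.0, 0.3, 0.7]

def make_candidate_requests(events_text):
    """Build one request payload per (template, temperature) combination."""
    return [
        {
            "prompt": template.format(events=events_text),
            "temperature": temp,
            "max_tokens": 80,
        }
        for template, temp in product(TEMPLATES, TEMPERATURES)
    ]

requests = make_candidate_requests(
    "kernel: Out of memory: Kill process 1234 (mysqld)"
)
# 3 templates x 3 temperatures -> 9 candidate requests to evaluate offline
```

Each payload could then be submitted to the model and the nine completions ranked by a human reviewer or a scoring heuristic.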
The good news is that even when results are suboptimal, they are mostly not misleading and are easily ignored. More importantly, our ML-generated root cause summaries are the perfect input source for GPT-3, and with more work, the outcomes will only get better from here.
The great news is that it works well more often than not, and the results are genuinely useful. Here are some examples where the GPT-3 summary described the event collection accurately and helped the user quickly digest the RCA. Note: we have obfuscated potentially sensitive details, and we’re not sharing the raw log events for the same reason, although they would be useful to compare alongside the summaries.
As a first bucket, here are some interesting and useful incident summaries related to memory starvation:
Then, here are some other infrastructure related incidents:
For variety, here are some database related incidents:
Finally, here are some examples of security related incident summaries:
Our focus is on cutting troubleshooting time by using machine learning to summarize the key event sequences that describe an incident, based on logs and associated metric anomalies. The GPT-3 integration is a big step toward that goal – enabling quick review of RCA reports by anyone, even personnel who aren’t intimately familiar with application internals. As described above, there are still improvements to be made, but it works so well in real-world scenarios that we are now opening it up to all our users.