Never troubleshoot the same problem twice

March 18, 2019 | Gavin Cohen

Part one of this blog (Troubleshooting the easy way) explains Zebrium’s new paradigm for troubleshooting technical issues. Instead of dealing with log files and metrics, we automatically structure, analyze and visualize the data so that problems are represented as colored bands in a heatmap. This quickly guides a user to the root cause without having to hunt for a needle in the haystack.

 

 


But this is only half of the Zebrium solution.

 

The second half is making sure that once an issue has been resolved, future occurrences are guaranteed to be automatically detected. This highly desirable goal has proved to be extremely challenging for technology vendors.

 

To set the context, we are not talking about chatbots or knowledge base tools that attempt to identify a problem by the way it’s described in words. Rather, we’re talking about a deterministic approach that identifies a problem based on its exact machine data signature, i.e. how the problem exhibits itself in logs, metrics and config files. This second approach is the holy grail for technology vendors because it means that once a problem has been solved, it never needs to be manually identified or solved again.

 

Building problem signatures

 

In most current (non-Zebrium) implementations, signatures are built by writing scripts that parse data using regular expressions (regexes) and standard Unix commands like grep and awk. Here are a few typical examples:

 

  • Very simple: Trigger when the event “Host: linux01 not found” occurs, irrespective of the host name:

    • if grep -q "Host.*not found"; then echo Triggers; fi

  • Simple: Trigger whenever the event “Memory utilization reached XX% - attention needed” occurs, if XX% is greater than 95%:

    • awk 'match ($0, /Memory utilization reached.*- attention needed/) {if ($4+0 > 95) {print "Triggers"}}'

  • It quickly gets complicated: Trigger when the event “Host: linux01 not found” occurs, irrespective of the host name, but only if it’s followed within 10 minutes by the event “Memory utilization reached XX% - attention needed” and if XX% is greater than 95%.

  • And even harder when conditions span log files: First look in log file1 and find when event type 1 occurs within 5 minutes of event type 2, then find the most recent value of the transaction ID that was set prior to event 1. Now trigger if an event type 3 is found in log file2 with the same transaction ID, as long as it occurs within the timespan between events 1 and 2.
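To make the jump in complexity concrete, here is a minimal sketch of the third signature above (host not found, followed within 10 minutes by memory utilization above 95%). It is written in Python rather than grep/awk for readability. The timestamp layout at the start of each line, the regular expressions and the function name signature_triggers are assumptions made for illustration. They are not taken from any real product or log format, and the sketch assumes the lines arrive in chronological order.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout (hypothetical): "YYYY-MM-DD HH:MM:SS <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
HOST_RE = re.compile(r"Host: \S+ not found")
MEM_RE = re.compile(r"Memory utilization reached (\d+(?:\.\d+)?)% - attention needed")
WINDOW = timedelta(minutes=10)


def signature_triggers(lines):
    """Return True if a 'Host: ... not found' event is followed within
    10 minutes by a memory utilization event above 95%.

    Assumes the lines are in chronological order.
    """
    last_host_event = None  # timestamp of the most recent host event seen
    for line in lines:
        parsed = LINE_RE.match(line)
        if not parsed:
            continue  # skip lines that do not fit the assumed layout
        when = datetime.strptime(parsed.group(1), "%Y-%m-%d %H:%M:%S")
        message = parsed.group(2)
        if HOST_RE.search(message):
            last_host_event = when
            continue
        mem = MEM_RE.search(message)
        if mem and last_host_event is not None:
            high = float(mem.group(1)) > 95
            if high and when - last_host_event <= WINDOW:
                return True
    return False


if __name__ == "__main__":
    sample = [
        "2019-03-18 10:00:00 Host: linux01 not found",
        "2019-03-18 10:05:00 Memory utilization reached 97% - attention needed",
    ]
    print("Triggers" if signature_triggers(sample) else "No match")
```

Even this small case needs timestamp parsing, state carried between lines and a decision about ordering. The cross-file example adds transaction ID tracking and a second input on top of that.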

  

What we’ve seen across many customers is that practical challenges make it very hard to build and deploy automated signature detection broadly, except for a small number of “low-hanging fruit” issues. This is because:

 

  • It requires specialist skills and considerable time to develop signatures, so only a small subset of engineers (sometimes in a dedicated team) build them.

  • After signatures have been created, it takes additional time to test, refine and maintain them across product releases.

  • Limited resources mean there will always be a long queue of known problems that do not yet have signatures.

  • Beyond the challenges of creating signatures, it takes a sizable investment in people, tools, data pipelines, processes and infrastructure to maintain an automated signature management capability.

 

Zebrium brings signatures to everyone

 

Our goal is to make it simple for anyone in the support and engineering teams to create signatures. In fact, we want it to be so easy and fast that it becomes part of their everyday routine. Here’s how it works.

 

Once a problem has been solved (see Troubleshooting the easy way), the user starts by selecting the events and/or metrics that characterize the issue (this takes a few clicks in the UI). From here the “Signature Builder” automatically suggests a definition based on the events selected, their relative sequence and timing, and values in the fields that comprise those events.

 

It’s worth watching a short video that demonstrates the signature building process. The example that follows comes from the video and shows what the signature builder created after selecting three events.

[Image: the signature created by the signature builder from the three selected events]

In this example, the signature will only trigger if three events (A, B and C) are found and the following conditions are met:

  1. Event A occurs first, followed by events B and C.

  2. The three events A, B and C match the exact form shown in the picture above (note the variable parts %s and the fixed text).

  3. All three events must occur within 60 seconds.

  4. Specific variables in events A and B must match specific text (Stopped and ServiceRunner).
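For readers who think in code, the four conditions can be sketched as a small matching function. This is not Zebrium's implementation or signature format. The Event class, its field names and find_match are hypothetical. The sketch assumes that each log line has already been matched to an event template (A, B or C), that the checked variable in A must equal Stopped and the one in B must equal ServiceRunner, and that B and C may arrive in either order after A.

```python
from dataclasses import dataclass


@dataclass
class Event:
    etype: str        # "A", "B" or "C": the template the line matched (condition 2)
    timestamp: float  # seconds since an arbitrary epoch
    value: str = ""   # the variable (%s) field that the signature checks


def find_match(events, window=60.0):
    """Return the first (A, B, C) triple that satisfies the conditions, else None."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    for i, a in enumerate(ordered):
        # Condition 4 for A: the checked variable must equal "Stopped".
        if a.etype != "A" or a.value != "Stopped":
            continue
        # Conditions 1 and 3: only events after A and within the time window.
        later = [e for e in ordered[i + 1:] if e.timestamp - a.timestamp <= window]
        # Condition 4 for B: the checked variable must equal "ServiceRunner".
        b = next((e for e in later if e.etype == "B" and e.value == "ServiceRunner"), None)
        c = next((e for e in later if e.etype == "C"), None)
        if b is not None and c is not None:
            return a, b, c
    return None


if __name__ == "__main__":
    sample = [
        Event("A", 0.0, "Stopped"),
        Event("B", 10.0, "ServiceRunner"),
        Event("C", 30.0),
    ]
    print("Triggers" if find_match(sample) else "No match")
```

Because A must come first, checking that B and C each fall within 60 seconds of A is enough to ensure all three events fit inside the window.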

 

It looks complicated, as if it needed real expertise and a long time to create, right?

 

You’d be surprised! It took less than 30 seconds and just a few clicks of the mouse. In fact, just by selecting the 3 events, the signature builder did most of the work.

 

But that’s not all! You can now test the signature against all available machine data and see a summary of how many times it triggered and which customers/instances it triggered on. You can then easily refine or relax the conditions before saving it.

 

The next time the problem occurs, you will be alerted automatically!

 

If it sounds too good to be true, try it yourself

 

Please visit www.zebrium.com and sign up for a demo or trial.

 

Tags: predictive troubleshooting