Reliable signatures to detect known software faults

May 22, 2019 | Gavin Cohen
Never troubleshoot the same problem twice

Have you ever spent time tracking down a bug or failure, only to find you’ve seen it before? Or a variation of this problem: when an automated test run completes, you have to spend time triaging each failure, even though many are caused by the same bug. All of this hurts productivity, especially in continuous integration and continuous deployment (CI/CD) environments, where things change rapidly.

 

There is a solution to this – when an issue occurs, build a “signature” that looks for the specific patterns in log files and metrics that are present when the problem occurs.

 

In most current (non-Zebrium) implementations, signatures are built by writing scripts that parse data using regular expressions (regexes) and standard Unix commands like grep and awk. Here are a few typical examples:

 

  • Very simple: Trigger when the event “Host: linux01 not found” occurs, regardless of the host name:

    • if grep -q "Host.*not found"; then echo Triggers; fi

  • Simple:  Trigger whenever the event “Memory utilization reached XX% - attention needed” occurs, if XX% is greater than 95%:

    • awk 'match ($0, /Memory utilization reached.*- attention needed/) {if ($4+0 > 95) {print "Triggers"}}'

  • It quickly gets complicated: Trigger when the event “Host: linux01 not found” occurs, irrespective of the host name, but only if it’s followed within 10 minutes by the event “Memory utilization reached XX% - attention needed” and if XX% is greater than 95%.

  • And even harder when conditions span log files: First look in log file1 and find when event type 1 occurs within 5 minutes of event type 2, then find the most recent value of the transaction ID that was set prior to event 1. Now trigger if an event type 3 is found in log file2 with the same transaction ID, as long as it occurs within the timespan between events 1 and 2.

  • To make matters worse: Let’s say you successfully build a bunch of scripts. A small log line change in a new software build could mean known issues are completely missed.
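As a rough illustration of how quickly this grows, the third case above (a “not found” event followed within 10 minutes by memory utilization above 95%) might be scripted along the following lines. This is only a sketch: the leading `YYYY-MM-DD HH:MM:SS` timestamp, the function name `triggers`, and the sample lines are assumptions made for illustration, and the log is assumed to be in chronological order.

```python
import re
from datetime import datetime, timedelta

# Assumed log layout: "YYYY-MM-DD HH:MM:SS <message>" (hypothetical format).
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
HOST_NOT_FOUND = re.compile(r"Host: \S+ not found")
MEMORY = re.compile(r"Memory utilization reached (\d+(?:\.\d+)?)% - attention needed")
WINDOW = timedelta(minutes=10)


def triggers(lines, threshold=95.0):
    """Return True if a 'not found' event is followed within WINDOW
    by a memory event whose percentage is above threshold."""
    last_not_found = None  # timestamp of the most recent "not found" event
    for line in lines:
        parsed = LINE.match(line)
        if not parsed:
            continue  # skip lines that do not fit the assumed layout
        when = datetime.strptime(parsed.group(1), TS_FORMAT)
        message = parsed.group(2)
        if HOST_NOT_FOUND.search(message):
            last_not_found = when
            continue
        memory = MEMORY.search(message)
        if memory and last_not_found is not None:
            recent = when - last_not_found <= WINDOW
            if recent and float(memory.group(1)) > threshold:
                return True
    return False


sample = [
    "2019-05-22 10:00:00 Host: linux01 not found",
    "2019-05-22 10:04:30 Memory utilization reached 97% - attention needed",
]
print("Triggers" if triggers(sample) else "No match")
```

Even this small script hard-codes the timestamp layout and the exact wording of both messages, so a reworded line in a new build (for example “Host linux01 could not be found”) would silently stop matching.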

  

What we’ve seen across many customers is that practical challenges make it very hard to build and deploy automated signature detection broadly, except for a small number of “low-hanging fruit” issues. This is because:

 

  • It requires specialist skills and considerable time to develop signatures, so only a small subset of engineers (sometimes in a dedicated team) build them.

  • After signatures have been created, it takes additional time to test, refine and maintain them across product releases.

  • Limited resources mean there will always be a long queue of known problems that do not yet have signatures.

  • Beyond the challenges of creating signatures, it takes a sizable investment in people, tools, data pipelines, processes and infrastructure to maintain an automated signature management capability.

 

Zebrium CI/CD Forensics

 

A key part of our platform is built around making it simple for developers and testers to create deterministic signatures. Our goal is to make it so easy and fast that it becomes part of the CI/CD process. Here’s how it works:

 

Once a problem has been solved, the user selects the events and/or metrics that characterize the issue (this takes a few clicks in the UI). Since we use machine learning to perfectly structure log lines, we have been able to implement a “Signature Builder” that suggests a definition based on the events selected, their relative sequence and timing, and the values of the parameters (variable parts) of the events.

 

Here’s an example of what it looks like:

[Image: example signature definition in the Signature Builder]

In this example, the signature will only trigger if three events (A, B and C) are found and the following conditions are met:

  1. Event A occurs first, followed by events B and C.

  2. The three events A, B and C match the exact form shown in the picture above (note the variable parts %s and the fixed text).

  3. All three events must occur within 60 seconds.

  4. Specific variables in events A and B must match specific text (Stopped and ServiceRunner).
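To make the four conditions concrete, here is a minimal sketch of how such a signature could be checked by hand. It is not Zebrium's implementation. The event templates are invented, since the screenshot's text is not reproduced here, and several things are assumptions: `Stopped` is taken to belong to event A and `ServiceRunner` to event B, B and C may arrive in either order after A, and events are `(seconds, message)` pairs.

```python
import re

WINDOW_SECONDS = 60

# Hypothetical event templates: the fixed text below is invented for
# illustration. Each %s variable in a template becomes a capture group.
EVENT_A = re.compile(r"^Service (\S+) entered state (\S+)$")
EVENT_B = re.compile(r"^Restart requested by (\S+)$")
EVENT_C = re.compile(r"^Connection to (\S+) lost$")


def is_a(message):
    # Event A: exact form, and its state variable must equal "Stopped".
    m = EVENT_A.match(message)
    return m is not None and m.group(2) == "Stopped"


def is_b(message):
    # Event B: exact form, and its variable must equal "ServiceRunner".
    m = EVENT_B.match(message)
    return m is not None and m.group(1) == "ServiceRunner"


def is_c(message):
    # Event C: exact form only, no constraint on its variable.
    return EVENT_C.match(message) is not None


def signature_matches(events):
    """events: list of (seconds, message) pairs.

    True if some event A is followed, within WINDOW_SECONDS,
    by both an event B and an event C.
    """
    a_times = [t for t, msg in events if is_a(msg)]
    b_times = [t for t, msg in events if is_b(msg)]
    c_times = [t for t, msg in events if is_c(msg)]
    for start in a_times:
        end = start + WINDOW_SECONDS
        has_b = any(start < t <= end for t in b_times)
        has_c = any(start < t <= end for t in c_times)
        if has_b and has_c:
            return True
    return False


events = [
    (0.0, "Service web01 entered state Stopped"),
    (12.5, "Restart requested by ServiceRunner"),
    (40.0, "Connection to db01 lost"),
]
print(signature_matches(events))
```

Because A must come first, requiring B and C to fall within 60 seconds of A keeps the whole sequence inside the 60-second window.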

 

It looks complicated, as if it needs expertise and a long time to create, right?

 

You’d be surprised! It took less than 30 seconds and just a few clicks of the mouse. In fact, just by selecting the 3 events, the signature builder did most of the work. But what’s really impressive is the reliability of the signature that’s been created: the event structures and ongoing schema management (since event structures can change across releases) are automated by our ML.

 

Please visit here to pre-register for beta access.

 
