Developers and testers constantly use log files and metrics to find and troubleshoot failures. But there are some challenges which we’ll discuss across a series of blogs:
- Extracting useful information from logs requires wrangling, regexes and parsing scripts (covered in this blog)
- Creating problem signatures that definitively detect known issues is really hard
- Troubleshooting by keyword searching in logs is imprecise
- How do you ensure log parsing scripts work across evolving software versions?
Extracting useful information from logs requires wrangling, regexes and parsing scripts
Here’s a typical log line in /var/log/system.log on my Mac that contains useful information:
May 9 05:35:12 Gavins-MacBook-Pro com.apple.xpc.launchd (com.apple.sandboxd): Service exited due to SIGALRM | sent by kernel_task
Now, let’s say you want to build a report that shows which services exited, and for what reasons. For example, something like this:
. . . . . .
. . . . . .
First you need to understand the structure of log lines like the one above so that you can get the data needed for the report. With a bit of inspection (and some guessing) you might come up with (underlined CAPS represent parameters):
MONTH DAY HH:MM:SS HOST PROCESS1[PID1] (SERVICE[PID2]): Service exited due to SIGNALTYPE | sent by PROCESS2[PID3]
You also need to be able to uniquely identify all occurrences of this “event type” in order to effectively parse out the data. This can be done by searching for some identifying text, for example “Service exited”. I tried this with grep and it turned out to be too general as it also matched other similar events like this one:
May 8 13:39:49 Gavins-MacBook-Pro com.apple.xpc.launchd (com.apple.WebKit.Networking.A4992C6F-8F69-4EA9-A031-76B032FB964F): Service exited with abnormal code: 1
Changing the search term to “Service exited due to” seemed to solve the problem, but I only tested this on a small log file, so it’s possible that there are other event types that would also match this term. The important point here is that you need a way to uniquely identify an event type or your parsing will generate the wrong the data.
Next you need to write a program or script that finds the relevant log lines, strips out REASON and SERVICE and outputs the results. To make the report more readable, you also need to filter out duplicate lines (my grep showed many) and sort by exit reason and sub-sort by SERVICE. Here’s a way you could do this:
- Find all matching events
- Extract REASON and SERVICE by using regexes & parsing out the right data
- Sort the results by REASON
- For each reason, sub-sort by SERVICE
- Print out a line with REASON and SERVICE and suppress any following lines that have the same REASON and SERVICE
Although the above does not involve rocket science, it does require figuring out the right regexes and parsing expressions together with decent coding/scripting skills. But that’s not all.
If the event is from your own application, you’d also have maintain it across versions of software. It’s quite common for text or parameters to be added to existing event types, or for new similar events to be added, both of which can break your parsing. The bigger problem is that you may never even find out that your script stops finding intended matches.
And then the clincher. The above discussion is about just one event type. Most applications have thousands of different event types – each with their own structure. Although you might not need to mine every event type, the times you need to are often the least convenient (e.g. your bug might be holding up a critical build or you might get dragged into a customer escalation).
Zebrium makes it dead simple
Zebrium uses unassisted machine learning to:
- Identify all unique event types (typically thousands per application)
- Build a schema for the event type (structuring fixed text and variable parameters)
- Place each event type into a view where each parameter has its own typed column
All of this is built for simple query and is self-maintained as log file structure changes.
I dragged the log file referenced (/var/log/system.log) into the Zebrium UI. Using just this file, and with no pre-learning or data wrangling, our machine learning performed its magic:
- It identified 500+ unique event types and created a view for each
- It built an index, and categorized each event type by topic, which made it really easy to find what I was looking for (it was named “v_service_exited_due_sent_by”)
- And then one line of SQL produced the report (it really is one line, but I’ve split it into 3 to make it more readable below J). Note you could also have gotten at this data through our UI without writing any SQL.
SELECT DISTINCT _to AS Reason, _str AS Service
Reason | Service
SIGALRM | com.apple.sandboxd
SIGKILL | QA2G25RMZ4.com.wunderkinder.wunderlist-helper
SIGKILL | com.apple.AirPlayUIAgent
SIGKILL | com.apple.LocalAuthentication.UIAgent
SIGKILL | com.apple.OSDUIHelper
SIGKILL | com.apple.SystemUIServer.agent
SIGKILL | com.apple.UserEventAgent-Aqua
SIGKILL | com.apple.ViewBridgeAuxiliary
It took no time at all. I didn’t have to think about regexes, parsing, writing code, etc. I didn’t have to prepare anything in advance or know ahead of time what I wanted to query. With no pre-learning and a tiny data set (58K lines), Zebrium ML structured the entire log file. And with just one line of SQL I produced the report I wanted!
But wait, there’s more!
I was going to end the blog here but decided to do one more thing! I wanted to see if any of the services that exited due to a signal, also exited with an abnormal exit code (remember the second event referenced above “Service exited with abnormal code”). Well, you guessed it, this can also be done with one line of SQL. All I had to do was a join across the views for the two event types. Here’s the result:
SELECT DISTINCT _to AS Reason, a._str AS Service, _code
) B ON A._str = B._str;
Reason | Service | _code
SIGKILL | com.apple.ViewBridgeAuxiliary | 2
If you’re a developer or tester, register for a free trial at www.zebrium.com.