Monitoring today is extremely human-driven. The only thing we’ve automated to date is watching metrics and events and sending alerts when something goes wrong. Everything else, from deploying collectors, building parsing rules, and configuring dashboards and alerts to troubleshooting and resolving incidents, requires substantial manual effort from expert operators who intuitively understand the system being monitored.
Logs are the source of truth when trying to uncover latent problems in a software system. But is searching logs to find the root cause the right approach? In practice, logs are usually too messy and voluminous to analyze proactively, so they are used mostly for reactive troubleshooting once a problem is known to have occurred.
Kubernetes makes it easy to deploy, manage, and scale large distributed applications. But what happens when something goes wrong with an app? And how do you even know? We hear variations on this all the time: “It was only when customers started complaining that we realized our service had degraded”, “A simple authentication problem stopped customers from logging in. It took six hours to resolve”, and so on.
Software log messages are potential goldmines of information, but their lack of explicit structure makes them difficult to analyze programmatically. Tasks as common as extracting (or alerting on) a metric in a log message require carefully crafted regexes that can easily capture the wrong data by accident, or break silently when log formats change across software versions. But there’s an even bigger prize buried within logs: the possibility of using event patterns to learn what’s normal and what’s anomalous.
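To make that fragility concrete, here is a minimal Python sketch. The log format, message wording, and regex are hypothetical, chosen only to illustrate how a hand-written extraction rule silently stops producing data when an upstream format changes:

```python
import re
from typing import Optional

# Hypothetical extraction rule: pull request latency (in ms) out of a log line
# so it can be graphed or alerted on.
LATENCY_RE = re.compile(r"request completed in (\d+)ms")

def extract_latency_ms(line: str) -> Optional[int]:
    """Return the latency in milliseconds, or None if the pattern doesn't match."""
    match = LATENCY_RE.search(line)
    return int(match.group(1)) if match else None

# Matches the format the regex was written against.
print(extract_latency_ms("2024-06-01 10:12:01 INFO request completed in 250ms"))  # 250

# A new release rewords the message and switches units. The regex doesn't
# raise an error; it just returns None, so any alert built on this metric
# quietly stops firing.
print(extract_latency_ms("2024-06-01 10:12:01 INFO request took 0.25s"))  # None
```

Nothing here fails loudly: the dashboard simply goes blank, which is exactly the kind of silent breakage described above.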
Why understand log structure at all?