"As the sphere of understanding grows ever larger, necessarily the surface area of ignorance gets ever bigger." -- Dennis McKenna
Stare at the Abyss
As scale and complexity grow, pre-deployment testing yields diminishing returns. No test writer can envision the combinatorial explosion of coincidences that yields calamity. We must accept that deploying into production is the only definitive test.
Embrace the Unknown
What now? Nihilistic dirges are unhelpful. If we can't avoid it, then we must embrace it... or at the very least, plan for it. If we're willing to assume that an unknown problem might deploy and surface in production, how can we prepare?
Depending on our stack, we might be able to deploy with feature flags. Feature flags let us organize production users into A/B/X, and give us a very simple way to try newly-deployed code and turn it off. Alternatively, we might canary into A/B/X by node, or by pod, and have an automated rollback procedure.
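To make the feature-flag idea concrete, here is a minimal sketch, assuming a hypothetical in-memory flag rather than any particular flag service. The names FeatureFlag, handle_request, and "new-checkout" are invented for illustration. It buckets users deterministically so each user stays in the same group, and it exposes a kill switch that turns the new code path off without a redeploy.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag: sends a fixed share of users to new code."""

    def __init__(self, name: str, rollout_percent: int) -> None:
        self.name = name
        self.rollout_percent = rollout_percent  # 0..100
        self.enabled = True  # the kill switch

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        # Hash flag name + user id so a user always lands in the same bucket.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent

    def kill(self) -> None:
        """Turn the new code path off for everyone."""
        self.enabled = False


def handle_request(flag: FeatureFlag, user_id: str) -> str:
    # Stand-ins for the newly deployed and the known-good code paths.
    if flag.is_on_for(user_id):
        return "new"
    return "old"


if __name__ == "__main__":
    flag = FeatureFlag("new-checkout", rollout_percent=10)
    print(handle_request(flag, "user-1"))
    flag.kill()  # the "red light" fired: everyone goes back to the old path
    print(handle_request(flag, "user-1"))
```

In a real deployment the flag state would live in shared configuration, and something would still have to decide when to call kill(). That decision is the "red light" discussed next.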
Observe the Brokenness: a "Red Light"
We still need a way to tell that A is broken: a "red light", of sorts. Instrumentation plays a role here, but we would like our red light to be as general as possible. Just as a tester can't test for every unknown, a developer can't instrument for every unknown. It's tempting to think that we can simply add instrumentation for each new failure mode as it is found, and that this will be enough.
Unfortunately, as the size of our code base grows ever larger, the surface area of unknowns gets ever bigger. We will never catch up with manual effort and reactive improvements alone.
Make Use of What You've Got
We would like to use all the clues and cues available to feed our red light. Let's see what some recent RCAs might offer as ways to think about detecting un-instrumented problems.
Detect Events that Stop Happening
Stripe recently had an outage due to database bugs, combined with a configuration change (https://stripe.com/rcas/2019-07-10). The thoughtful RCA shows that a problem went undetected for some time since DB nodes were responding as up but had stopped sending their replication metrics.
Here, attention might have been drawn to the problem much earlier by viewing these updates as a train of roughly periodic events that stopped happening. This highlights the importance of being able to tell that something regular has stopped happening.
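One way to picture "a regular event stopped happening" is a small watcher that learns each source's typical gap between events and flags any source that has been quiet for much longer than that. This is only a sketch under assumptions: the HeartbeatWatch class, the source names, the tolerance factor of 3, and the use of the median gap are all illustrative choices, not a description of how Stripe or Zebrium detect this.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


class HeartbeatWatch:
    """Flags sources whose roughly periodic events have gone quiet."""

    def __init__(self, tolerance: float = 3.0, min_gaps: int = 3) -> None:
        self.tolerance = tolerance  # how many typical gaps of silence we accept
        self.min_gaps = min_gaps    # gaps needed before we trust the rhythm
        self._times: Dict[str, List[float]] = defaultdict(list)

    def record(self, source: str, timestamp: float) -> None:
        self._times[source].append(timestamp)

    def silent_sources(self, now: float) -> List[str]:
        silent = []
        for source, times in self._times.items():
            gaps = [later - earlier for earlier, later in zip(times, times[1:])]
            if len(gaps) < self.min_gaps:
                continue  # not enough history to call it periodic
            if now - times[-1] > self.tolerance * median(gaps):
                silent.append(source)
        return sorted(silent)


if __name__ == "__main__":
    watch = HeartbeatWatch()
    for t in range(0, 50, 10):    # db-1 reports every 10s, then stops at t=40
        watch.record("db-1", float(t))
    for t in range(0, 110, 10):   # db-2 keeps reporting through t=100
        watch.record("db-2", float(t))
    print(watch.silent_sources(now=105.0))
```

Nothing here depends on knowing in advance which events matter. Any stream that shows a rhythm can be watched for the rhythm breaking.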
Don't Maintain Regexes (and Do Use Metrics)
This example highlights the difficulty of creating, curating, managing, and maintaining regexes (the same difficulty that keeps a lot of folks from leveraging the full power of their logs and event streams). It also shows that monitoring metrics is obviously important.
Do Use Logs
Honeycomb recently had an outage due to a missing binary (https://www.honeycomb.io/blog/incident-review-you-cant-deploy-binaries-that-dont-exist/). Long story short, the buildevents tool regressed, and so did not exit with a nonzero code despite build errors.
Here, I imagine that noticing new or exceptionally rare events in the build logs themselves could have provided a red light, perhaps before deployment. But without automatically structuring events and building a dictionary of event types (like Zebrium provides), it would be impossible to do this reliably.
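As a rough illustration of the idea (and emphatically not Zebrium's method), the sketch below builds a dictionary of event types from past build logs and flags lines in a new log whose type has rarely or never been seen. The assumptions are loud ones: event types are approximated by masking numbers with a single crude pattern, which is exactly the kind of hand-maintained rule that does not scale, and the log lines are invented for the example.

```python
import re
from collections import Counter
from typing import Iterable, List

# Crude stand-in for automatic structuring: treat numbers as variable fields.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+(?:\.\d+)?")


def event_type(line: str) -> str:
    """Reduce a log line to a template by masking its variable parts."""
    return _VARIABLE.sub("<*>", line.strip())


def build_dictionary(lines: Iterable[str]) -> Counter:
    """Count how often each event type appeared in past logs."""
    return Counter(event_type(line) for line in lines if line.strip())


def rare_events(baseline: Counter, new_lines: Iterable[str],
                max_count: int = 0) -> List[str]:
    """Return new lines whose event type was seen at most max_count times."""
    flagged = []
    for line in new_lines:
        if line.strip() and baseline[event_type(line)] <= max_count:
            flagged.append(line.strip())
    return flagged


if __name__ == "__main__":
    # Hypothetical build log lines, invented for this example.
    past = [
        "compiled 12 files in 3.4s",
        "compiled 40 files in 10.1s",
        "uploaded artifact 7",
    ]
    latest = [
        "compiled 9 files in 1.0s",
        "error: linker failed with status 1",
        "uploaded artifact 8",
    ]
    print(rare_events(build_dictionary(past), latest))
```

The point of the sketch is the shape of the check: a never-before-seen event type in a build log is a signal even when the build tool reports success.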
Zebrium's mission is to be the best "red light" possible for production deployments. We acknowledge and embrace the importance of instrumentation, but we insist that an automatically structured understanding of logged events and incidental metrics is required to complete the mission. It's clear that multiple data sources are required, as is higher-level learning and machine interpretation of patterns in such data.
We are looking for forward-thinking beta testers and design partners to work with us in realizing the "red light" vision.