Proactive RCA for Nebulon's mission critical cloud app

Challenge

The Nebulon cloud management plane, Nebulon ON, is a mission critical service, hosted on Amazon Elastic Kubernetes Service, that Nebulon’s customers rely on every day. The cloud-native Kubernetes architecture of Nebulon ON enables the Nebulon engineering team to rapidly iterate and frequently release enhancements to their end users. However, paramount to every release is reliability. Nebulon sought out a proactive monitoring platform that could not only detect known failure modes, but also catch new (previously unknown) failure modes and help to identify root cause faster.

Solution

Nebulon uses Zebrium for all development and production monitoring and log management tasks. Deployment of the Zebrium collectors in the Nebulon Kubernetes environment was completed effortlessly and the Zebrium platform continues to scale to meet Nebulon’s growing needs. No rules or special configuration were required.

Results

Zebrium ML incident detection saves Nebulon hours of engineering each time an incident occurs
Zebrium consistently generates relevant incidents and helps engineering find root cause with minimal effort
Zebrium ML event parsing allows Nebulon to use existing rich log events instead of retroactively adding custom metrics instrumentation

About Nebulon

Nebulon delivers Cloud-Defined Storage, cloud-managed SaaS enabling on-prem, server-based enterprise storage, which automates operations and eliminates 3-tier infrastructure. The solution is powered by a combination of a cloud-based control plane with IoT endpoints or PCIe cards inside a customer’s application servers. The solution is designed to provide all storage data services needed for enterprise-class applications, and does not consume any server CPU, memory or network resources.

A proactive approach to root cause analysis (RCA)

When it comes to building mission critical data storage solutions, there is zero tolerance for anything that could compromise reliability or resiliency. Fortunately, the seasoned engineering team at Nebulon knows exactly what it takes to build an enterprise-class solution which can be managed at scale. So, when it was time to choose a logging and monitoring platform, Nebulon’s foremost goal was to find a solution that would let them take a more proactive approach to catching and solving software problems.

According to Mike Heyeck, Cloud Lead at Nebulon, “We needed something that could aggregate, store and make our logs searchable. But the key factors that led us to select Zebrium were its ability to automatically parse and categorize log events and then to provide automated incident detection”.

Categorization and ML parsing save the day

Nebulon ON tracks a very large number of storage volume records and encountered a problem where all records were being updated instead of just a few. Since Zebrium had already structured all the log events and extracted embedded metrics, the Nebulon team was able to see immediately that the problem was happening once a day. Doing this without Zebrium would have required additional instrumentation and several days of waiting for a sufficient amount of instrumented data to be collected.

The team surmised that the problem was caused by a pre-condition that was being set incorrectly, and used the Zebrium UI to see which pre-condition events were correlated with the problematic update log events. This allowed them to quickly pinpoint the problem and saved many hours of engineering time.

“With Zebrium, just using the log statements we already had that described how many records were being affected, we were able to go and find the relevant events associated with this anomalous condition. We could then look at the logs around them and debug what the problem was. Without Zebrium we would have had to retroactively go in and add time series data and more traditional instrumentation. This would have added significant time and effort to debugging this problem”, said Mike Heyeck, “Zebrium ML parsing allows us to use logging in lieu of metrics instrumentation. We can simply add counters into log lines and have them immediately available through the Zebrium UI”.
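To illustrate the general idea of using counters embedded in log lines in place of separate metrics, here is a minimal sketch in Python. It does not show how Zebrium parses logs; the log format, the `volume-sync` message, the `records=` field, the sample values and the threshold are all hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime

# Hypothetical log lines; the format and values are illustrative only.
LOG_LINES = [
    "2021-03-01T02:00:05 INFO volume-sync updated records=12",
    "2021-03-01T14:00:07 INFO volume-sync updated records=9",
    "2021-03-02T02:00:04 INFO volume-sync updated records=48211",
    "2021-03-02T14:00:06 INFO volume-sync updated records=15",
    "2021-03-03T02:00:05 INFO volume-sync updated records=48230",
]

# Capture the timestamp and the counter embedded in the message.
PATTERN = re.compile(r"^(\S+) \w+ volume-sync updated records=(\d+)$")


def extract_counts(lines):
    """Return (timestamp, count) pairs for every line that matches PATTERN."""
    results = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            timestamp = datetime.fromisoformat(match.group(1))
            results.append((timestamp, int(match.group(2))))
    return results


def find_spikes(counts, threshold):
    """Return the entries whose counter exceeds the (assumed) threshold."""
    return [(ts, n) for ts, n in counts if n > threshold]


counts = extract_counts(LOG_LINES)
spikes = find_spikes(counts, threshold=1000)
for ts, n in spikes:
    print(ts.isoformat(), n)
```

With the sample data above, the two flagged entries fall roughly a day apart, which is the kind of pattern the team describes spotting from existing log statements alone.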

Incident Detection

Zebrium’s machine learning has helped Nebulon detect and troubleshoot a variety of software incidents. In Mike Heyeck’s words, “Zebrium surfaces anomalies in system operation that would require a lot more work to detect otherwise.”

For example, Nebulon hit an issue after changing a set of certificates, where one deployment had not been updated to reference the new certificates. Zebrium detected this problem immediately and raised an incident which correctly identified the offending deployment.

On other occasions, Zebrium has detected incidents relating to database foreign key violations that caused database updates to fail. Since most of those were due to subtle race conditions, they would have typically been difficult to troubleshoot.

According to Mike Heyeck, “It used to be very tedious to find root cause for these race conditions. For example, if something happened at say 11:21am, we would have to manually look through all the logs around that time to find the relevant incident. Now, Zebrium incidents provide us with a nice summary of relevant log events that makes this process much faster.”
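The manual step described in the quote, looking through all logs around a given time, can be sketched as a simple time-window filter. This is only an illustration of that manual approach, not of Zebrium's incident detection; the event list, the messages and the five-minute window are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed log events as (timestamp, message) pairs.
EVENTS = [
    (datetime(2021, 3, 1, 11, 2), "scheduler: job started"),
    (datetime(2021, 3, 1, 11, 19), "db: insert into volume_map begin"),
    (datetime(2021, 3, 1, 11, 21), "db: foreign key violation on volume_map"),
    (datetime(2021, 3, 1, 11, 22), "api: update failed, retrying"),
    (datetime(2021, 3, 1, 11, 40), "scheduler: job finished"),
]


def events_near(events, incident_time, window_minutes=5):
    """Return events within +/- window_minutes of incident_time."""
    window = timedelta(minutes=window_minutes)
    return [(ts, msg) for ts, msg in events if abs(ts - incident_time) <= window]


incident = datetime(2021, 3, 1, 11, 21)
nearby = events_near(EVENTS, incident)
for ts, msg in nearby:
    print(ts.strftime("%H:%M"), msg)
```

A filter like this still leaves an engineer to read every line in the window and decide which ones matter, which is the tedious part the quote refers to.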

In summary, “We find Zebrium ML detected incidents useful and helpful. I would say Zebrium saves us hours per major incident, and in aggregate, has saved us days.”
