Automatically Catching Incidents and Root Cause in a Cloud Native Stack




“Our cloud provider made an API change which caused problems downstream. Zebrium not only detected the issue, but also helped us debug it quickly.”
Aran Khanna, CEO & Co-founder

The company is on a mission to save cloud customers time and money. Its platform reduces cloud spend by monitoring usage, tracking waste and automating reservation optimizations in real time.

Logs, Logs and More Logs

The company has built a cloud native stack in AWS using Kubernetes, Flask, Celery, Redis, PostgreSQL, Airflow and React. Each instance of each component generates many log streams. Prior to Zebrium, the team relied on vi, grep, awk and custom scripts to analyze logs.

“Our clusters generate a stupid amount of logs which makes it really painful to find what you’re looking for! Whenever we had to troubleshoot an incident, we would spend hours manually going through logs,” said Nikhil Khanna, CTO and Co-founder.

As a rapidly growing startup, the company realized that dealing with logs and metrics would become a bottleneck in its growth unless it quickly implemented a better solution.

Choosing Zebrium

“We chose Zebrium because they provide a new approach to looking at logs. We loved the idea of using machine learning to uncover what we used to look for manually. And there was no friction to get going – about a minute after pasting one Helm command we were up and running.”

Within minutes of installing the Zebrium log and metrics collectors, the team saw value. Zebrium’s categorization of events by event type and log type made it easy to see what was going on. The ability to view a timeline of activity by severity and by custom filters added a dimension to log exploration that was impossible with manual tools.
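To make the idea of grouping events by event type concrete, here is a minimal Python sketch. It is not Zebrium's method, which the case study does not describe; it is a simple, assumed approach that masks variable-looking fields (numbers, addresses, hex values) so that lines with the same structure fall into the same bucket. The names `event_type` and `count_event_types`, the masking rules, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable-looking tokens with placeholders.
# Hex values are masked first so they are not partially treated as numbers.
_MASKS = [
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)*\b"), "<NUM>"),
]


def event_type(line: str) -> str:
    """Reduce a log line to a rough template by masking variable fields."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def count_event_types(lines):
    """Count how many log lines fall into each rough event type."""
    return Counter(event_type(line) for line in lines)


# Made-up sample lines, for illustration only.
sample = [
    "worker 12 finished task 991 in 35 ms",
    "worker 7 finished task 1002 in 41 ms",
    "connection to 10.0.0.5 refused",
]
counts = count_event_types(sample)
for template, n in counts.most_common():
    print(n, template)
```

Even this crude grouping collapses the two "finished task" lines into a single event type, which hints at why categorized views are easier to scan than raw grep output.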

Their First Incident Saved Valuable Hours

The company relies heavily on AWS APIs. A week after first using Zebrium, the team was taken by surprise by an unexpected API change that would have caused a service disruption had it not been caught.

Fortunately, Zebrium picked up the incident automatically and correlated it with events from all the services that were impacted. The resulting auto-generated incident gave them exactly what they needed to quickly see the root cause of the problem.

“Our cloud provider made an API change which caused problems downstream. Zebrium not only detected the issue, but also helped us debug it quickly.” - Aran Khanna, CEO & Co-founder

Today

The company first started using a beta version of Zebrium in February 2020. The team has continued to use and rely on it since, and it has saved them countless hours when troubleshooting bugs.

“If it wasn’t for Zebrium I would have had to spend several hours per bug manually digging through logs. Zebrium makes it so easy.”

Apart from the value Zebrium has added uncovering incidents and root cause, they also love the log exploration capabilities including: being able to see a traceback and all related info, seeing heatmaps of particular event types that are important to them, and easy charting of event variables.

“With Zebrium, our service has become more reliable and we are now able to spend far more time building features instead of hunting through logs to debug problems,” concluded Nikhil Khanna.