“After trying this, I can say only one thing - what SRE wouldn’t want to use Zebrium? It finds the root cause of our problems automatically, and your integration with Kibana is beautiful.”
Bala Sista, Solutions Architect, Personal.ai
Personal.ai brings the creative potential of AI to individuals. A lot of the information we as individuals receive (as much as 80%) is lost because humans can’t retain everything that comes in daily. Personal.ai helps connect the dots across a user’s various communication channels and interests. Its software taps into a user’s social media, blogs, texts, emails, etc., and builds a personalized module for each user to create, collaborate and monetize their work.
Personal.ai uses a modern cloud native software stack, but one that is highly customized to each user, which makes it quite complex. Multiple interfaces are supported, including a web interface and a desktop app. Co-founder and CTO Sharon Zhang explained: “as compared with most software stacks, each user essentially gets a custom pipeline (segregated from all other users), depending on the inputs and sources they use.”
The architecture is microservices based, with over 45 microservices currently in use. The team iterates rapidly, deploying updates as often as 30 times a day, and generally at least 4-5 times a day.
The company launched on Product Hunt last fall and immediately saw a surge in user signups, which led to scaling challenges and the discovery of new bugs: a growing challenge for the small engineering team. Personal.ai had tested multiple observability tools and ultimately settled on a self-hosted Elastic Stack for observability, with dashboards keeping track of various metrics, traces and logs, and PagerDuty for incident management.
However, the combination of cutting-edge AI, highly personalized pipelines and rapid iteration cycles meant that when something went wrong, it was usually very painful to track down and troubleshoot. Pods would come and go depending on each user’s custom pipeline, and some problems were user-specific, making them very difficult to debug.
Particularly early on, there were multiple P1 issues daily, and any challenges in troubleshooting meant rolling back an entire deployment rather than deploying a targeted fix. Some issues were taking hours or even days to debug. Combined, these problems were significantly slowing the pace of new development.
During this early, painful phase, the engineering team leader, Bala Sista, happened to stumble upon Zebrium in a blog post. He signed up for a free trial of Root Cause as a Service (RCaaS) and immediately saw its promise. He provided some feedback on the UI and integrations, and found the Zebrium team very engaged and quick to react to user feedback. This experience, and Zebrium’s brand-new integration with the Elastic Stack, gave him the confidence to become a paying customer.
Bala found that Zebrium greatly improved the troubleshooting experience. Before Zebrium, the Personal.ai team had to drill down into approximately 40 dashboards, then look for errors in logs, and manually correlate all the pieces.
With Zebrium, the root cause reports do all of this for him automatically. And the best part is that RCaaS is integrated with the Elastic Stack, which means Personal.ai can see the details of the root cause found by Zebrium right in the context of a Kibana dashboard. This makes it dead easy to line up the root cause with other symptoms and information to get a full picture of the problem.
The number of P1 issues has dropped, and the time to root cause has been reduced by 60%, freeing up countless hours for the engineering team.
Even better, the root cause details found by Zebrium are very targeted, which means narrow fixes can be rolled out quickly instead of rolling back entire deployments. For example, Personal.ai recently hit an issue with null references in one part of the code, which Zebrium detected and root caused quickly, enabling a quick and precise fix.
The net result of using Zebrium RCaaS is an improved customer experience, less wasted engineering time and faster software cycles.
The Personal.ai engineering team plans to expand into using GPU resources, and to use Zebrium to help identify and root cause any issues that arise there. They also intend to enable the Zebrium integration with PagerDuty to automate the entire root cause workflow.
Bala really likes the care and attention from the Zebrium team, but loves the technology. In his words: “After trying this, I can say only one thing - what SRE wouldn’t want to use this? It finds root cause to our problems automatically, and your integration with Kibana is beautiful.”