Saving Time by Auto-Detecting Incidents Across a Vast Range of Logs




“We were getting very odd error messages from one of our processes. Soon after installing Zebrium, it detected an incident that let me quickly figure out what the actual issue was.” – Yosef Deray, Lead Software Engineer, Iralogix

Iralogix is a financial technology company based in Pittsburgh, PA, that provides a SaaS solution for managing its customers' IRA holdings. Its secure environment is hosted in the Amazon Web Services (AWS) cloud, and it uses Amazon Elastic Container Service (ECS) for container orchestration. The majority of its applications are written in Java and produce a lot of logs!

Challenges with logs

Yosef Deray, one of the company's lead software engineers, came across Zebrium when he read an article about finding anomalies in logs by using machine learning to automatically structure them. Although Iralogix already had a log management solution in place and used Amazon CloudWatch to collect the logs, he was frustrated by the manual effort of searching and browsing across many different logs to find the details of problems and their root cause. After reading the article, he was intrigued by the possibility of using ML to help with this.

At the time, the Zebrium product was still in beta, so he decided to test it on some logs generated by one of the company's QA servers.

Zebrium quickly uncovered the root cause of a security incident

“Recently, we had started getting very odd error messages from one of our processes saying: ‘application x does not have permission to do y’. But what was odd is that there is no code in our software that would ask that process to trigger that action,” Yosef recalled. Suspecting that requests were being mixed up, the team hunted through logs for the offending requests but found no sign that this was the issue. The errors continued on and off.

Soon after sending Zebrium the first log streams from the QA server, Yosef was amazed by what came up.

“Soon after installing Zebrium it detected an incident. Because of what was in the incident and the logs that it correlated across, it allowed me to quickly figure out what the actual issue was. It turned out to be a potential security hole which we quickly fixed.”

The product continued to improve

Although Yosef encountered some bugs and usability issues during beta, Zebrium was able to quickly address them and even implemented several new features based on his suggestions. Based on the good results, Iralogix expanded use from QA to production as well.

“It’s been really great dealing with Zebrium. In fact, I don’t think I would have been comfortable switching to you guys if you weren’t as responsive and helpful in addressing the issues that we had.”

Autonomous Monitoring today at Iralogix

Zebrium has continued to help detect problems and track down their root cause. “The incidents your system is catching have been really useful,” Yosef said, and he related a recent and surprising example: “Someone left our company and we deleted them from all our systems. But it was really cool to see Zebrium created an incident which basically said the user has been deleted!” What he really liked is that no manual rule was needed to look for such an event; Zebrium simply detected “a correlation with a really rare event in one of the logs”. Imagine if this user had been deleted accidentally: Zebrium would likely have caught it!

The Zebrium peek function and the ability to see related logs with just a click have also saved a lot of time. “With our old tool I lost all my context when creating new views,” Yosef said. In addition, having a structured dictionary of events, with the variables in those events parsed out automatically, has made the tool very helpful when looking through logs. For example, he can chart a variable (string or metric) just by clicking on it, or mouse over an event to see its structure without having to parse it.

The future

As Iralogix continues to grow, the team looks forward to having Zebrium by its side to catch any incidents that occur and help find their root cause.