Artificial intelligence (AI) is one of the hottest topics in the world today, and the technology has enormous potential to help with all sorts of enterprise challenges. HEAL has been a leader in applying AI to IT operations management for years. Our customers range from some of the largest banks to some of the largest telcos in the world, and working with them has enabled us to strengthen the AI in our core product to address many of the challenges faced by corporations large and small. Common issues include receiving too many alerts every day, being unable to correlate data from multiple monitoring tools across different domains, and being unable to find the root cause quickly. HEAL’s AI can address them with efficiency and accuracy. In this upcoming series of blogs, we would like to share some insights on these topics.
Over time, corporations have accumulated a variety of monitoring tools. They follow the golden rule of software: don’t touch what is working. Now that they have so many tools, they are confused, since no single tool gives the right insight. AIOps (Artificial Intelligence for IT Operations) emerged as corporations tried to do more with the vast amount of data from the traditional monitoring tools in which they have already invested. The basic idea is to take feeds from different monitoring tools, use machine learning (ML) models to make sense of that data, and then use this knowledge to help with critical tasks. Some of the most powerful features unlocked in this process include predicting outages, correlating events and alerts from multiple sources so that they can be compressed, pinpointing which of thousands of events are likely to be the root cause, and finally using this knowledge to produce solution recommendations that help automate IT operations.
One of the biggest challenges for large corporations when it comes to monitoring IT is not limited visibility but too much visibility: too many tools producing too much data for engineers to digest and analyze. Another challenge is the siloed approach to IT management. Individual groups such as Infrastructure, Network, and Application all have their own powerful monitoring tools on which their engineers have built their expertise, but few engineers are experts across domains. The ability to correlate data and group related events is therefore the first major feature that helps here. A single operational issue often generates hundreds, if not thousands, of events in the system. Using ML to group these events into a single outage provides the critical information needed to assess that outage’s impact. While it may take humans hundreds of hours to look through tens of thousands of events individually, baseline the trends to determine what is “normal”, and understand how many users are affected, HEAL’s AI engine can do this in minutes.
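To make the idea of grouping related events concrete, here is a minimal sketch in Python. It is not HEAL’s engine: it swaps the ML model for a simple time-gap heuristic, and the `Event` fields, the `group_events` function, and the 60-second threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event shape; real monitoring feeds carry many more fields.
    timestamp: float  # seconds since some reference point
    source: str       # name of the monitoring tool that raised the event
    message: str


def group_events(events, max_gap_seconds=60.0):
    """Group events into candidate outages using a time-gap heuristic.

    An event joins the current group when it arrives within
    max_gap_seconds of the previous event; otherwise it starts a new group.
    """
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= max_gap_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


events = [
    Event(0.0, "network", "packet loss on switch-7"),
    Event(10.0, "infra", "high latency on db-1"),
    Event(30.0, "app", "checkout errors rising"),
    Event(500.0, "infra", "disk usage warning on log-2"),
    Event(520.0, "app", "log shipping delayed"),
]
outages = group_events(events)
```

With this sample data, the first three events from three different tools collapse into one candidate outage and the last two into another. A production system would also consider topology and learned patterns rather than timing alone.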
Once HEAL’s AI engine has assessed, correlated, and grouped the events, the next step is to bubble up the events which are likely to be the root cause. Traditionally this would require domain troubleshooting experts who, through years of experience, can prioritize the events which need to be reviewed. When the root causes are solved first, the rest of the issues associated with them also disappear. However, most companies do not have enough such experts. This is where HEAL comes in, providing IT operators with a list of AI-suggested events which should be looked at first. One of the advantages of HEAL’s AI engine is that it does not get burned out by more data points: the more data our AI engine processes, the more accurate its assessments become.
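One simple way to picture root cause prioritization is with a dependency map. The sketch below is an illustrative heuristic, not HEAL’s algorithm: it assumes a hypothetical `depends_on` mapping between components and ranks each alerting component by how many other alerting components depend on it, directly or indirectly.

```python
def transitive_dependencies(depends_on, component):
    """Return every component that `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def rank_root_causes(depends_on, alerting):
    """Order alerting components from most to least likely root cause.

    A component scores one point for every other alerting component
    that depends on it. Ties are broken alphabetically.
    """
    scores = {component: 0 for component in alerting}
    for component in alerting:
        for dep in transitive_dependencies(depends_on, component):
            if dep in scores and dep != component:
                scores[dep] += 1
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical topology: web depends on api, which depends on db.
depends_on = {"web": ["api"], "api": ["db"], "db": []}
alerting = {"web", "api", "db"}
ranking = rank_root_causes(depends_on, alerting)
```

Here `db` lands at the top of the list because both other alerting components rely on it, which matches the intuition that fixing it first should clear the downstream alerts.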
After the Root Cause Analysis (RCA) is complete and the issue has been identified and addressed, HEAL helps our customers automate this process. HEAL’s Solution Recommendation feature takes past actions as input along with the regular monitoring data, then recommends the optimal action to take should a similar issue arise in the future. Paired with HEAL’s outage prediction engine, this is the key to implementing a truly automated, self-healing enterprise.
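As a rough illustration of recommending an action from past actions, the following sketch picks the remediation with the best historical success rate for a given issue type. The record format, the issue and action names, and the `recommend_action` function are hypothetical, and HEAL’s actual feature also draws on monitoring data, which this example leaves out.

```python
from collections import defaultdict


def recommend_action(history, issue_type):
    """Return the past action with the highest success rate for issue_type.

    `history` is a list of (issue_type, action, resolved) tuples.
    Ties on success rate go to the action that was tried more often.
    Returns None when the issue type has never been seen.
    """
    stats = defaultdict(lambda: [0, 0])  # action -> [successes, attempts]
    for past_issue, action, resolved in history:
        if past_issue == issue_type:
            stats[action][1] += 1
            if resolved:
                stats[action][0] += 1
    if not stats:
        return None
    return max(stats, key=lambda a: (stats[a][0] / stats[a][1], stats[a][1]))


# Hypothetical remediation log.
history = [
    ("db_pool_exhausted", "restart_service", True),
    ("db_pool_exhausted", "restart_service", False),
    ("db_pool_exhausted", "increase_pool_size", True),
    ("disk_full", "clear_logs", True),
]
suggestion = recommend_action(history, "db_pool_exhausted")
```

In this sample log, `increase_pool_size` resolved the issue every time it was tried, while `restart_service` worked only half the time, so the former is suggested. With so few records the ranking is fragile; a real system would need far more history before acting automatically.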