by Atri Mandal | Jun 3, 2021
More and more monitoring and alerting solutions are adopting AI in some form to deliver alerts that help mitigate issues and highlight anomalies. However, even with AI democratized, enterprises struggle to keep up with rising infrastructure complexity and to achieve the holy grail of 100% uptime without investments growing exponentially. This blog aims to inform ITOps leaders about the challenges of existing AI models and how HEAL can address them.
HEAL’s patented AI/ML algorithms are well positioned to address the needs of today’s complex microservices-based architectures. With the help of state-of-the-art AI/ML, HEAL helps you predict and prevent incidents before they occur, thereby reducing outage time and operational costs. Faster, more accurate root cause analysis and a contextual alerting system improve team productivity, letting you focus on what is truly important. HEAL’s out-of-the-box integrations with monitoring tools and ITSM platforms mean you can move to the preventive paradigm without jeopardizing existing investments.
Here’s a breakdown of HEAL’s ML capabilities and what they mean for the end user:
Most existing AI-based systems do a post-facto analysis of a problem; that is, they take action after the problem or outage has already happened. The remaining ones have some predictive capabilities, but they fall short in two main ways. First, the number of alerts generated is large, so the real ones are buried under a flood of false alerts. Second, the predicted outliers are often not interpretable, which makes it difficult to take any preventive action based on them.
HEAL takes a proactive approach: it tries to prevent a problem from happening rather than reacting after the problem has happened. With the help of state-of-the-art ML models, HEAL is able to accurately predict a problem well before it actually happens and provide useful insights such as probable root cause, similarity to past incidents and the impact radius.
Our prediction system comprises both multivariate and univariate time series forecasting models, which are able to capture complex relationships and trends within the metric data. With the help of these models, the system can predict outliers at a very early stage. Once outliers are predicted, our causal inferencing system uses a correlation mechanism based on logs and topology to chain them into related events. This chain or sequence is then matched against known problems or incidents, and an early warning is generated.
Additionally, HEAL has a continuous feedback mechanism which can learn from past misses (when an alert was missed) by incorporating information from incident data, reducing the number of recurring problems. This ensures that the recall or coverage of the system consistently improves over time, and we see fewer and fewer outages due to false negatives.
The use of both multivariate and univariate analysis provides the right amount of coverage and interpretability to the early warning system.
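To make the idea of forecasting an outlier before it happens concrete, here is a minimal univariate sketch. It is not HEAL's model: it assumes a plain least-squares trend line over recent samples, and the metric name, threshold and horizon are hypothetical values chosen for illustration.

```python
from statistics import mean

def fit_trend(values):
    """Least-squares slope and intercept for equally spaced samples."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def steps_until_breach(values, threshold, horizon):
    """Steps ahead at which the projected trend first reaches the threshold.

    Returns None if no breach is projected within the horizon.
    """
    slope, intercept = fit_trend(values)
    last = len(values) - 1
    for step in range(1, horizon + 1):
        if intercept + slope * (last + step) >= threshold:
            return step
    return None

# Hypothetical metric: heap usage growing by 2 points per sample.
heap_used_pct = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]
print(steps_until_breach(heap_used_pct, threshold=90.0, horizon=20))  # 10
```

A real early warning system would replace the straight line with models that handle seasonality and cross-metric relationships, but the shape is the same: forecast forward, then raise the warning while there is still time to act.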
Existing tools and technologies need a considerable amount of training data to provide meaningful AI insights. Since time series data may have seasonality and long-term trends, weeks of data may be needed to train DL models. Additionally, these tools may need manual intervention from time to time to set the right thresholds.
We use a combination of ML techniques to train model weights on historical datasets and then reuse the trained weights directly for the target dataset. This enables us to run inferencing on the target dataset the moment data is ingested, without explicitly training on the data and without any manual intervention. To select the correct model, we use NLP-based techniques to map the target metric to a similar metric we have seen in the past, based on input features like business vertical, metric name and description. We adopt an adaptive learning approach which uses self-supervised learning to fine-tune the model weights over time as more data gets ingested. This ensures that the model's predictions improve steadily. The time for the system to stabilize is also much shorter, as we need to train only a few parameters, in contrast to models which need to be fully trained.
Existing ML insights rely on historical data and static thresholds and do not factor in disproportionate workload. Also, most ML-based systems extrapolate poorly, so their forecasts are of limited use.
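The model-selection step can be illustrated with a deliberately simple stand-in. The sketch below assumes token-overlap (Jaccard) similarity over the vertical, name and description fields; the actual NLP technique is not specified in this post, and the catalog entries and field names are hypothetical.

```python
import re

def tokens(text):
    """Lower-case word tokens; underscores and punctuation act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def closest_known_metric(target, catalog):
    """Return (name, score) of the catalog metric most similar to the target."""
    target_tokens = tokens(" ".join(target.values()))
    best_name, best_score = None, 0.0
    for name, fields in catalog.items():
        score = jaccard(target_tokens, tokens(" ".join(fields.values())))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Hypothetical catalog of metrics for which pre-trained weights exist.
catalog = {
    "jvm_heap_used": {"vertical": "banking", "name": "jvm_heap_used",
                      "description": "JVM heap memory in use"},
    "http_latency_ms": {"vertical": "retail", "name": "http_latency_ms",
                        "description": "HTTP request latency in milliseconds"},
}
target = {"vertical": "banking", "name": "heap_memory_used",
          "description": "heap memory used by the JVM"}

name, score = closest_known_metric(target, catalog)
print(name)  # jvm_heap_used
```

Once a match is found, the weights trained for the matched metric would serve as the starting point for the new one, to be fine-tuned as data arrives.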
Workload-behavior correlation is at the heart of our AI based insights for capacity planning and forecasting. The primary insight is that the behavior KPIs are dependent on the transaction or workload volumes. Our ensemble ML models are able to learn the complex non-linear relationships between workload signature and behavior metrics in addition to standard time series features like non-stationarity, seasonality and long-term trends. This results in more accurate prediction and forecasting.
Another key differentiator of HEAL is its ability to extrapolate better. Most ML models are good at interpolation, but they do not perform as well when the test data lie in a different range from the training data. By designing ML models with good extrapolation capabilities, we ensure that our models can provide better insights into future chokepoints and provisioning requirements under hypothetical increases in workload.
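The difference between interpolation and extrapolation can be shown with a toy workload-behavior pair. This sketch assumes a linear relationship between transactions per second and CPU utilization, which is far simpler than the non-linear ensemble models described above; the data points and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def nearest_neighbour(xs, ys, x_new):
    """Interpolation-style predictor: reuse the closest observed point."""
    index = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[index]

# Hypothetical observations: workload (TPS) against CPU utilization (%).
tps = [100, 200, 300, 400]
cpu_pct = [15.0, 25.0, 35.0, 45.0]

slope, intercept = fit_line(tps, cpu_pct)

# What-if question: what happens if the workload doubles to 800 TPS?
print(intercept + slope * 800)               # trend-aware estimate, about 85
print(nearest_neighbour(tps, cpu_pct, 800))  # stuck at 45.0, the last seen value

# Workload at which CPU is projected to saturate, about 950 TPS.
print((100.0 - intercept) / slope)
```

The nearest-neighbour predictor never leaves the range it was trained on, which is the failure mode described above; a model that captures the workload-behavior relationship can answer capacity questions outside that range.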
One of the main problems with traditional AI-based systems is that either alerts are not interpretable, so proper RCA cannot be done, or the alerts are interpretable but the RCA does not consider full-stack information, so the forensic analysis is incomplete.
We use causal inference techniques to model one-way and two-way relationships between metric variables. These models are tuned using a feedback loop that takes into account feedback from users and incident data and dynamically adapts to changing environments. However, low-level SLA metrics alone are often not sufficient for full-stack observability, so a root cause analysis based only on them is partial. This is because some information available in the logs is invisible in the metric data, and vice versa. We therefore use advanced techniques to correlate useful log information with the metric data and topology (both vertical and horizontal), and additionally use our knowledge of workload-behavior characteristics to determine the root cause and use it for forensic analysis.
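As a rough illustration of telling a one-way relationship from a two-way one, the sketch below compares lagged correlations in both directions. Lagged correlation is only a crude proxy and is not the causal inference technique used by HEAL; the metric names, the lag range and the 0.1 margin are assumptions made for the example.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

def lagged_corr(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag]."""
    return pearson(leader[:len(leader) - lag], follower[lag:])

def direction(a, b, max_lag=3, margin=0.1):
    """Guess whether a leads b, b leads a, or neither clearly leads."""
    a_leads = max(lagged_corr(a, b, lag) for lag in range(1, max_lag + 1))
    b_leads = max(lagged_corr(b, a, lag) for lag in range(1, max_lag + 1))
    if abs(a_leads - b_leads) < margin:
        return "two_way_or_unclear"
    return "a_leads_b" if a_leads > b_leads else "b_leads_a"

# Hypothetical metrics: latency spikes follow queue spikes two samples later.
queue_depth = [1, 1, 5, 1, 1, 6, 1, 1, 1, 7, 1, 1]
latency_ms = [1, 1, 1, 1, 5, 1, 1, 6, 1, 1, 1, 7]

print(direction(queue_depth, latency_ms))  # a_leads_b
```

In practice such a signal would be only one input, combined with log events and topology as described above, before a root cause is named.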