Capabilities of HEAL’s AI and ML Algorithms: How They Differ From Other Models

Jun 3, 2021

More and more monitoring and alerting solutions are adopting AI in some form to deliver alerts that help mitigate issues and highlight anomalies. However, even with AI widely available, enterprises struggle to keep up with rising infrastructure complexity and to achieve the holy grail of 100% uptime without sharply increasing investment. This blog aims to inform ITOps leaders about the challenges of existing AI models and how HEAL can address them.

HEAL’s patented AI/ML algorithms are well positioned to address the needs of today’s complex microservices-based architectures. With the help of state-of-the-art AI/ML, HEAL helps you predict and prevent incidents before they occur, thereby reducing outage time and operational costs. Faster, more accurate root cause analysis and a contextual alerting system improve team productivity, letting you focus on what is truly important. HEAL’s out-of-the-box integrations with monitoring tools and ITSM platforms mean you can move to the preventive paradigm without compromising existing investments.

Here’s a breakdown of HEAL’s ML capabilities and what they mean for the end user:

AI Enabled Preventive Healing

Problem we are trying to solve:
Most existing AI-based systems perform a post-facto analysis of a problem; that is, they take action only after the problem or outage has already happened. The remaining ones have some predictive capabilities, but they fall short in two main ways. First, they generate so many alerts that the real ones are buried in a flood of false alerts. Second, the predicted outliers are often not interpretable, which makes it difficult to take any preventive action based on them.

What HEAL does:
HEAL takes a proactive approach: it tries to prevent a problem from happening rather than reacting after it has happened. With the help of state-of-the-art ML models, HEAL can accurately predict a problem well before it happens and provide useful insights such as the probable root cause, similarity to past incidents, and the impact radius.

Our prediction system comprises both multivariate and univariate time series forecasting models, which capture complex relationships and trends within the metric data. With these models, the system can predict outliers at a very early stage. Once outliers are predicted, our causal inferencing system correlates them using logs and topology and chains them into related events. This chain, or sequence, is then matched against known problems or incidents, and an early warning is generated.
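The pipeline described above (forecast each metric, flag predicted outliers, chain them through topology) can be illustrated with a minimal sketch. This is not HEAL's implementation: the linear-trend forecaster is a simple stand-in for the forecasting models, and the service names, the `limit` value, and the helper names `linear_forecast`, `predicted_breach`, and `chain_events` are all hypothetical.

```python
from statistics import mean

def linear_forecast(history, horizon):
    """Fit y = a + b*t by least squares and extrapolate `horizon` steps ahead."""
    n = len(history)
    t_bar, y_bar = (n - 1) / 2, mean(history)
    s_ty = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(history))
    s_tt = sum((t - t_bar) ** 2 for t in range(n))
    slope = s_ty / s_tt
    intercept = y_bar - slope * t_bar
    return [intercept + slope * (n + h) for h in range(horizon)]

def predicted_breach(history, limit, horizon):
    """Steps ahead at which the forecast first exceeds `limit`, or None."""
    for step, value in enumerate(linear_forecast(history, horizon), start=1):
        if value > limit:
            return step
    return None

def chain_events(flagged, topology):
    """Group flagged services that are directly linked in the topology."""
    neighbours = {service: set() for service in flagged}
    for src, dsts in topology.items():
        for dst in dsts:
            if src in neighbours and dst in neighbours:
                neighbours[src].add(dst)
                neighbours[dst].add(src)
    chains, seen = [], set()
    for start in sorted(flagged):
        if start in seen:
            continue
        stack, chain = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            chain.append(node)
            stack.extend(neighbours[node] - seen)
        chains.append(sorted(chain))
    return chains

# Hypothetical utilisation histories (percent) and service dependencies.
metrics = {
    "web": [60, 64, 68, 72, 76, 80],
    "db": [70, 72, 74, 76, 78, 80],
    "cache": [30, 30, 30, 30, 30, 30],
    "queue": [85, 86, 87, 88, 89, 90],
}
topology = {"web": ["db"], "db": ["cache"], "queue": []}

breaches = {name: predicted_breach(series, limit=90.5, horizon=10)
            for name, series in metrics.items()}
flagged = {name for name, step in breaches.items() if step is not None}
chains = chain_events(flagged, topology)
```

In this toy data, `web` and `db` are both forecast to breach and are linked in the topology, so they end up in one chain, while `queue` forms its own. A real system would then match each chain against known incident signatures before raising an early warning.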

Additionally, HEAL has a continuous feedback mechanism that learns from past misses (cases where an alert was not raised) by incorporating information from incident data, reducing the number of recurring problems. This ensures that the recall, or coverage, of the system improves consistently over time, so fewer and fewer outages occur due to false negatives.

The use of both multivariate and univariate analysis provides the right amount of coverage and interpretability to the early warning system.

Adaptive Learning Based Alerting System

Problem we are trying to solve:
Existing tools and technologies need a considerable amount of training data to provide meaningful AI insights. Since time series data may have seasonality and long-term trends, weeks of data may be needed to train deep learning (DL) models. Additionally, these tools may need manual intervention from time to time to set the right thresholds.

What HEAL does:
We use a combination of ML techniques to train model weights on historical datasets and then reuse the trained weights directly for the target dataset. This enables us to run inference on the target dataset the moment data is ingested, without explicitly training on that data and without any manual intervention. To select the correct model, we use NLP-based techniques to map the target metric to a similar metric we have seen in the past, based on input features such as business vertical, metric name, and description. We then adopt an adaptive learning approach that uses self-supervised learning to fine-tune the model weights over time as more data is ingested, so that model predictions improve steadily. The system also stabilizes much faster, because only a few parameters need to be trained, in contrast to models that must be fully trained.
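As a rough illustration of the model-selection and fine-tuning steps above, the sketch below matches a new metric to the closest previously seen metric and then adjusts a single parameter as data arrives. It makes several assumptions: token-overlap (Jaccard) similarity is a simple stand-in for the NLP technique, which the post does not specify, and the catalog entries, weight names, and functions `pick_pretrained` and `fine_tune_offset` are hypothetical.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens; underscores and dots act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pick_pretrained(target, catalog):
    """Return the catalog key whose descriptive features best match `target`."""
    target_tokens = tokens(" ".join(target.values()))
    def score(name):
        return jaccard(target_tokens, tokens(" ".join(catalog[name]["features"].values())))
    return max(catalog, key=score)

def fine_tune_offset(weights, observed, predicted, rate=0.1):
    """Nudge only the offset toward the latest error; other weights stay frozen."""
    updated = dict(weights)
    updated["offset"] += rate * (observed - predicted)
    return updated

# Hypothetical catalog of models trained on historical datasets.
catalog = {
    "cpu_model": {
        "features": {"vertical": "banking", "name": "cpu_util_pct",
                     "description": "host cpu utilization percent"},
        "weights": {"scale": 1.0, "offset": 0.0},
    },
    "latency_model": {
        "features": {"vertical": "retail", "name": "checkout_latency_ms",
                     "description": "checkout api response latency"},
        "weights": {"scale": 1.0, "offset": 0.0},
    },
}
target = {"vertical": "banking", "name": "cpu.usage",
          "description": "cpu utilization of payment host"}

chosen = pick_pretrained(target, catalog)
weights = fine_tune_offset(catalog[chosen]["weights"], observed=12.0, predicted=10.0, rate=0.5)
```

Because only `offset` is updated, each fine-tuning step is cheap, which mirrors the point that training a few parameters stabilizes faster than training a full model.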

Workload Behavior Correlation

Problem we are trying to solve:
ML insights typically rely on historical data and static thresholds, and they do not factor in disproportionate workload. Also, most ML-based systems perform poorly when extrapolating, so their forecasts are of limited use.

What HEAL does:
Workload-behavior correlation is at the heart of our AI based insights for capacity planning and forecasting. The primary insight is that the behavior KPIs are dependent on the transaction or workload volumes. Our ensemble ML models are able to learn the complex non-linear relationships between workload signature and behavior metrics in addition to standard time series features like non-stationarity, seasonality and long-term trends. This results in more accurate prediction and forecasting.
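To make the workload-behavior idea concrete, here is a minimal sketch that judges a behavior KPI relative to the workload that produced it rather than against a static threshold. A straight-line fit is used purely for brevity; it is an assumption standing in for the non-linear ensemble models mentioned above, and the sample data and function names are hypothetical.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def expected_kpi(model, workload):
    """KPI value the fitted model expects at the given workload."""
    intercept, slope = model
    return intercept + slope * workload

def is_anomalous(model, workload, kpi, tolerance):
    """True when the KPI deviates from its workload-implied value by more than tolerance."""
    return abs(kpi - expected_kpi(model, workload)) > tolerance

# Hypothetical history: transactions per second and CPU utilisation (percent).
tps = [100, 200, 300, 400, 500]
cpu = [22, 31, 39, 52, 61]

model = fit_line(tps, cpu)
residuals = [y - expected_kpi(model, x) for x, y in zip(tps, cpu)]
tolerance = 3 * pstdev(residuals)

busy_but_normal = is_anomalous(model, workload=500, kpi=60, tolerance=tolerance)
idle_but_hot = is_anomalous(model, workload=150, kpi=60, tolerance=tolerance)
```

A static threshold of, say, 55% CPU would flag both readings. Conditioning on workload flags only the second one, where 60% CPU occurs at a transaction volume that normally needs far less.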

Another key differentiator of HEAL is the ability to do better extrapolation. Most ML models are good at interpolation, but they do not perform so well when the test data lie in a different range compared to the training data. By designing ML models which have good extrapolation capabilities we ensure that our models can provide better insights on future chokepoints and provisioning requirements assuming hypothetical increases in workload.
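The difference between interpolation and extrapolation can be seen in a small sketch. It compares a nearest-neighbour predictor, which can only replay values seen in training, with a parametric straight-line fit that carries the learned trend beyond the training range. Both models, the data, and the `workload_at_limit` helper are illustrative assumptions, not a description of HEAL's models.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def nearest_neighbour(xs, ys, x):
    """Predict with the KPI of the closest training workload (interpolation only)."""
    index = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[index]

def workload_at_limit(model, limit):
    """Workload at which the fitted line reaches `limit` (a what-if chokepoint)."""
    intercept, slope = model
    return (limit - intercept) / slope

# Hypothetical training data: workload (tps) between 100 and 500, KPI in percent.
tps = [100, 200, 300, 400, 500]
kpi = [20, 30, 40, 50, 60]

intercept, slope = fit_line(tps, kpi)
what_if_tps = 1000  # hypothetical workload, twice the largest value seen in training

linear_prediction = intercept + slope * what_if_tps
neighbour_prediction = nearest_neighbour(tps, kpi, what_if_tps)
chokepoint = workload_at_limit((intercept, slope), limit=90)
```

The nearest-neighbour model saturates at the largest KPI it has seen, so it would report ample headroom at any hypothetical workload. The parametric model projects the trend forward and yields an estimate of the workload at which the KPI would reach its limit, which is the kind of answer capacity planning needs.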

Faster and More Accurate Causation and Forensics

Problem we are trying to solve:
One of the main problems with traditional AI based systems is that either alerts are not interpretable and so proper RCA cannot be done; or the alerts are interpretable, but the RCA does not consider full stack information, so the forensic analysis is not complete.

What HEAL does:
We use causal inference techniques to model one-way and two-way relationships between metric variables. These models are tuned through a feedback loop that takes into account feedback from users and incident data and adapts dynamically to changing environments. However, low-level SLA metrics alone are often not sufficient for full stack observability, so root cause analysis based only on them is partial. This is because some information available in logs is invisible in the metric data, and vice versa. We therefore use advanced techniques to correlate useful log information with the metric data and topology (both vertical and horizontal), and we additionally use our knowledge of workload-behavior characteristics to determine the root cause and use it for forensic analysis.
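One simple way to picture one-way versus two-way relationships between metrics is lagged correlation: if past values of metric A track later values of metric B but not the other way round, the relationship looks one-way. The sketch below is only a heuristic under that assumption, not a full causal inference method and not HEAL's algorithm; the threshold, lag, and sample series are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        return 0.0
    return s_xy / sqrt(s_xx * s_yy)

def lagged_corr(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag], for lag >= 1."""
    return pearson(leader[:len(leader) - lag], follower[lag:])

def direction(a, b, lag=1, threshold=0.8):
    """Classify the lagged relationship between two metric series."""
    a_leads = lagged_corr(a, b, lag) >= threshold
    b_leads = lagged_corr(b, a, lag) >= threshold
    if a_leads and b_leads:
        return "two-way"
    if a_leads:
        return "a->b"
    if b_leads:
        return "b->a"
    return "none"

# Hypothetical series: queue_depth repeats request_rate one step later.
request_rate = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
queue_depth = [0] + request_rate[:-1]

relation = direction(request_rate, queue_depth)
```

Series that share a strong trend correlate at almost any lag, so a practical system would detrend the data first and, as the paragraph above notes, corroborate the result with logs and topology.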