How Heal’s Machine Learning Engine Supports Preventive Self-Healing

Feb 26, 2020

Application Behavior Learning and Automatic Baselining Features

Heal’s Machine Learning Engine (MLE) is the heart of our product’s self-healing capabilities. At its core is our patented technique for workload-behavior correlation, which rests on the premise that a system’s behavior is a direct function of the workload it is subjected to. Here, workload is the umbrella term for all user transactions (their volume, composition and payload), business logic, background processes, and the read/write activity occurring on the system.

Most other AIOps tools take a break-and-fix approach to issues. They use AI/ML to detect and flag incidents, correlate seemingly unrelated events across monitoring silos, and suggest candidate root causes. However, any remedial action comes after the fact; such tools can only minimize downtime, not eliminate it.

Heal’s MLE goes a step further, aiming to prevent downtime by predicting anomalies and thereby avoiding an outage altogether. The workload-behavior correlation learned by the MLE is instrumental in determining which workload mixes are likely to impact system behavior, so it can warn early and initiate a healing action, which could include, but is not limited to, rate limiting, transaction throttling, workload optimization, or deferring discretionary/background processes. Another outcome of workload-behavior correlation is capacity forecasting, i.e. generating scaling recommendation reports based on the projected workload on an application. This helps businesses plan ahead and scale proactively.

How Does MLE Correlate Workload and Behavior?

Baselining Phase
Any process involving AI/ML has two phases: baselining (or learning) and real-time processing of streaming data. The learning phase creates models from a set of training data; these models are persisted and periodically refreshed through re-baselining. Streaming data is then processed in real time by applying these models, and the resulting insights are visualized or acted upon.

The MLE creates a variety of models during the training phase. By ingesting historical workload and behavior data, it learns how a system typically responds to specific workload signatures; the resulting correlations and baselines are then applied to real-time data to generate early warning signals and anomalies.

The baselining process creates an ensemble of models, including:

  • Clustering (unsupervised learning)
    • We use a hierarchical clustering approach that groups workload data, such as transaction volumes, response times, and workload KPI values (e.g., Bytes Received on the DB layer), into clusters of similar workloads.
    • These clusters are then mapped to the behavior KPI values observed at the same timestamps. For example, if the load at 11 AM is classified into Cluster 1 and the corresponding CPU range is learned as 13%-27%, then a similar workload volume and composition should again fall into Cluster 1 with CPU usage in the 13%-27% band. An observed value outside that learned range is treated as anomalous. (A minimal sketch of this idea appears after this list.)

  • Distribution analysis: This is used to observe the typical operating range of each KPI individually. Time-series analysis is also applied in tandem, via the Prophet algorithm, to establish seasonality in workload and behavior KPI trends. Together these yield what is known as a dynamic NOR (normal operating range) for a KPI: a dynamic threshold band within which the KPI typically operates. (A Prophet-based NOR sketch also follows this list.)

  • Linear regression (supervised learning): The third method of establishing workload-behavior correlation is to compute the relationship between workload and a behavior KPI via regression, i.e. equations of the form 100t1 + 200t2 + … + 60t25 = 80, where 80 is the CPU value observed when the transaction mix contains 100 transactions of type t1, 200 of type t2, and so on up to 60 of type t25. Similar equations are learned over time for all workload mixes and KPIs.
    • KPIs that are not conducive to a linear equation like the one above are handled with classification techniques instead. Either way, this lets us predict the KPI value for a given workload mix, which is what drives capacity forecasting. (The real-time sketch later in this post shows a regression model of this kind being applied.)
      When we are unable to capture workload transaction volumes and response times, certain component KPIs can themselves serve as workload (load) KPIs, such as the busy-workers metric on a web server, the number of active threads on an application server, or the bytes received on a database server. This is configured by tagging KPIs through an administrative console; all out-of-the-box KPIs ship with default (editable) tags.
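
To make the clustering step concrete, here is a minimal sketch, assuming scikit-learn’s AgglomerativeClustering (a hierarchical method) and a handful of made-up workload features; it is an illustration of the idea, not Heal’s actual implementation.

```python
# Minimal sketch of workload clustering and cluster-to-behavior mapping.
# The feature set, cluster count, and use of scikit-learn are assumptions
# for illustration; they are not Heal's internal implementation.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row is one time window: [transaction volume, avg response time (ms),
#                               bytes received on the DB layer]
workload = np.array([
    [2000, 120, 5.0e6],
    [2100, 130, 5.2e6],
    [4000, 150, 9.8e6],
    [3900, 160, 9.5e6],
    [ 500,  90, 1.1e6],
    [ 520,  95, 1.2e6],
])
cpu_pct = np.array([14, 16, 42, 44, 7, 8])  # behavior KPI observed in the same windows

# Hierarchical (agglomerative) clustering of the workload vectors
labels = AgglomerativeClustering(n_clusters=3).fit_predict(workload)

# Map each workload cluster to the normal range of the behavior KPI
cluster_ranges = {
    c: (cpu_pct[labels == c].min(), cpu_pct[labels == c].max())
    for c in np.unique(labels)
}
print(cluster_ranges)  # e.g. {0: (42, 44), 1: (14, 16), 2: (7, 8)}
```

In practice one would standardize the features and learn percentile bands rather than a hard min/max, but the mapping from workload clusters to behavior-KPI ranges is the idea that matters here.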

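Similarly, the dynamic NOR can be sketched with the Prophet library referenced above: fit a seasonal model to a KPI’s history and treat its prediction interval as the normal operating band at each timestamp. The synthetic data, column names, and 95% interval width below are assumptions for the illustration.

```python
# Sketch of a dynamic NOR: Prophet's prediction interval used as the
# normal operating band for one KPI. The synthetic CPU series and the
# 95% interval width are assumptions for this example.
import numpy as np
import pandas as pd
from prophet import Prophet  # packaged as fbprophet in older releases

# Two weeks of hourly CPU readings with a daily cycle (synthetic data)
hours = pd.date_range("2020-01-01", periods=24 * 14, freq="H")
cpu = 20 + 10 * np.sin(2 * np.pi * hours.hour / 24) + np.random.normal(0, 2, len(hours))
history = pd.DataFrame({"ds": hours, "y": cpu})

model = Prophet(interval_width=0.95)  # seasonality is learned automatically
model.fit(history)

# Forecast the next 24 hours; yhat_lower/yhat_upper form the dynamic NOR band
future = model.make_future_dataframe(periods=24, freq="H")
nor_band = model.predict(future)[["ds", "yhat_lower", "yhat_upper"]].tail(24)

def violates_nor(timestamp, observed, band=nor_band):
    """True if the observed KPI value falls outside its dynamic NOR."""
    row = band.loc[band["ds"] == timestamp].iloc[0]
    return not (row["yhat_lower"] <= observed <= row["yhat_upper"])
```

Because the band follows the learned seasonality, the NOR moves with the time of day and week rather than being a fixed threshold.
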
Real-time Processing of Streaming Data
In the real-time phase, streaming data is ingested via our ingestion pipeline. This includes workload and behavior data from our own native agents as well as data arriving through ingestion APIs and third-party APIs.

  • We first examine an incoming workload to see whether it fits into any of the predefined clusters.
  • If it does not, the workload is not one typically seen at this time, and we use the workload-behavior prediction model to predict the behavior metric values expected for it.
    • If the observed metric falls within the predicted range, no anomaly is raised; otherwise we raise a behavior anomaly.
  • If the workload does fall into a predefined load cluster, we examine the corresponding KPI values mapped to that cluster during baselining. If the behavior falls within the learned range, no anomaly is raised; otherwise we perform a NOR check.
  • The NOR check compares every individual workload and behavior KPI against the NOR ranges and time-series models calculated during baselining. If a KPI violates its NOR, a corresponding anomaly on the workload or behavior KPI is raised.
  • A final check is on workload-behavior prediction. If a workload mix is an outlier with respect to clustering but falls within the NOR, we apply the regression models to predict the corresponding KPI values. If the KPI falls within the predicted range, no anomaly is raised; otherwise one is. (A simplified sketch of this decision flow follows.)
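
Taken together, these checks amount to a small decision procedure. The sketch below is one way to express it, assuming the artifacts produced during baselining are available: a function that assigns an incoming workload to a learned cluster (or None for an outlier), the per-cluster behavior ranges, a NOR check like the one sketched earlier, and a fitted regression model (for example scikit-learn’s LinearRegression over transaction-mix vectors). All names and the 10% tolerance are illustrative, not Heal’s internal API.

```python
# Illustrative real-time check for one streaming sample, assuming baselining
# has produced: assign_cluster() (returns a cluster id or None for outliers),
# cluster_ranges {cluster: (low, high)}, a fitted regression model mapping a
# workload mix to a KPI value, and a violates_nor() check as sketched earlier.
# Function names and the 10% tolerance are assumptions for this sketch.

def check_sample(workload_vec, observed_kpi, timestamp,
                 assign_cluster, cluster_ranges, regressor, violates_nor,
                 tolerance=0.10):
    """Return a list of anomaly flags for one incoming workload/behavior sample."""
    anomalies = []
    cluster = assign_cluster(workload_vec)

    if cluster is None:
        # Workload not typically seen at this time: fall back to the
        # workload-behavior prediction model
        predicted = regressor.predict([workload_vec])[0]
        if abs(observed_kpi - predicted) > tolerance * predicted:
            anomalies.append("behavior anomaly (outside predicted range)")
    else:
        low, high = cluster_ranges[cluster]
        if not (low <= observed_kpi <= high):
            # Behavior outside the range learned for this workload cluster:
            # fall through to the individual NOR / time-series check
            if violates_nor(timestamp, observed_kpi):
                anomalies.append("behavior anomaly (NOR violation)")

    return anomalies
```

Here assign_cluster stands in for whatever nearest-cluster test is used in practice, for example distance to the centroids learned during baselining with a cutoff that marks outliers, since agglomerative clustering itself does not assign new points.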

An example of this is a scenario at a bank at non-peak hours where the expected workload is roughly 2,000 transactions per second (TPS) but around 4,000 TPS is being observed instead. For such a high workload, the CPU is expected to spike correspondingly. Other tools would blindly raise an anomaly because CPU is spiking beyond what is normally observed at this time; Heal, on the other hand, recognizes that simple time-series analysis would produce a false positive here. Since the workload is high, the CPU is expected to operate in a higher band, and workload-behavior correlation extrapolates that CPU for this load should typically be within 40-45%. If it is, no anomaly is raised. If it exceeds this dynamically determined threshold, an anomaly is raised on system behavior.
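
Plugging the bank scenario into the sketch above, with purely hypothetical numbers, the intended outcome looks like this:

```python
# Hypothetical values from the bank example: ~4,000 TPS at an off-peak hour,
# for which the learned workload-behavior correlation projects a 40-45% CPU band.
predicted_low, predicted_high = 40, 45
observed_cpu = 43

if predicted_low <= observed_cpu <= predicted_high:
    print("CPU is high, but consistent with the workload: no anomaly")
else:
    print("CPU deviates from the workload-adjusted band: raise a behavior anomaly")
```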

Essentially, we triangulate to determine whether (A) the behavior is anomalous when taken independently of workload, via automatic baselining or the NOR check; (B) the workload is anomalous for the time at which we are examining it (NOR plus time-series check); and (C) the behavior is anomalous with respect to the workload, via clustering and regression.

In conclusion, the MLE uses its patented techniques to bring “Predict and Prevent” capabilities to Heal. Our algorithms and models are constantly tested and refined so that we can catch every anomalous event in your service ecosystem, working toward the goal of zero false positives and zero false negatives, i.e. 100% precision and recall. With regular breakthroughs in time-series and clustering techniques, we hope to reach the goal of 100% uptime very soon as well! So here’s to a zero-incident enterprise, the Heal way!