Evolving from AIOps to Self-Healing, the Appnomic Way

Jan 22, 2020

AIOps and Cognitive Operations are much used (or dare we say, overused) terms in the APM world. AIOps was first introduced by Gartner to refer to the set of APM tools that use AI to perform a specific set of functions in the APM universe. These could include raising alerts on anomalous infrastructure and network behavior, correlating events across silos to identify a root cause, and extracting information from unstructured data sources like server logs and changelogs.

Forrester introduced a similar term – Cognitive Operations – to refer to tools that had evolved beyond basic APM to IT, Network, and Infrastructure Operations using ML techniques. Although this term did not take off the way AIOps did, it was used to describe similar capabilities.

Today, many erstwhile pure-play APM products have reinvented themselves as AIOps or Cognitive Operations tools. Yet vendors tend to use the term AI very loosely, confusing customers. Behind the scenes, these tools rely solely on processed event data, or a post-facto correlation of seemingly disparate events, to “intelligently” raise performance alerts and stitch together a root cause. Most, if not all, of them use a combination of business rules, dynamic thresholds inferred through machine learning techniques, and temporal analysis of structured and unstructured data gathered from various monitored silos to raise what they refer to as “proactive alerts”.

Yet, these questions remain. 

  • If the alerts are indeed proactive, why do incidents still occur? 
  • Why is MTTR (Mean Time to Resolve) still the primary metric used to gauge the tool’s success?  
  • Why are these tools unable to predict how an issue in one silo is likely to affect another silo, another service, or the entire service/application ecosystem? (Incidentally, many of these services may span multiple deployment environments, including on-premise, cloud, hybrid, and edge.)   

True intelligence lies not in gathering data to fix an issue, but in predicting an impending issue and preventing it from occurring. While AIOps tools rely heavily on Big Data and associated methodologies to infer incidents and root cause, here are some pitfalls they encounter: 

  • Big Data cannot predict unusual events and is blind to zero-day events, i.e., issues that have not occurred before. 
  • Big Data based machine learning is inept at handling incorrect information. For example, if the clock on one system is two minutes off, then the timestamps in all the log files from that system are skewed as well, contaminating the temporal analysis that informs root cause analysis (see the sketch after this list). 
  • Big Data based learning cannot deal with imputation, i.e., filling in missing data points. For instance, assume ten systems are involved in a cascading failure that spans applications, infrastructure, storage, and network, with the root cause affecting the first system, which affects some part of the stack in the next system, in turn creating a chain of failures. If we had log and monitoring data for all the systems, then Big Data enabled machine learning would work fine. But assume one of the systems in the middle of the sequence was not generating any data. In this situation, Big Data cannot fill in the blanks and may miss the causal connections altogether. 
  • A challenge for humans and machine learning-based software alike arises when a problem has two or more independent causes. A straightforward “what changed” analysis is unlikely to uncover either of them. 
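
To make the clock-skew pitfall concrete, here is a minimal sketch (with hypothetical systems, events, and timestamps) of how a two-minute skew on one machine can make a symptom appear to precede its root cause when the analysis trusts the timestamps embedded in the logs:

```python
from datetime import datetime, timedelta

# Illustrative sketch only: hypothetical systems "A" and "B".
# System B's clock runs two minutes fast, so its log timestamps are skewed.
CLOCK_SKEW = {"A": timedelta(0), "B": timedelta(minutes=2)}

events = [
    # (system, true time of event, description)
    ("B", datetime(2020, 1, 22, 10, 0, 0),  "connection pool exhausted"),     # actual root cause
    ("A", datetime(2020, 1, 22, 10, 0, 30), "downstream request timed out"),  # consequence, 30s later
]

def logged_time(system, true_time):
    """Timestamp as it appears in that system's log file, including its clock skew."""
    return true_time + CLOCK_SKEW[system]

# A temporal analysis that trusts the embedded timestamps sorts the symptom on A
# *before* the root cause on B, inverting the causal chain.
for system, true_time, description in sorted(events, key=lambda e: logged_time(e[0], e[1])):
    print(f"{logged_time(system, true_time):%H:%M:%S}  {system}  {description}")
# 10:00:30  A  downstream request timed out
# 10:02:00  B  connection pool exhausted
```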

This is where self-healing comes in. Our aim with self-healing is to shift from a “break and fix” paradigm to “predict and prevent” (see our November 2019 blog). By employing streaming analytics on real-time data and applying a patented technique called workload-behavior correlation, Appnomic aims to highlight anomalies, called early warning signals, which might lead to an issue. The technique relies heavily on the hypothesis that the behavior of a system is a direct function of the workload that arrives on it at any point in time. An abnormal workload that deviates from the norm in terms of volume, composition, and/or payload will result in abnormal behavior as well.
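
As an illustration of the workload-deviation idea (a minimal sketch, not Appnomic’s actual implementation), the snippet below learns a per-transaction-type volume baseline over a sliding window and flags an early warning signal when the current workload mix deviates from it. The class name, window size, and threshold are illustrative assumptions:

```python
import math
from collections import deque

class WorkloadBaseline:
    """Sketch: learn per-transaction-type volume statistics over a sliding window
    and flag an early warning signal when the current workload mix deviates."""

    def __init__(self, window=60, z_threshold=3.0):
        self.window = window            # number of past intervals to learn from
        self.z_threshold = z_threshold  # how many standard deviations counts as "off"
        self.history = {}               # transaction type -> deque of recent volumes

    def observe(self, workload):
        """workload: dict of transaction type -> volume for the current interval."""
        warnings = []
        for txn_type, volume in workload.items():
            past = self.history.setdefault(txn_type, deque(maxlen=self.window))
            if len(past) >= 10:  # need some history before judging deviation
                mean = sum(past) / len(past)
                variance = sum((v - mean) ** 2 for v in past) / len(past)
                std = math.sqrt(variance) or 1.0
                z_score = (volume - mean) / std
                if abs(z_score) > self.z_threshold:
                    warnings.append((txn_type, volume, round(z_score, 1)))
            past.append(volume)
        return warnings  # early warning signals: workload components that are "off"

baseline = WorkloadBaseline()
for minute in range(30):
    baseline.observe({"login": 100, "payment": 40})      # normal workload mix
print(baseline.observe({"login": 100, "payment": 400}))  # abnormal payment surge is flagged
```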

A corollary is that these workload patterns, along with seasonal trends and the corresponding behavior at the time, can be learned by a system. This becomes the basis for predicting that an issue is likely to occur when a workload mix is “off” and puts the system under stress. True prediction gives you the time to put in place the actions required to prevent the issue. These actions – called healing actions – could include spinning up additional container instances to handle the increasing workload, throttling incoming transactions at the load balancer layer, tweaking certain environment variables at runtime, etc. – all configurable through an Action API.
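
As a purely hypothetical sketch of how an early warning signal might trigger configured healing actions, the snippet below posts an action request to a placeholder endpoint. The URL, payload fields, and action names are assumptions for illustration, not the actual Action API contract:

```python
import json
import urllib.request

ACTION_API_URL = "https://appsone.example.com/api/v1/actions"  # placeholder endpoint, not the real API

def trigger_healing_action(action, target, parameters):
    """Submit a configured healing action (e.g. scale out, throttle) for execution."""
    payload = {"action": action, "target": target, "parameters": parameters}
    request = urllib.request.Request(
        ACTION_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (requires a live endpoint): a payment surge maps to two healing actions.
trigger_healing_action("scale_out_containers", "payment-service", {"additional_instances": 2})
trigger_healing_action("throttle_transactions", "edge-load-balancer", {"max_tps": 500})
```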

This is not all. Appnomic also employs event correlation via real-time data processing pipelines, which means that even if your services are running in different time zones or environments, the temporal aspect of the event correlation relies not on the timestamp within the data but on the recency of the data in the processing queue, leading to root cause analysis that is more holistic and accurate. Workload and behavior data are supplemented with log file snippets, configuration changes, deep-dive code snapshots, and just-in-time diagnostic data, which together give a 360-degree view of the system’s state at the time of the blip.
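
The recency-based correlation idea can be sketched as follows, grouping events by when they arrive in the processing queue (the pipeline’s own clock) rather than by the timestamps embedded in the events. The class name and window size are illustrative, not the actual implementation:

```python
import time
from collections import deque

class RecencyCorrelator:
    """Sketch: correlate events by arrival recency in the processing queue,
    so clock skew across zones or environments does not distort the window."""

    def __init__(self, window_seconds=30):
        self.window_seconds = window_seconds
        self.recent = deque()  # (arrival_time, event) pairs

    def ingest(self, event):
        now = time.monotonic()  # pipeline-side clock, independent of the event's own timestamp
        # Drop events that have fallen out of the correlation window.
        while self.recent and now - self.recent[0][0] > self.window_seconds:
            self.recent.popleft()
        correlated = [e for _, e in self.recent]  # events considered related to this one
        self.recent.append((now, event))
        return correlated

correlator = RecencyCorrelator(window_seconds=30)
correlator.ingest({"source": "db-node-1", "timestamp": "2020-01-22T10:02:00Z", "msg": "lock wait spike"})
related = correlator.ingest({"source": "app-node-3", "timestamp": "2020-01-22T09:59:58Z", "msg": "latency spike"})
print(related)  # correlated despite the embedded timestamps disagreeing by minutes
```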

So what are you waiting for? Move from AIOps to self-healing now, remove “MTTR” from your vocabulary, and focus instead on the number of incidents you successfully averted at your enterprise!