Heal Data Architecture

by Vamsi Vedula | Mar 4, 2020


Using the Lambda and Kappa data flows in our Machine Learning Engine

Quick Recap on MLE
Heal’s Machine Learning Engine (MLE) is the heart of the self-healing capabilities of our product. Its main capability is our patented technique for workload-behavior correlation, which is based on the premise that a system’s behavior is a direct function of the workload that it is subjected to. Workload is the umbrella term for all user transactions (including their volume, composition and payload), business logic, background processes and read/write activity occurring on the system.

MLE can facilitate self-healing by predicting anomalies so that an outage can be avoided altogether. The workload-behavior correlation learned by MLE is instrumental in determining which workload mixes are likely to impact system behavior, and in warning early enough to initiate a healing action. Healing actions could include, but are not limited to, rate limiting, transaction throttling, workload optimization or deferring discretionary/background processes.
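To make the premise concrete, here is a minimal sketch in Python. It is not Heal's patented technique: it assumes, purely for illustration, that one behavior metric (say, CPU utilization) depends linearly on one workload metric (say, transactions per minute), and it flags observations that stray from what the workload would predict. The function names and the tolerance value are hypothetical.

```python
def fit_line(workload, behavior):
    """Least-squares fit of behavior = slope * workload + intercept."""
    n = len(workload)
    mean_x = sum(workload) / n
    mean_y = sum(behavior) / n
    sxx = sum((x - mean_x) ** 2 for x in workload)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(workload, behavior))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_behavior(model, workload_value):
    """Behavior the learned correlation predicts for a given workload."""
    slope, intercept = model
    return slope * workload_value + intercept


def deviates(model, workload_value, observed, tolerance):
    """True when observed behavior is further than `tolerance` from the prediction."""
    return abs(observed - expected_behavior(model, workload_value)) > tolerance
```

With a model fitted on historical pairs, a reading that is high but explained by a high workload is not flagged, while the same reading under a light workload would be.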

How does our data architecture drive baselining and streaming analytics?

Baselining vs. Streaming Phase
There are two phases to any process involving AI/ML – baselining or learning, and real-time processing of streaming data. The learning phase involves the creation of models on a set of training data. These models are persisted and periodically updated through re-baselining. Streaming data gets processed in real-time by applying these models to it, and the insights thus derived are further visualized or acted upon.
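The two phases can be sketched as follows. This is an illustrative example only, assuming the simplest possible "model" (a mean and standard deviation) and a z-score threshold; the names `Baseline`, `learn_baseline` and `score_stream`, and the use of JSON for persistence, are assumptions and not part of MLE.

```python
import json
import statistics
from dataclasses import dataclass, asdict


@dataclass
class Baseline:
    mean: float
    stdev: float


def learn_baseline(training_values):
    """Baselining phase: build a model from a set of training data."""
    return Baseline(mean=statistics.fmean(training_values),
                    stdev=statistics.pstdev(training_values))


def save_baseline(baseline):
    """Persist the model so it can be reloaded or replaced on re-baselining."""
    return json.dumps(asdict(baseline))


def load_baseline(payload):
    return Baseline(**json.loads(payload))


def score_stream(baseline, stream, threshold=3.0):
    """Streaming phase: apply the persisted model to each incoming value."""
    for value in stream:
        if baseline.stdev == 0:
            is_anomaly = value != baseline.mean
        else:
            is_anomaly = abs(value - baseline.mean) / baseline.stdev > threshold
        yield value, is_anomaly
```

Re-baselining in this sketch amounts to calling `learn_baseline` on fresher training data and saving the result over the old model.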

The Lambda vs. Kappa data flows in MLE
Lambda architecture is a data-processing architecture designed to handle large quantities of data by utilizing both batch and stream-processing methods. Batch processing provides comprehensive and accurate views of batch data, while real-time stream processing simultaneously provides views of online data. The two view outputs may be joined before presentation. The data model for Lambda architectures is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
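A toy version of the pattern, assuming events are `(timestamp, key, count)` tuples in an append-only list, might look like the sketch below. The function names and the cutoff convention are hypothetical; a real deployment would use separate batch and stream engines rather than two functions over one list.

```python
from collections import defaultdict


def batch_view(events, cutoff):
    """Batch layer: recompute totals from all events older than the cutoff."""
    totals = defaultdict(int)
    for timestamp, key, count in events:
        if timestamp < cutoff:
            totals[key] += count
    return dict(totals)


def speed_view(events, cutoff):
    """Speed layer: totals for recent events the batch view has not covered yet."""
    totals = defaultdict(int)
    for timestamp, key, count in events:
        if timestamp >= cutoff:
            totals[key] += count
    return dict(totals)


def merged_view(batch, speed):
    """Serving step: join the two views before presentation."""
    merged = dict(batch)
    for key, count in speed.items():
        merged[key] = merged.get(key, 0) + count
    return merged
```

Because events are only ever appended, the batch view can be rebuilt from scratch at any time, and the speed view only has to cover the window since the last batch run.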

Heal uses a NoSQL database and a time series database, which we collectively call the ExOps data lake, to store the workload and behavior data models that are used in the baselining/learning phase of MLE. The baselining data flows into MLE via a Query layer that consists of multiple data services. This data flow happens in batch, or non-real-time, mode.

Another scenario for non-real-time processing is the generation of reports. The Query layer can retrieve the required data from the ExOps data lake to generate on-demand and scheduled reports.

The Kappa architecture pattern is an alternative to Lambda wherein data is streamed through a computational system and fed into auxiliary stores for serving. Kappa is a simplification of Lambda: a Kappa system is like a Lambda system with the batch processing system removed. To replace batch processing, data is simply fed through the streaming system quickly.
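As a rough sketch of the idea, the example below uses a single processing function for both live and historical data. The in-memory `EventLog` class is a hypothetical stand-in for a durable, replayable stream, and the `(metric, value)` event shape is an assumption made for illustration.

```python
class EventLog:
    """Append-only, ordered log standing in for a durable stream."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read(self, offset=0):
        """Return all events from the given offset onward."""
        return list(self._events[offset:])

    def __len__(self):
        return len(self._events)


def process(log, offset=0, state=None):
    """The only processing path: fold (metric, value) events into running totals.

    Starting at offset 0 with no state replays history (the job a batch layer
    would otherwise do); starting at a later offset continues incrementally.
    """
    totals = dict(state or {})
    for metric, value in log.read(offset):
        totals[metric] = totals.get(metric, 0) + value
    return totals
```

The design point is that there is one code path to maintain: recomputing a view means replaying the log through the same `process` function rather than running a separate batch job.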

In Heal, the Kappa pattern is used while applying MLE models to real-time workload and behavior data, collectively known as performance data. This includes data from our own native agents as well as data coming in via ingestion APIs and third-party APIs through connectors.

Machine learning models are selected for various use cases by evaluating the results of multiple models on production data and validating those results through manual and automated labelling. Examples of model errors include invalid anomalies for a given transaction mix, early warnings that did not culminate in outages, and flow rates marked anomalous although they were within expected bounds for a given business-hour window. Models are packaged and associated with task queues in an asynchronous task processing framework, which specifies the expected sources and sinks of data for each model. These workflows can be invoked over REST APIs.
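The association between a packaged model, its task queue and its data source and sink could be sketched as below. This does not show the framework Heal actually uses: it substitutes Python's standard `queue` module for a real asynchronous task framework, and every name here (`ModelTask`, `register`, `submit`, `drain`, the queue and sink labels) is hypothetical.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable

REGISTRY = {}  # model name -> ModelTask
QUEUES = {}    # queue name -> Queue of (task, payload)


@dataclass(frozen=True)
class ModelTask:
    name: str    # packaged model identifier
    queue: str   # task queue the model is bound to
    source: str  # where the model expects its input from
    sink: str    # where the model's output should go
    run: Callable[[list], list]


def register(task):
    """Associate a packaged model with its task queue."""
    REGISTRY[task.name] = task
    QUEUES.setdefault(task.queue, Queue())


def submit(name, payload):
    """Enqueue work for a model; a REST endpoint could call this."""
    task = REGISTRY[name]
    QUEUES[task.queue].put((task, payload))


def drain(queue_name):
    """Worker loop: run each queued task and pair its result with its sink."""
    results = []
    pending = QUEUES[queue_name]
    while not pending.empty():
        task, payload = pending.get()
        results.append((task.sink, task.run(payload)))
    return results
```

Keeping the source and sink on the task definition, rather than inside the model code, is what lets the same worker loop serve any registered model.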

Real-time data flows into MLE via a processing pipeline and is processed in streaming mode to send signals to the event handler. Based on the type of signal, the event handler can initiate notifications, integrate with incident management tools or pump data into real-time dashboards. A special case of the event handler processing MLE output is the triggering of appropriate forensic, code/database deep-dive and healing actions through the Controller.
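The dispatch-by-signal-type behavior described above can be sketched as a simple routing table. The signal kinds (`early_warning`, `problem`) and the handler functions are hypothetical placeholders; they return strings here so the routing is easy to inspect, whereas real handlers would call out to notification, incident management, dashboard or Controller services.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str    # hypothetical signal type, e.g. "early_warning" or "problem"
    detail: str


def notify(signal):
    return f"notification sent: {signal.detail}"


def open_incident(signal):
    return f"incident opened: {signal.detail}"


def update_dashboard(signal):
    return f"dashboard updated: {signal.detail}"


def trigger_controller(signal):
    return f"controller action requested: {signal.detail}"


# Routing table: which actions each signal type fans out to (assumed mapping).
HANDLERS = {
    "early_warning": [notify, update_dashboard],
    "problem": [notify, open_incident, update_dashboard, trigger_controller],
}


def handle(signal):
    """Run every action registered for the signal's type, in order."""
    actions = HANDLERS.get(signal.kind, [update_dashboard])
    return [action(signal) for action in actions]
```

Unknown signal types fall back to a dashboard update in this sketch, which is one possible default rather than a documented behavior.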

Fig 1: The Lambda and Kappa patterns in Heal data architecture. The teal lines indicate Kappa and grey lines indicate Lambda data flows.

In our next blog of this series, we will see how Heal’s Action API is used to trigger forensic actions and healing actions in the context of a signal raised by MLE.