Heal Agentless Edition

by Vamsi Vedula | Apr 15, 2020

Using the Heal APIs to ingest your data into MLE, visualize and generate signals for orchestrated healing

If you have read our earlier blogs on self-healing, on the patented workload-behavior correlation techniques we bring to the table and on the early-warning signal generation capabilities of our MLE (Machine Learning Engine), one question you are likely to have is: “How do we get the benefits of Heal while protecting our investment in the APM, infrastructure and network monitoring, and orchestration tools we already have in place?” Today’s blog answers that question: with the Heal Agentless Edition, you can retain your existing monitoring tools and still move from “Break and Fix” to “Predict and Prevent”.

Heal as an API Based Product
Heal relies on a suite of APIs to:

- Ingest your enterprise’s topology and set up services and applications.
- Ingest performance data, i.e. workload, behavior, log and configuration data.
- Analyse data and generate signals, trigger healing and notifications.
- Visualize the outcome of MLE’s actions, whether they are signals, reports, application dashboards, workload dashboards, service metric graphs or custom dashboards.

Fig 1: Heal is an API based product. It relies on APIs to ingest, learn, heal and visualize

All the above objectives are achieved by the suite of APIs as outlined in the image below:

Fig 2: The Ingestion, Events, Topology and Query APIs and their role in making Heal an API based product which can be used in any monitored environment alongside an existing APM/ITOM tool

Setting up your Application Topology
For Heal to perform workload-behavior correlation accurately, it needs to know your enterprise’s topology. Heal’s configuration dashboard allows you to set up applications. Heal uses these application definitions to correlate workloads with the behavioral metrics of the respective downstream services and to initiate different types of healing actions. Early warning signals, problem reports and health reports are also collated according to the applications you have set up.

Heal provides a Topology API through which you can set up services, tag them according to the type of component they represent, and configure inter-service connections and applications (an application is simply a tag-based grouping of services that fulfil a common business objective). This data can be provided to Heal manually in the required JSON format or read from your enterprise’s CMDB (Configuration Management Database).

For example, the addApplication method of the Topology API allows you to add an application under an account. It is invoked as:

A sample JSON posted to this API in order to create an application called FlightBooking with 3 services, one each in the web, application and database tier, would look like this:
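
As an illustration only, a payload of this shape could be assembled with a few lines of Python. This is a minimal sketch: the field names (`name`, `services`, `tags`, `layer`) and the assignment of each service to a tier are assumptions made for the example, not the documented Topology API schema.

```python
import json


def build_application(name, services):
    """Return a dict describing an application and its tagged services.

    `services` is a list of (service_name, tier) pairs. The key names used
    here are hypothetical; consult the Topology API docs for the real schema.
    """
    return {
        "name": name,
        "services": [
            {"name": service_name, "tags": {"layer": tier}}
            for service_name, tier in services
        ],
    }


# One service per tier: web, application and database (assumed mapping).
flight_booking = build_application(
    "FlightBooking",
    [
        ("Travel Web", "web"),
        ("Flights", "application"),
        ("Flights_Inventory", "database"),
    ],
)

# Serialize to JSON, ready to be posted by whatever HTTP client you use.
payload = json.dumps(flight_booking, indent=2)
print(payload)
```

Keeping the payload construction separate from the HTTP call makes it easy to generate the same structure from a CMDB export instead of hand-written values.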

Inter-service connections can be set up by adding a connection to an existing service and specifying a destination service ID, which must also refer to a valid service.

A sample JSON connecting the Travel Web service to the Flights service, and the Flights service in turn to the Flights_Inventory service can be sent to the above endpoint as shown:
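
A rough sketch of how such connection records might be built is shown here. The key names (`sourceServiceId`, `destinationServiceId`) and the numeric service IDs are hypothetical; the only rule taken from the text is that both ends of a connection must be valid services.

```python
def build_connections(edges, service_ids):
    """Build connection records for (source, destination) service-name pairs.

    Raises ValueError if either end of an edge is not a known service.
    Key names are illustrative, not the documented schema.
    """
    connections = []
    for source, destination in edges:
        for name in (source, destination):
            if name not in service_ids:
                raise ValueError(f"unknown service: {name}")
        connections.append(
            {
                "sourceServiceId": service_ids[source],
                "destinationServiceId": service_ids[destination],
            }
        )
    return connections


# Hypothetical IDs, as if returned when the services were created.
service_ids = {"Travel Web": 1, "Flights": 2, "Flights_Inventory": 3}

connections = build_connections(
    [("Travel Web", "Flights"), ("Flights", "Flights_Inventory")],
    service_ids,
)
print(connections)
```

Validating both ends locally before sending anything keeps an invalid destination from reaching the API at all.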

The Topology API thus provides all information on the services, applications and inter-service dependencies within your enterprise. With this information at hand, Heal can now ingest performance data for your services, correlate it by application ID and perform application behaviour learning via MLE.

Ingesting Data for MLE
How Heal agents typically work
Heal Enterprise Edition has its own data collection mechanisms: agents that reside on the target servers and capture transaction and behavioral metric data, pattern analysers that extract transaction information from logs, and scripts that capture drift tracking (configuration change) data. A “super-agent” called the supervisor manages the other agents and receives triggers from Heal’s Agent Controller to collect forensics, code instrumentation and database deep-dive data. (Refer to our earlier blog on Heal Data Architecture for a refresher.)

However, Heal’s proprietary machine learning algorithms do not require the workload and behavior data to be captured by Heal’s own agents. Similarly, collection of forensic and deep-dive data is not a prerequisite for self-healing. It follows that Heal’s core capability of self-healing can be delivered to enterprises even without its agents in place.

How Heal Agentless Edition Works
The objective of Heal Agentless Edition is to deliver proactive, autonomous and projected healing in enterprises which already have other monitoring and orchestration tools in place. The data collected by third-party monitoring tools is captured via connectors and ingested through an Ingestion API and processing pipeline, which stores all captured workload and behavior data in our ExOps (Experience + Operations) repository in a Heal-compatible format. The Machine Learning Engine then picks up this data, either in real time to detect anomalies or in batch mode to learn baselines.

Fig 3: The Heal Agentless data ingestion process, indicated in bold black

The above figure shows how data is ingested in Heal Agentless Edition. Note the absence of the Agent Controller and native Heal agents that are part of Heal Enterprise Edition. Nodes A through X do not run a Heal agent. Instead, they run some other APM tool, and our custom connector interfaces with that tool to obtain performance data, both workload and behavior, in a format compatible with the Ingestion API. The connector then sends this data over the network to Heal. There are also custom connectors to data sources or repositories that store performance data collected by other APM tools, network monitoring tools, log processors and so on.

Once the Ingestor has validated and pre-processed the data, it sends the data over a real-time streaming pipeline to both the ExOps data repository and the Machine Learning Engine, which derives insights in real time.
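
The validate-then-fan-out step can be pictured with a small Python sketch. It is a conceptual illustration, not Heal’s implementation: the `ingest` function, the record fields and the two in-memory lists standing in for the ExOps repository and the MLE stream are all assumptions.

```python
from typing import Callable, Dict, Iterable, List

Sink = Callable[[Dict], None]


def ingest(records: Iterable[Dict],
           validate: Callable[[Dict], bool],
           sinks: List[Sink]) -> int:
    """Send each valid record to every sink; return how many were accepted."""
    accepted = 0
    for record in records:
        if not validate(record):
            continue  # a real ingestor would likely log or reject this record
        for sink in sinks:
            sink(record)
        accepted += 1
    return accepted


# In-memory stand-ins for the data repository and the MLE input stream.
repository: List[Dict] = []
mle_stream: List[Dict] = []

count = ingest(
    [{"kpi": "cpu_util", "value": 71.5}, {"kpi": "cpu_util"}],
    validate=lambda record: "value" in record,  # second record has no value
    sinks=[repository.append, mle_stream.append],
)
print(count, repository, mle_stream)
```

Passing the sinks in as plain callables keeps the fan-out logic independent of whichever streaming technology actually carries the data.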

The Ingestion API accepts time-series data over gRPC for ingestion into the Heal streaming pipeline. It consists of:

Transaction API
- This API enables ingestion of requests or transactions into Heal via gRPC. The message should specify the IP and port of the server that serves the request, the origin server’s IP and port, the transaction type (TCP/HTTP), the response time, the status of the response (success/slow/failed, along with the HTTP response code wherever applicable) and the size and body of the request and response.
KPI API
- This API enables ingestion of performance metric KPIs from hosts and components. The message is expected to contain the time of KPI collection, the type of the KPI (supported types include availability, forensic, configuration watch and group KPI), the name of the KPI and the KPI value.
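
To make the two message shapes concrete, here is a sketch that models them as plain Python dataclasses with basic validation. These are not the actual gRPC message definitions: the class names, field names and units (milliseconds, bytes, epoch seconds) are assumptions, and the request and response bodies are left out for brevity.

```python
from dataclasses import asdict, dataclass
from typing import Optional

TRANSACTION_TYPES = {"TCP", "HTTP"}
RESPONSE_STATUSES = {"success", "slow", "failed"}


@dataclass
class TransactionRecord:
    """One request or transaction, mirroring the fields listed above."""
    server_ip: str
    server_port: int
    origin_ip: str
    origin_port: int
    transaction_type: str               # "TCP" or "HTTP"
    response_time_ms: float             # unit is an assumption
    status: str                         # "success", "slow" or "failed"
    http_status_code: Optional[int] = None
    request_size_bytes: int = 0
    response_size_bytes: int = 0

    def __post_init__(self):
        if self.transaction_type not in TRANSACTION_TYPES:
            raise ValueError(f"unsupported type: {self.transaction_type}")
        if self.status not in RESPONSE_STATUSES:
            raise ValueError(f"unsupported status: {self.status}")
        if self.response_time_ms < 0:
            raise ValueError("response time cannot be negative")


@dataclass
class KpiRecord:
    """One KPI sample from a host or component."""
    collected_at: int                   # epoch seconds (assumed)
    kpi_type: str                       # e.g. "availability"
    kpi_name: str
    value: float


txn = TransactionRecord("10.0.0.5", 8080, "10.0.0.9", 51234,
                        "HTTP", 182.0, "success", 200, 512, 2048)
kpi = KpiRecord(1586930400, "availability", "host_up", 1.0)
print(asdict(txn))
print(asdict(kpi))
```

A connector for a third-party monitoring tool would map that tool’s native records into structures like these before handing them to the gRPC client.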

Acting upon MLE Insights
Our blog on the Heal Action API detailed the three types of healing that can be carried out via the Action API. The Action API allows you to set up orchestration workflows via any ITSM, orchestration or RPA tool already deployed in your organization. Again, the outcome of a signal, whether it is simply a ticket being raised, a remedial action being carried out or a set of tasks that orchestrate a healing workflow in the tool, does not depend on Heal’s own agents being deployed for data collection, or on any specific ITSM system supported by Heal out of the box. Through a rich set of endpoints, the Action API lets you set up any workflow that aligns with your enterprise’s existing ticketing and orchestration infrastructure. This is done through an Action Plugin written by Heal for the respective orchestration tool, which can:

1) Invoke a script to run when the signal is generated/ticket is created, so that appropriate healing actions can be initiated on the target servers;
2) Create a call-back URL in Heal where the outcome of the healing action can be sent for storing in our repository, generating reports etc.
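
The two plugin responsibilities above can be sketched in Python as a single handler. This is a hypothetical illustration rather than the Action Plugin’s real interface: the `handle_signal` function, the signal fields (`signalId`, `script`, `callbackUrl`) and the example URL are invented for the example, and the script runner and callback sender are passed in as stand-ins so that the sketch runs without any network access.

```python
from typing import Callable, Dict


def handle_signal(signal: Dict,
                  run_script: Callable[[str], int],
                  report_outcome: Callable[[str, Dict], None]) -> Dict:
    """Run the healing script for a signal, then report the result.

    `run_script` returns an exit code (0 means success);
    `report_outcome` delivers the outcome to the call-back URL.
    """
    exit_code = run_script(signal["script"])
    outcome = {
        "signalId": signal["signalId"],
        "status": "healed" if exit_code == 0 else "failed",
        "exitCode": exit_code,
    }
    report_outcome(signal["callbackUrl"], outcome)
    return outcome


# Stand-ins: the script always succeeds, the callback is recorded in a list.
sent = []
result = handle_signal(
    {
        "signalId": "S-1",
        "script": "restart_service.sh",
        "callbackUrl": "https://heal.example/callback/S-1",
    },
    run_script=lambda script: 0,
    report_outcome=lambda url, body: sent.append((url, body)),
)
print(result)
```

In a real deployment the two callables would wrap the orchestration tool’s script runner and an HTTP client respectively.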

The flow followed will typically be:

Fig 4: Heal’s Action API exposes endpoints to create or update a signal generated by MLE as a ticket in an ITSM system or trigger a healing action, usually a script to normalize application/infra KPIs in an orchestration tool

Visualizing Heal Data via the Query API
The diagram below shows how the Query API retrieves data from Heal’s ExOps data repository to render the Heal user interface, including transaction and service dashboards, signal reports and application health summary dashboards. In addition, the Query API allows you to build your own custom dashboards with popular dashboard plugins and to generate scheduled as well as on-demand reports.

Fig 5: Heal’s Query API retrieves data from the ExOps repository (shown in bold black) to render the user interface as well as out-of-the-box and custom dashboards and reports

The Query API pulls information about basic entities using the “config” tag and time series data of Workload and Behaviour metrics stored in the ExOps repository using the “kpi-data” tag. Some of the API endpoints that can be used to get entity and metric information include:

To get the Account List belonging to an enterprise/account: GET /config/accounts
To get the list of Services in an account: GET /config/accounts/{accountId}/services
To get all the instances belonging to a service (this includes Host cluster instances, Component cluster instances, Host instances and Component instances): GET /config/accounts/{accountId}/services/{serviceId}/instances
To get the list of transactions or requests that are mapped to a service: GET /config/accounts/{accountId}/services/{serviceId}/transactions
To get specific transaction metrics: GET /config/accounts/{accountId}/services/{serviceId}/transactions/{transactionId}/kpis
To get behaviour metrics belonging to an instance: GET /config/accounts/{accountId}/services/{serviceId}/instances/{instanceId}/group-kpis

Using these basic API endpoints, the Query API provides you with all the information required to build a basic dashboard displaying the status and health of applications and services within your enterprise, and the performance of individual workload and behaviour metrics.
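
When building a custom dashboard, it can help to generate these paths from code instead of concatenating strings by hand. The sketch below builds the endpoint paths listed above. The helper names and the sample identifiers are assumptions, and the base URL, authentication and response format are deliberately left out because they are not described here.

```python
from urllib.parse import quote


def _seg(value) -> str:
    """Percent-encode one path segment so IDs cannot break the URL."""
    return quote(str(value), safe="")


def accounts_path() -> str:
    return "/config/accounts"


def services_path(account_id) -> str:
    return f"{accounts_path()}/{_seg(account_id)}/services"


def instances_path(account_id, service_id) -> str:
    return f"{services_path(account_id)}/{_seg(service_id)}/instances"


def transactions_path(account_id, service_id) -> str:
    return f"{services_path(account_id)}/{_seg(service_id)}/transactions"


def transaction_kpis_path(account_id, service_id, transaction_id) -> str:
    base = transactions_path(account_id, service_id)
    return f"{base}/{_seg(transaction_id)}/kpis"


def instance_group_kpis_path(account_id, service_id, instance_id) -> str:
    base = instances_path(account_id, service_id)
    return f"{base}/{_seg(instance_id)}/group-kpis"


# Sample identifiers are made up for illustration.
print(services_path("acme"))
print(transaction_kpis_path("acme", 7, 42))
print(instance_group_kpis_path("acme", 7, "db-01"))
```

Each helper builds on the one above it, so a change to the account or service prefix only needs to be made in one place.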

Conclusion
This blog was an attempt to introduce you to the features of Heal Agentless Edition and walk you through some of the APIs that allow Heal to be integrated seamlessly into your enterprise, regardless of whether other APM/ITOM or orchestration tools are already deployed there. For Heal to perform workload-behaviour correlation and self-healing in your enterprise, you do not need to install it with all the bells and whistles. Talking to our APIs in the required format over the required protocol will allow you to capitalize on the power of self-healing with minimal disruption and complete protection of your existing investments in infrastructure monitoring and orchestration. Our future blogs will cover these APIs in more detail, along with tutorials to familiarize you with usage scenarios.

Meanwhile, you can read our API documentation at http://healsoftware.ai/docs/ and reach out to us to schedule a demo of Heal Agentless Edition.