Heal Query API

by Vamsi Vedula | May 20, 2020

Using the Heal Query APIs to Build Custom Dashboards and Reports

One of our earlier blogs on the Heal Agentless Edition detailed the various APIs that Heal provides. These APIs ingest data from your existing APM tools via connectors, and trigger pre-configured actions in response to events, including early warning signals and collection of forensic data. They also let you create custom dashboards and reports based on data ingested via the streaming pipeline or events generated by MLE (Heal’s Machine Learning Engine) and persisted into the ExOps (Experience + Operations) data repository.

In today’s blog, we go over the features and capabilities of the Heal Query API, which enables you to create these custom dashboards and reports. This API gives Operations personnel a tool to create war room screens and CXO-level reports containing only the data points that are pertinent to their troubleshooting and reporting process. Instead of relying only on the out-of-the-box screens and reports that ship with Heal, Operations teams can focus exclusively on the data that they want to see. We will also go over some use cases to see why the Query API is an important cog in the Ops Management wheel.

Heal as an API Based Product – Recap
Heal relies on a suite of APIs to:

Ingest your enterprise’s topology and set up services and applications.
Ingest performance data i.e. workload, behaviour, log and configuration data.
Analyse data and generate signals, trigger healing and notifications.
Visualize the outcome of MLE’s actions – whether they are signals, reports, application dashboards, workload dashboards, service metric graphs or custom dashboards.

Fig 1: The Ingestion, Events, Topology and Query APIs and their role in making Heal an API based product which can be used in any monitored environment alongside an existing APM/ITOM tool.

Visualizing Heal Data via the Query API
The diagram below shows how the Query API retrieves data from Heal’s ExOps data repository to render the Heal user interface, including transaction and service dashboards, signal reports and application health summary dashboards. In addition, the Query API allows you to build your own custom dashboards with popular dashboard plugins and to generate scheduled as well as on-demand reports.

Fig 2: Heal’s Query API retrieves data from the ExOps repository (shown in bold black) to render the user interface as well as out-of-the-box and custom dashboards and reports.

Heal Entities and Events
To understand more about how the Query API works, we delve into the basic entities of Heal as encapsulated in the below hierarchical representation:

Fig 3: The Heal Entity Hierarchy, for which the Query API can retrieve information from the ExOps repository.

The above hierarchy represents the top-level entity for monitoring as an account, uniquely identified by an ID and associated with a name, for instance – XYZ Travels. Each account has multiple applications configured as a part of it. An application is a larger unit for monitoring and raising signals, generating reports and carrying out healing, where each application is a collection of services as defined by the account administrator during configuration. For instance, the account XYZ Travels could have two applications called Booking and Payment, both of which share a common database service called Booking DB.

A service is a logical entity that represents a collection of multiple homogeneous component instances and their host instances. For instance, the Booking DB service could have 4 instances of an Oracle 11g database component running on 2 Solaris 11 OS host instances. The service also has connections, both inbound and outbound, with other services.
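To make the hierarchy concrete, here is a minimal sketch of how it could be modelled in Python, using the XYZ Travels example above. The class and field names, the account ID and the instance and host names are our own illustrative assumptions, not the schema that the Query API returns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """A logical group of homogeneous component instances and their hosts."""
    name: str
    component_type: str
    component_instances: List[str] = field(default_factory=list)
    host_instances: List[str] = field(default_factory=list)


@dataclass
class Application:
    """A collection of services defined by the account administrator."""
    name: str
    services: List[Service] = field(default_factory=list)


@dataclass
class Account:
    """The top-level monitoring entity, identified by an ID and a name."""
    account_id: int
    name: str
    applications: List[Application] = field(default_factory=list)


# Booking DB: 4 Oracle 11g component instances on 2 Solaris 11 hosts.
# Instance and host names are made up for illustration.
booking_db = Service(
    name="Booking DB",
    component_type="Oracle 11g",
    component_instances=[f"oracle-{i}" for i in range(1, 5)],
    host_instances=["solaris-host-1", "solaris-host-2"],
)

# Both applications reference the same shared service object.
account = Account(
    account_id=1,  # hypothetical ID
    name="XYZ Travels",
    applications=[
        Application("Booking", [booking_db]),
        Application("Payment", [booking_db]),
    ],
)
```

Because Booking and Payment hold a reference to the same `Service` object, the shared database shows up once in the model, which mirrors how one service can belong to several applications.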

Services are associated with Workload and Behaviour attributes represented by KPIs (metrics). These metrics are either captured by Heal’s own agents or ingested into the ExOps repository by our Ingestion API. The metrics are the basis for all workload-behaviour correlation, application behaviour learning and dynamic thresholding performed by Heal’s MLE (Machine Learning Engine). The output of the algorithms running on these attributes is a set of Events emitted by MLE: the processed KPI values after applying aggregation operations (such as average response time, success/failure/slowness percentage on Request attributes, or cumulative CPU Utilization across cores in the case of a Behaviour attribute), a set of upper and lower thresholds for each metric as computed by MLE, and any signals (i.e. anomalies) detected on the KPIs after applying these thresholds.

A combination of the above entities and attributes, as extracted by the Query API, is enough to build any custom dashboard or report in Heal.

Query API Endpoints
Heal uses Keycloak for authenticating and authorizing all API calls. The main prerequisite to use the Query API is to generate an Authorization token via the Heal Keycloak server. All API calls need to be accompanied by the security token as well as the unique identifier for the Account.
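As a rough sketch of the token step, the snippet below builds a request against Keycloak's standard OpenID Connect token endpoint using only the Python standard library. The realm name, client ID, the use of the password grant and the `/auth` prefix in the usage note are assumptions; your Heal installation may be configured differently, so check with your administrator before relying on any of them.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(keycloak_base, realm, client_id, username, password):
    """Build (but do not send) an OpenID Connect password-grant token request."""
    # Assumption: realm, client and grant type depend on your Heal deployment.
    url = f"{keycloak_base.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    form = urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
    }).encode("ascii")
    # Supplying a body makes urllib send it as a URL-encoded form.
    return urllib.request.Request(url, data=form, method="POST")


def fetch_access_token(request):
    """Send the request and return the access_token field of the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

A call might look like `fetch_access_token(build_token_request("https://keycloak.example:8443/auth", "my-realm", "my-client", "user", "secret"))`, where every argument is a placeholder. Older Keycloak releases serve the endpoint under an `/auth` prefix and newer ones do not, which is why the base path is left to the caller.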

Query APIs use the following base URI:
https://<keycloak_host>:<keycloak_port>/appsone-query-api/v1.0/
where keycloak_host is the IP address or host name of the server where Keycloak is installed, and keycloak_port is the port number on which Keycloak is running.

The steps for using the Query API to get entity or metric data in CSV format from the ExOps repository are:

Identify the data that needs to be queried.
Create the request URI.
Make the GET HTTP call.

On successful query execution, the Query API returns a success response (response code 200) with the details in CSV format. In case of failure, it returns a failure response (response code 400 or 500).
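The three steps and the response handling can be sketched as follows. This is a minimal illustration, not the official client: we assume the token is sent as a standard `Authorization: Bearer` header and that the account identifier travels in the URL path, as it does in the sample endpoints later in this blog. The function names are our own.

```python
import urllib.error
import urllib.request

BASE_URI = "https://{host}:{port}/appsone-query-api/v1.0"


def build_query_request(host, port, path, token):
    """Build a GET request for a Query API path such as '/config/...'."""
    url = BASE_URI.format(host=host, port=port) + "/" + path.lstrip("/")
    request = urllib.request.Request(url, method="GET")
    # Assumption: the token is sent as a standard OAuth 2.0 bearer token.
    request.add_header("Authorization", f"Bearer {token}")
    return request


def run_query(request):
    """Return the CSV body on HTTP 200; raise RuntimeError on failure."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses.
        raise RuntimeError(f"Query API call failed with HTTP {err.code}") from err
```

Keeping request construction separate from execution makes the URI easy to inspect or log before any network call is made.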

The Query API pulls information about basic entities using the “config” tag and time series data of Workload and Behaviour metrics stored in the ExOps repository using the “kpi-data” tag. Some of the API endpoints that can be used to get entity and metric information include:

Get the list of applications belonging to an enterprise/account.
Get the list of services in an account.
Get all the instances belonging to a service (this includes host cluster instances, component cluster instances, host instances and component instances).
Get the list of transactions or requests that are mapped to a service.
Get specific transaction metrics like average response time of a transaction or top 10 slow/failed transactions (also called aggregate metrics).
Get behaviour metrics belonging to an instance (grouped KPI data like disk metrics and network interface metrics as well as non-grouped KPI data).

The following figure shows all the available API endpoints that are part of the Query API, segregated as per the entity that they pertain to. As can be seen, all basic entities are extracted using the “config” tag, whereas KPI data is time-series data retrieved using the “kpi-data” tag.

Fig 4: The Heal Query API endpoints segregated as per the entity/attribute/event they pertain to.

Use Cases and Samples of Query API
Example 1: Creating a Live Transaction Performance Dashboard
Operations teams always need to be on top of transaction performance and initiate corrective action the moment they see any degradation in transaction response times. A live transaction dashboard can show minute-by-minute graphs on transaction metrics like success, slow and failed percentage of requests, average response times and cumulative volumes by making the right Query API calls on the ExOps repository’s data.

The first call to be made is to get the details on all services of the application being monitored:
GET /config/accounts/{accountId}/applications/{applicationId}/services
This call returns a CSV of the form:
Service ID, Service Name, Type
E.g.:
21, OfBiz-Web-Service, Apache httpd
22, OfBiz-App-Service, Apache Tomcat
23, OfBiz-DB-Service, Oracle 11g
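A response of this shape can be parsed with Python's built-in `csv` module. The sketch below uses the sample rows above; whether the real response includes a header row is not something we assume either way, so the parser simply skips any row whose first column is not numeric. The function name and the dictionary layout are our own choices.

```python
import csv
import io

# Sample rows copied from the example response above.
SAMPLE = """21, OfBiz-Web-Service, Apache httpd
22, OfBiz-App-Service, Apache Tomcat
23, OfBiz-DB-Service, Oracle 11g
"""


def parse_services(csv_text):
    """Map each service name to its ID and type."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    services = {}
    for row in reader:
        # Skip blank lines and a header row, if one is present.
        if len(row) != 3 or not row[0].strip().isdigit():
            continue
        service_id, name, service_type = (cell.strip() for cell in row)
        services[name] = {"id": int(service_id), "type": service_type}
    return services


services = parse_services(SAMPLE)
app_service_id = services["OfBiz-App-Service"]["id"]
```

The looked-up ID is what gets substituted for `{serviceId}` in the next call.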
The second call is to get aggregated request data per service:
GET /kpi-data/accounts/{accountId}/services/{serviceId}/request-aggregates
This call returns a CSV of the form:
Request ID, Request Name, Success Count, Success Percentage, Failed Count, Failed Percentage, Slow Count, Slow Percentage
E.g. for the OfBiz-App-Service, by passing the service ID 22 to the above API endpoint, we can get a list of requests and their aggregated counts like this:
41, GET /booking/bookingMgr/hotelBooking|srv=OfBiz-App-Service|acc=5, 8958, 82.7, 793, 7.2, 907, 8.3
42, GET /booking/bookingMgr/verifyBooking|srv=OfBiz-App-Service|acc=5, 7393, 88.31, 450, 5.06, 461, 5.61
Etc.
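Before handing these rows to a dashboard, it helps to turn them into typed records. The following sketch parses the two sample rows above and picks the request with the highest slow percentage. The record and function names are illustrative, and the column order is taken from the CSV layout listed earlier.

```python
import csv
import io
from typing import NamedTuple


class RequestAggregate(NamedTuple):
    request_id: int
    name: str
    success_count: int
    success_pct: float
    failed_count: int
    failed_pct: float
    slow_count: int
    slow_pct: float


def parse_request_aggregates(csv_text):
    """Parse request-aggregate rows, skipping blank or header lines."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 8 or not row[0].strip().isdigit():
            continue
        cells = [cell.strip() for cell in row]
        rows.append(RequestAggregate(
            int(cells[0]), cells[1],
            int(cells[2]), float(cells[3]),
            int(cells[4]), float(cells[5]),
            int(cells[6]), float(cells[7]),
        ))
    return rows


# Sample rows copied from the example response above.
SAMPLE = """41, GET /booking/bookingMgr/hotelBooking|srv=OfBiz-App-Service|acc=5, 8958, 82.7, 793, 7.2, 907, 8.3
42, GET /booking/bookingMgr/verifyBooking|srv=OfBiz-App-Service|acc=5, 7393, 88.31, 450, 5.06, 461, 5.61
"""

aggregates = parse_request_aggregates(SAMPLE)
slowest = max(aggregates, key=lambda r: r.slow_pct)
```

Here `slowest` is the hotelBooking request (ID 41), since its slow percentage of 8.3 is the larger of the two.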
This information can be used via a dashboard plugin like Grafana to render a dashboard of the form:

Fig 5: Real-time Dashboard showing total volume, average response time, success, slow and failure percentage of transactions for a Booking application.

Since request aggregates are computed at one-minute intervals, the API call to get them can be made every minute, with the dashboard set to auto-refresh at the same rate. The result is a near real-time dashboard that lets operations teams monitor transaction performance and ensure SLA adherence.
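If you are feeding the dashboard from your own script rather than a plugin, the minute-by-minute refresh could be sketched as a simple polling loop. The `fetch` and `handle` callables are placeholders for whatever makes the Query API call and updates your dashboard; the function name and its parameters are our own.

```python
import time


def poll(fetch, handle, interval_seconds=60, iterations=None, sleep=time.sleep):
    """Call fetch() every interval_seconds and pass the result to handle().

    iterations=None polls forever; a number limits the loop (useful in tests).
    """
    count = 0
    while iterations is None or count < iterations:
        started = time.monotonic()
        try:
            handle(fetch())
        except Exception as err:  # keep the loop alive after one bad call
            print(f"poll failed: {err}")
        count += 1
        if iterations is not None and count >= iterations:
            break
        # Subtract the time the call took so polls stay about a minute apart.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting, and a failed call is logged rather than ending the refresh cycle.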

Example 2: Alert Reports
Operations Teams can write bash or Python scripts to generate custom reports by embedding the appropriate Query APIs in the script to pull the required details. One example is a monthly report listing all signals raised in applications where a metric violated a threshold, either static or dynamic, indicating that resource utilization exceeded a ceiling. These reports then become the basis for taking appropriate measures to scale the application so that resource availability is unaffected and outages can be minimized or averted altogether.

To get the per-application count of alerts (signals), and for each application the per-component-instance description of each signal along with the value that violated a threshold, we use the following Query API:
GET /config/accounts/{accountId}/applications/{applicationId}/signals
For each application ID passed to it, this API call returns a CSV of the form:
Signal Type, Signal ID, Severity, Description, Status, Started On, Ended On

These CSVs can be fed into a reporting tool like Tableau or Pentaho to generate monthly scheduled reports by triggering the respective script(s). Business owners can also ask for an enhancement of these reports to join the signal ID with possible root cause unearthed during the troubleshooting process and fed back into a ticketing system as part of the resolution description. This will help isolate most common root causes behind recurrent issues in an application and address them.
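As a starting point for such a script, the sketch below counts signals per severity from one application's CSV, using the column layout listed above (Signal Type, Signal ID, Severity, Description, Status, Started On, Ended On). The function name is ours, and we do not assume any particular severity names. One caveat: if a description contains unquoted commas, the row will not split into seven columns and is skipped here, so verify against a real response first.

```python
import csv
import io
from collections import Counter


def count_signals_by_severity(csv_text):
    """Return a Counter of signals keyed by the Severity column."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        # Expect exactly the seven columns listed for the signals endpoint.
        if len(row) != 7:
            continue
        severity = row[2].strip()
        if severity.lower() == "severity":
            continue  # header row, if the response includes one
        counts[severity] += 1
    return counts
```

Running this once per application ID and writing the counts to a file gives a simple monthly summary that a reporting tool can pick up.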

The above two use cases illustrate how the Query APIs can be an effective tool in the hands of Operations Teams. For more information on the Query API endpoints and their expected inputs and outputs, please go through the documentation at Heal Query API.

Conclusion
This blog was a deep dive into the Query API. Thanks to the other APIs that are part of the Heal product, it requires relatively little time and effort to incorporate self-healing into your enterprise alongside existing ITOM/AIOps tools and observe results. These results can be seen in the form of near real-time dashboards and scheduled reports: application health can be monitored more granularly, and reports can illustrate month-on-month how self-healing is reducing outages and transaction slowness in your applications, leading to more efficient troubleshooting teams and better Ops Management. Meanwhile, you can read the rest of our API documentation at http://healsoftware.ai/docs/ and reach out to us to schedule a demo of Heal for you.