How Heal Uses Early Warning Signals to Eliminate Future Incidents
With self-healing, Heal’s mission and vision have always been to avert untoward incidents in your enterprise; moving from “Break and Fix” to “Predict and Prevent” is our focus. To do this, we rely on what we call Early Warning Signals, detected by our proprietary Machine Learning Engine (MLE). As soon as the MLE detects an anomaly, it highlights the services that could be impacted by applying a combination of techniques (Application Behavior Learning, Workload-Behavior Correlation and Dynamic Thresholding), well before a service becomes unavailable or any end-user transactions start getting affected. When an early warning notification is received, an IT Operations user needs to do two things:
- Know the exact entry-level service(s) where users are likely to experience slowness as a result of these incidents, along with the set of end-user transactions likely to be affected.
- Analyze a set of possible root causes with adequate Situational Awareness, including just-in-time forensic data, drift tracking and log parsing, so the user can drill down into the exact root cause behind the issue with all required contextual data at hand, whether it is a faulty service, a network element or an external factor. This becomes the basis for initiating a healing action to proactively mitigate the issue, whether that means balancing the incoming load, tweaking the service configuration, or deploying a patch to address a badly performing class or query.
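To make the Dynamic Thresholding technique mentioned above a little more concrete: unlike a static limit, the "normal" band for a KPI is derived from the KPI's own recent behavior. The snippet below is a minimal sketch of that idea using a rolling mean and standard deviation; it is an illustration only, not Heal's MLE, which combines Dynamic Thresholding with Application Behavior Learning and Workload-Behavior Correlation.

```python
# Minimal illustration of dynamic thresholding (not Heal's MLE): the normal
# band for a KPI is derived from its own recent history, not a fixed limit.
from collections import deque
import statistics

class DynamicThreshold:
    def __init__(self, window=60, sigmas=3.0):
        self.history = deque(maxlen=window)  # recent KPI samples
        self.sigmas = sigmas                 # width of the normal band

    def is_anomalous(self, value):
        anomalous = False
        if len(self.history) >= 10:          # need a minimal baseline first
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) > self.sigmas * stdev
        self.history.append(value)
        return anomalous

detector = DynamicThreshold(window=60, sigmas=3.0)
for sample in [52, 54, 51, 53, 55, 52, 54, 53, 51, 52, 90]:  # e.g. JVM heap used, %
    if detector.is_anomalous(sample):
        print(f"anomaly: heap usage {sample}% deviates from recent baseline")
```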
The root cause analysis capability in Heal is of special significance. Most AIOps tools show a sequence of events that occurred in a service ecosystem, tag them in chronological order, and most often flag the earliest event as the possible root cause. However, what others call a root cause is merely a symptom. A JVM used-memory anomaly on an application server or a high-IO-reads anomaly on a database is really a symptom; the root cause is most likely a poorly designed class occupying heap space in the former case, or a poorly written query causing full table scans in the latter. Occasionally, the anomaly might be due to a configuration change to system parameters or to critical settings in an application server’s data source or database. This is where Heal’s situational awareness and collective knowledge help determine the precise root cause. From there, triggering the appropriate healing action to prevent an outage becomes more intuitive and accurate.
This blog details the proactive and remedial capabilities of Heal in a scenario where an early warning has been detected in a service and the Operations team uses Heal to drill down into the precise root cause of the issue. Wherever applicable and feasible, the team also takes steps to remediate the issue by triggering pre-configured healing scripts. These address the root cause behind the issue, so service behavior can be normalized as soon as possible without any downtime or perceivable breaches in transaction SLAs.
The Service Dependency Map
On logging into the Heal dashboard, a user sees the topology of all configured services in her account (called the SERVICE DEPENDENCY MAP) under the Services menu. This topology can either be discovered through our proprietary discovery agent or ingested from your data store via our Topology API. Connectors enable the discovery of instances in cloud and containerized environments.
Fig 1: The Service Dependency Map for XR Travels, populated through Heal’s Topology API
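For readers curious what ingestion via the Topology API looks like in practice, the sketch below shows the general shape of pushing a service dependency map from your own data store. The endpoint URL, authentication and payload schema here are placeholders for illustration, not Heal's documented API contract.

```python
# Hypothetical sketch of pushing a service dependency map to a topology API.
# The endpoint, auth scheme and payload schema are illustrative placeholders.
import json
import urllib.request

topology = {
    "account": "xr-travels",
    "services": ["Travel Web", "Payment Web", "Booking", "Booking DB"],
    "connections": [                       # directed edges: caller -> callee
        {"from": "Travel Web", "to": "Booking"},
        {"from": "Payment Web", "to": "Booking"},
        {"from": "Booking", "to": "Booking DB"},
    ],
}

req = urllib.request.Request(
    "https://heal.example.com/api/topology",   # placeholder URL
    data=json.dumps(topology).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <token>"},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```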
If the admin user has configured applications (an APPLICATION is nothing but a loosely coupled set of services working together to serve a specific business purpose), an APPLICATION HEALTH DASHBOARD shows her a concise view of ongoing issue(s) in her account, along with the applications affected by the issue(s) and the severity thereof.
Fig 2: The Application Health Dashboard giving an overview of the issues affecting the different applications configured as part of XR Travels. Critical Early Warnings are currently flagged in the Bookings and Payments applications.
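Conceptually, an application's health is a rollup of the health of its member services. The sketch below illustrates that idea with hypothetical data for XR Travels; it is a mental model of the rollup, not the dashboard's actual implementation.

```python
# Illustrative rollup: an application's severity is the worst severity among
# ongoing issues on its member services (hypothetical data, not Heal's logic).
SEVERITY_RANK = {"none": 0, "warning": 1, "severe": 2, "critical": 3}

applications = {
    "Bookings": {"Travel Web", "Booking", "Booking DB"},
    "Payments": {"Payment Web", "Booking"},   # services can belong to several applications
}

ongoing_issues = [("Booking", "critical")]    # (service, severity) pairs

def application_severity(app_services):
    worst = "none"
    for service, severity in ongoing_issues:
        if service in app_services and SEVERITY_RANK[severity] > SEVERITY_RANK[worst]:
            worst = severity
    return worst

for app, services in applications.items():
    print(app, application_severity(services))
# Both Bookings and Payments report "critical" because the Booking service is shared.
```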
When the user receives a notification of an early warning in one or more applications, she is led to the Issues List page, which lists all ongoing issues. Issues can be sorted by severity, status (Closed/Ongoing) and start time.
Fig 3: Issues list page with issues listed as per severity and time they started. W indicates an early warning, a type of issue where no transaction SLAs have been impacted yet.
The above image shows an early warning on the Booking service in XR Travels, which will potentially impact transactions on Travel Web and Payment Web. Clicking a specific warning ID from the Health Dashboard or the Issues List takes the user to the Early Warning Report page. This page is a single pane of glass that overlays all details of the early warning on the service topology of the pertinent application(s), for easy navigation and intuitive impact analysis.
Fig 4: The Early Warning Report page showing the details of the warning, including the incident timeline built from the various anomalies flagged by the MLE, and the potentially impacted transactions at the entry-level service(s).
The Early Warning Report page gives the user all details on the services affected, the incident timeline (a chronological list of related incidents, the “run-up” to the early warning), and one or more impact paths, which show all possible entry-level services and the corresponding transactions likely to be affected as part of this warning. The impact paths are derived from our knowledge of the application topology and the KPI relationships between the involved services.
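While the internals are not exposed, an impact path can be thought of as a walk over the dependency graph from the affected service back to the entry-level services that call into it. The sketch below illustrates that idea on a toy graph; the service names mirror the XR Travels example, but the code is illustrative only.

```python
# Conceptual sketch: finding impact paths from an affected service back to
# entry-level services by walking the dependency graph upstream.
CALLS = {                      # caller -> callees, as in the Service Dependency Map
    "Travel Web": ["Booking"],
    "Payment Web": ["Booking"],
    "Booking": ["Booking DB"],
}
ENTRY_SERVICES = {"Travel Web", "Payment Web"}

def impact_paths(affected, path=None):
    """Return every path from an entry-level service down to the affected service."""
    path = [affected] if path is None else [affected] + path
    if affected in ENTRY_SERVICES:
        return [path]
    callers = [svc for svc, callees in CALLS.items() if affected in callees]
    paths = []
    for caller in callers:
        paths.extend(impact_paths(caller, path))
    return paths

for p in impact_paths("Booking"):
    print(" -> ".join(p))
# Travel Web -> Booking
# Payment Web -> Booking
```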
The incident timeline is a single view of all anomalies captured by the MLE using our proprietary ensemble modelling techniques and elevated to a service-level incident after applying PERSISTENCE (has the anomaly persisted long enough to merit a second glance, or is it a momentary blip?) and SUPPRESSION (is this anomaly being flagged repeatedly, and should some occurrences be suppressed before it is highlighted again?) within the ML Engine.
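The precise persistence and suppression logic is proprietary to the MLE, but a rough mental model is sketched below: an anomaly is promoted to an incident only after it persists for a few consecutive windows, and repeat occurrences within a cooldown period are suppressed. The window and cooldown values here are arbitrary placeholders.

```python
# Rough mental model of PERSISTENCE and SUPPRESSION (not the actual MLE logic).
class IncidentPromoter:
    def __init__(self, persistence=3, cooldown_secs=900):
        self.persistence = persistence    # consecutive anomalous windows required
        self.cooldown_secs = cooldown_secs  # suppress repeats within this window
        self.streaks = {}                 # kpi -> consecutive anomalous windows
        self.last_raised = {}             # kpi -> time the last incident was raised

    def observe(self, kpi, anomalous, now):
        """Return True if this anomaly should be raised as a service-level incident."""
        self.streaks[kpi] = self.streaks.get(kpi, 0) + 1 if anomalous else 0
        if self.streaks[kpi] < self.persistence:
            return False                  # PERSISTENCE: ignore momentary blips
        last = self.last_raised.get(kpi)
        if last is not None and now - last < self.cooldown_secs:
            return False                  # SUPPRESSION: highlighted too recently
        self.last_raised[kpi] = now
        return True

promoter = IncidentPromoter(persistence=3, cooldown_secs=900)
for minute, flag in enumerate([True, True, True, True]):
    if promoter.observe("jvm_used_memory", flag, now=minute * 60):
        print(f"minute {minute}: raise incident on jvm_used_memory")
# Only minute 2 raises an incident; minute 3 is suppressed by the cooldown.
```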
A deep dive into one of these affected services along the impact paths leads the user to the precise reason for the incidents.
Fig 5: The incident timeline indicates that the Booking service has multiple memory KPIs across instances displaying anomalous behavior. A deep dive into the service is necessary to establish the reason behind all the memory-related anomalies.
The Service Dashboard
Clicking on any service on the impact path brings the user to that service’s dashboard page, which provides a quick overview of its health at that time. The service details on the dashboard page display all the data gathered for the service, including the workload, transaction-specific code snapshots, component and host metrics, and any code/log/database deep-dive and diagnostic data collected on the individual service instances. In this case, it looks like the Booking service has a cluster-wide memory issue. The user can click the heat map to find out more.
Fig 6: Service-level dashboards show an overview of service performance at the cluster level and at the individual instance level.
Clicking on the overview heat map brings up the details, memory-related metrics in this case. All service behavior metrics are available here, arranged by category. Anomalies are highlighted in red. Hovering over an anomaly brings up the specifics as well as links to additional diagnostics (or forensics) data. This page is a single pane of glass containing all the situational awareness and collective knowledge needed for troubleshooting and initiating healing.
Fig 7: Instance-level graphs for memory metrics showing anomalies on individual KPIs. Links to Forensics are provided wherever available.
Just-in-time (JIT) Forensics can trigger the collection of multiple diagnostic blobs for an anomaly. Users can add custom scripts to collect more diagnostic data if desired. The forensic blobs can be analyzed here or downloaded for offline analysis by experts.
Fig 8: Forensic data on memory metrics, giving additional insight on possible root causes – processes hogging memory on the host, classes occupying heap space on the JVM and log snippets with OOM error information.
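As an example of the custom scripts mentioned above, the sketch below shows the kind of data a memory-focused forensic script might gather on a Linux host running a JVM: top memory consumers, a heap histogram and recent OOM log lines. The JVM PID, log path and commands are placeholders that assume standard procps and JDK tooling; adapt them to your environment.

```python
#!/usr/bin/env python3
# Example of a custom forensic collection script for a memory anomaly on a
# Linux host running a JVM. PID and log path are placeholders.
import subprocess

JVM_PID = "12345"                        # placeholder: the affected service JVM
APP_LOG = "/var/log/booking/app.log"     # placeholder: application log file

def run(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout
    except Exception as exc:             # keep collecting even if one probe fails
        return f"<{cmd[0]} failed: {exc}>\n"

blob = {
    # top memory-consuming processes on the host
    "top_processes": run(["sh", "-c", "ps aux --sort=-%mem | head -n 15"]),
    # classes occupying the most heap space in the JVM
    "heap_histogram": run(["sh", "-c", f"jmap -histo {JVM_PID} | head -n 25"]),
    # recent OutOfMemory errors from the application log
    "oom_log_lines": run(["grep", "-i", "OutOfMemoryError", APP_LOG]),
}

for name, output in blob.items():
    print(f"===== {name} =====\n{output}")
```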
In this blog, we saw how just-in-time forensic data on a service dashboard leads a user closer to the root cause behind an early warning. In this case, the self-healing script in a cloud environment could dynamically increase the memory allocated to a container instance so that Out-of-Memory errors do not recur. In an on-premises environment, manual intervention will be necessary to address the memory issues on the concerned service instance. If neither scenario is possible, a script that throttles the workload at each of the entry-level services likely to be impacted is the appropriate healing action to configure. Heal’s Action API allows you to configure the rules based on which these healing actions (in the form of scripts) are invoked.
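As a concrete illustration of the containerized scenario, a healing script invoked by such a rule might look like the sketch below, which raises the memory limit on a Kubernetes deployment. The deployment name, container name and new limit are placeholders, and this is an example of what a rule could run, not Heal's own script; the Action API rule format itself is product-specific and not shown here.

```python
# Sketch of a healing action for a Kubernetes environment: raise the memory
# limit on the affected deployment so OOM errors do not recur.
# Deployment, container and limit values are placeholders.
import subprocess
import sys

DEPLOYMENT = "booking"       # placeholder: deployment whose instances hit OOM
CONTAINER = "booking"        # placeholder container name
NEW_LIMIT = "2Gi"            # placeholder: the bumped memory limit

cmd = [
    "kubectl", "set", "resources", f"deployment/{DEPLOYMENT}",
    f"--containers={CONTAINER}", f"--limits=memory={NEW_LIMIT}",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout or result.stderr)
sys.exit(result.returncode)
```

In an on-premises setup the same hook could instead page an operator or open a ticket, since resizing memory there typically needs manual intervention.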
In our next blog in this series, we will look at how code deep dive and DB deep dive data can be utilized for pinning down an RCA attributable to poorly written code or SQL queries.