HEAL & New Relic: A Match Made in ITOps Heaven

Jan 19, 2021

HEAL’s AI-Powered Preventive Healing Platform Can Augment Your Existing New Relic Setup to Guarantee a Zero-Downtime Customer Experience

 

With HEAL, you can:

  • Generate early warning signals that alert you to impending issues before they occur.
  • Create remediation workflows which provide dynamic workload shaping and resource optimization for preventive healing.
  • Minimize alert storms and provide correlated, time-synchronized contextual data about an incident to facilitate intelligent root cause analysis (RCA).
  • Plan capacity more effectively by projecting behavior trends in conjunction with workload growth forecasts.

How preventive healing works
Preventive healing systems like HEAL rely on a patented technique called workload-behavior correlation, in which workload patterns and the corresponding behavioral patterns are learned and baselined. When an abnormal workload signature arrives, corrective action can be taken before an incident occurs. In other words, HEAL can identify likely future events and prevent them, achieving what has lately been called negative MTTR, because remediation starts before the incident does.
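
To make the idea concrete, here is a minimal sketch of workload-behavior baselining. It is not HEAL's patented implementation: the bucketing scheme, the function names (`learn_baselines`, `classify`) and the three-sigma threshold are all assumptions chosen for illustration. Behavior samples are grouped by workload level, a baseline is learned per group, and a new observation is flagged either because its workload signature was never seen or because the behavior deviates from what that workload normally produces.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_baselines(history, bucket_size=100):
    """Learn a (mean, stddev) behavior baseline per workload bucket.

    history: iterable of (workload, behavior) pairs, for example
    (transactions per minute, response time in ms).
    bucket_size is a hypothetical tuning knob, not a HEAL setting.
    """
    buckets = defaultdict(list)
    for workload, behavior in history:
        buckets[workload // bucket_size].append(behavior)
    return {b: (mean(v), pstdev(v)) for b, v in buckets.items()}


def classify(baselines, workload, behavior, bucket_size=100, k=3.0):
    """Return 'unseen_workload', 'abnormal_behavior' or 'normal'."""
    bucket = workload // bucket_size
    if bucket not in baselines:
        # A workload signature that was never baselined is treated
        # as an early warning in this sketch.
        return "unseen_workload"
    mu, sigma = baselines[bucket]
    # Guard against a zero stddev when a bucket holds identical samples.
    if abs(behavior - mu) > k * max(sigma, 1e-9):
        return "abnormal_behavior"
    return "normal"


history = [(120, 200), (130, 210), (150, 190),
           (220, 300), (250, 310), (270, 290)]
baselines = learn_baselines(history)
print(classify(baselines, 140, 205))
print(classify(baselines, 140, 400))
print(classify(baselines, 900, 200))
```

The point of keying the baseline on workload is that a slow response under heavy load may be perfectly normal, while the same response time under light load is not.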

HEAL can take three types of preventive actions:

  • Autonomous healing: Workload optimization to handle transaction surges and dynamic changes to infrastructure and deployment configurations in response to lead signals (early warnings).
  • Proactive healing: Identification of potential hot spots based on recurring lead signals, giving enough time to manually take preventive actions.
  • Projected healing: Workload growth projections, identification of capacity choke points, as well as formulation of business-aligned capacity growth plans with what-if analysis.
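
As an illustration of the projected healing idea, the sketch below fits a straight-line trend to historical workload and estimates how many periods remain before it reaches a capacity limit. The linear model and the function names are assumptions made for this example; HEAL's actual forecasting method is not described here. Calling it with different capacity values is a simple form of what-if analysis.

```python
import math


def fit_linear_trend(values):
    """Least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def periods_until_capacity(values, capacity):
    """Periods after the last observation until the trend hits capacity.

    Returns None when the workload is flat or shrinking, and 0 when
    the trend has already reached the limit.
    """
    slope, intercept = fit_linear_trend(values)
    if slope <= 0:
        return None
    x_hit = (capacity - intercept) / slope
    last = len(values) - 1
    return max(0, math.ceil(x_hit - last))


monthly_peak_tps = [100, 110, 120, 130]
# What-if: compare the current limit with a planned upgrade.
for limit in (200, 260):
    print(limit, periods_until_capacity(monthly_peak_tps, limit))
```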

HEAL not only pre-emptively alerts teams to an impending incident, but also captures adequate time-synchronized context along with each event, so the sequence of events leading to an anomaly can be pinpointed. This helps shift the focus from MTTR to the number of incidents averted and the Mean Time Between Failures (MTBF) as the key metrics.

HEAL and New Relic
HEAL integrates seamlessly with other AIOps tools and can ingest telemetry data collected by them through an Ingestion API.

Once this data is ingested by HEAL, our Machine Learning Engine (MLE) gets to work. In batch mode, MLE examines workload and behavior metrics together and performs Application Behavior Learning to establish baselines for expected system behavior in response to key workload signatures.

Once this step is complete and the initial set of models is ready, HEAL’s streaming analytics pipeline can pull metric data from New Relic in real time. Scoring this live data against the models produces signals, the term HEAL uses for anomaly events.

New Relic’s incident response workflows can be supplemented with HEAL’s own Actions configured in response to events/signals generated by MLE. These actions could include notifications, ticketing, invocation of remedial workflows or simply passing the event to New Relic to take further action. HEAL also integrates with various incident management tools like PagerDuty and ServiceNow via its Action API, which means remedial workflows can be integrated into the existing orchestrations within these tools.
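
The routing of signals to actions can be pictured with a small rule table. The sketch below is purely illustrative: the `Signal` fields, the rule predicates and the handler functions are hypothetical names, not HEAL's Action API or any New Relic, PagerDuty or ServiceNow interface. Each handler just returns a string so the flow is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    service: str
    kind: str      # hypothetical labels: "early_warning" or "problem"
    severity: int  # hypothetical scale, higher is worse


# Stand-in handlers; real ones would call out to external systems.
def notify(signal):
    return f"notify:{signal.service}"


def open_ticket(signal):
    return f"ticket:{signal.service}"


def run_remediation(signal):
    return f"remediate:{signal.service}"


def forward(signal):
    return f"forward:{signal.service}"


# First matching rule wins; each rule maps a predicate to actions.
RULES = [
    (lambda s: s.kind == "early_warning" and s.severity >= 3,
     [run_remediation, notify]),
    (lambda s: s.kind == "problem",
     [open_ticket, notify]),
]


def dispatch(signal, rules=RULES):
    """Run the actions of the first matching rule, else pass it on."""
    for predicate, actions in rules:
        if predicate(signal):
            return [action(signal) for action in actions]
    # No rule matched: hand the event to the upstream tool.
    return [forward(signal)]


print(dispatch(Signal("checkout", "early_warning", 4)))
print(dispatch(Signal("checkout", "early_warning", 1)))
```

Keeping the rules as data rather than nested conditionals makes it straightforward to add a new action type without touching the dispatch logic.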

Another important action that HEAL takes in response to an anomaly event is the collection of forensic or diagnostic data through an event handler, which augments the incident data with time-synchronized application context. This proves invaluable in establishing a conclusive root cause chain should an issue ever occur. The forensic data is also used by HEAL’s Action API when initiating autonomous or proactive remedial workflows.
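
One way to think about time-synchronized context is as the same time window cut from every data source around the event. The sketch below assumes each source is a list of `(timestamp, value)` samples sorted by timestamp; the function names and the default window sizes are hypothetical and do not reflect how HEAL's event handler is actually built.

```python
from bisect import bisect_left, bisect_right


def context_window(samples, event_ts, before=300, after=60):
    """Return samples with event_ts - before <= ts <= event_ts + after.

    samples must be sorted by timestamp (seconds).
    """
    timestamps = [ts for ts, _ in samples]
    lo = bisect_left(timestamps, event_ts - before)
    hi = bisect_right(timestamps, event_ts + after)
    return samples[lo:hi]


def collect_forensics(sources, event_ts, before=300, after=60):
    """Cut the same window from every source so they line up in time."""
    return {
        name: context_window(samples, event_ts, before, after)
        for name, samples in sources.items()
    }


sources = {
    "cpu": [(0, 10), (100, 20), (400, 90), (500, 95)],
    "logs": [(390, "timeout"), (700, "ok")],
}
snapshot = collect_forensics(sources, event_ts=450, before=100, after=60)
print(snapshot)
```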