Self-Healing for Site Reliability Engineering

by Vamsi Vedula | Apr 29, 2020

How Heal Monitors the Four Golden Signals and Automates SRE for a Self-Healing Enterprise

In our previous blog, we discussed some of the issues in engineering site reliability for enterprises with ultra-dynamic workloads. The problems SREs typically face in such scenarios include deployment complexity, an overwhelming volume of monitoring data arriving as alert storms, ever-changing traffic patterns, and a lack of automation to narrow down root causes and remediate issues autonomously.

Fig 1: Challenges before SREs driving the need for automation in enterprises handling dynamic workloads.

The solution to these problems lies in adopting AIOps and self-healing tools to introduce automation in site reliability engineering. Today's blog goes through some of the ways in which such tools mitigate the issues mentioned above, with a special focus on self-healing. We also touch upon how Heal monitors the Four Golden Signals for SRE as laid down by site reliability engineers at Google, namely Latency, Traffic, Errors and Saturation.

How AIOps and Self-Healing Tools help SREs
AIOps tools enable SREs to observe the real-time behaviour of applications, with systems collecting and correlating information from all related components. These solutions shift alerting from reactive to proactive by using time series analysis and AI/ML approaches to continuously identify anomalous patterns (like the repeated checkout failures mentioned in the first blog of the series) and compare them to historical trends, so that SREs are alerted before outages occur and business is impacted. They also use techniques like instrumentation and database deep-dive analysis to point out discrepancies at a code or query level.

The most important business driver in adopting AIOps is being able to handle ultra-dynamic workloads and improve customer experience through speed, uptime and scalability. To achieve this, AIOps tools provide uninterrupted monitoring of heterogeneous application landscapes, proactive flagging of performance, capacity and configuration issues, event correlation for speedy root cause analysis, and automation of remedial workflows.

Fig 2: Dimensions of automation provided by AIOps and Self-Healing tools.

Self-Healing for SREs
There is an added dimension to consider while monitoring enterprises with dynamic workloads – the effect of such a workload on the underlying infrastructure both in the short-term and long-term, and how that correlates with the most frequently occurring issues observed in the system.

Self-healing tools like Heal base their predictive signalling on the premise of workload-behaviour correlation. Our earlier blogs have spoken about the patented techniques we have built into our Machine Learning Engine (MLE) to raise signals proactively when our models detect either anomalous patterns in workload, or in the underlying system’s behaviour as a result of the workload. These same algorithms allow us to build capacity forecasting models that can chart out what-if scenarios of potential workload trends and plan scaling via projected healing to handle them.

Fig 3: Application Behaviour Learning (ABL) performed by MLE, in which anomalies on metrics are raised on Normal Operating Ranges (NORs) which in turn are derived based on workload.
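The idea of workload-derived operating ranges can be illustrated with a small sketch. This is not Heal's patented algorithm, only a minimal illustration under stated assumptions: samples are grouped into workload buckets, and each bucket's range is taken as the mean plus or minus k standard deviations. The names learn_nors and is_anomalous, the bucket size and the value of k are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_nors(samples, bucket_size=100, k=3.0):
    """Learn a (low, high) range for a metric per workload bucket.

    samples: iterable of (workload_rps, metric_value) pairs.
    Assumption: mean +/- k standard deviations stands in for the NOR.
    """
    buckets = defaultdict(list)
    for workload, value in samples:
        buckets[int(workload // bucket_size)].append(value)
    nors = {}
    for bucket, values in buckets.items():
        centre, spread = mean(values), pstdev(values)
        nors[bucket] = (centre - k * spread, centre + k * spread)
    return nors


def is_anomalous(nors, workload, value, bucket_size=100):
    """Flag a metric value that falls outside the range for its workload."""
    bounds = nors.get(int(workload // bucket_size))
    if bounds is None:
        return False  # no history at this workload level, so no verdict
    low, high = bounds
    return not (low <= value <= high)


# Hypothetical history: CPU % observed at low and at high request rates.
history = [(50, 20), (55, 22), (60, 24), (450, 60), (455, 62), (460, 64)]
nors = learn_nors(history)
print(is_anomalous(nors, 60, 62))   # 62% CPU under light load
print(is_anomalous(nors, 460, 62))  # 62% CPU under heavy load
```

The same metric value is treated differently depending on the workload under which it was observed, which is the point of deriving the range from workload rather than using one static threshold.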

Heal also provides an Action API through which autonomous and proactive healing measures can be put in place, based on the currently observed workload and the correlations with system behaviour derived from it. This helps operations teams act on predictive alerts and prevents end users from facing any perceptible slowness or failures. You can read more about this API and the different types of healing in our blog Heal Action API.

This is not to say that Heal cannot do what other AIOps tools can. Heal collects monitoring data from across disparate silos and deployment environments through its own agents, or via custom connectors and an Ingestion API. It thereby provides all data points required for an in-depth analysis of any issues in your enterprise, including log data, user journeys, code instrumentation and database deep-dive data. You can read more about Heal’s ingestion and processing pipeline in our blog on Heal Data Architecture.

The Four Golden Signals to Monitor
The best-known Site Reliability Engineering handbooks are written by engineers at Google, according to whom engineering reliability effectively depends on monitoring Four Golden Signals:

Latency – Measure of how long it takes to serve a request, e.g. whether your webpage is being served without slowness.
Traffic – Measure of the demand being placed on the system, usually measured in terms of HTTP requests per second.
Errors – Rate of requests that fail, e.g. 3 out of every 100 requests result in a 500 internal server error.
Saturation – Measure of how constrained your system is under the current workload. Many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. For instance, a CPU Utilization of 100% across all available cores is possible but not desirable, so a utilization target of 80% can be set before the system alerts you on CPU usage.
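As a rough illustration of the four definitions above, the sketch below computes each signal from one window of request records. The Request type, its field names, the golden_signals function and the choice of the 99th percentile for latency are assumptions made for this example; the 80% CPU target is the one used in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_utilization, cpu_target=0.80):
    """Summarise one monitoring window as the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": percentile([r.duration_ms for r in requests], 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": failures / len(requests),
        # Alert at the utilization target rather than at 100%.
        "saturated": cpu_utilization >= cpu_target,
    }


# Hypothetical window: 100 requests in 10 seconds, 3 of them failing.
window = [Request(duration_ms=i, status=500 if i <= 3 else 200)
          for i in range(1, 101)]
print(golden_signals(window, window_s=10, cpu_utilization=0.85))
```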

How Heal monitors the Four Golden Signals
In this section, we see how Heal addresses the Four Golden Signals via its self-healing capabilities:

Fig 4: Heal capabilities mapped to the Four Golden Signals for SRE.

Monitoring Latency and Traffic
Increased workloads at the entry level service of an application might cause it to respond slowly due to downstream services facing a processing overload and subsequent resource contention. The composition of the inbound load at a service is also instrumental in establishing a causal relationship with the underlying system’s behaviour, since every transaction results in its own unique resource consumption footprint on the backend infrastructure. Heal detects these situations predictively and raises signals accordingly. Autonomous and proactive healing measures help in mitigating these signals before they turn into issues impacting service availability and transaction SLAs. Some of these measures include throttling the workload, or optimizing the workload arriving at an application to free up system resources by deferring non-priority transactions so priority transactions can be served without latency. Dynamically provisioning additional infrastructure in a cloud setup also helps in timely processing of surge workloads.
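Deferring non-priority transactions can be sketched as a simple admission step. This is a minimal illustration and not Heal's implementation: the admit function, the "priority" flag on each transaction and the fixed per-interval capacity are all assumptions made for the example.

```python
def admit(transactions, capacity):
    """Split one interval's transactions into (serve_now, deferred).

    Priority transactions are admitted first; non-priority ones fill
    whatever capacity remains and the rest are deferred.
    """
    priority = [t for t in transactions if t["priority"]]
    others = [t for t in transactions if not t["priority"]]

    serve_now = priority[:capacity]
    deferred = priority[capacity:]

    spare = capacity - len(serve_now)  # never negative
    serve_now += others[:spare]
    deferred += others[spare:]
    return serve_now, deferred


# Hypothetical surge: five transactions arrive, capacity is three.
incoming = [
    {"name": "checkout", "priority": True},
    {"name": "report-export", "priority": False},
    {"name": "payment", "priority": True},
    {"name": "recommendations", "priority": False},
    {"name": "newsletter-batch", "priority": False},
]
now, later = admit(incoming, capacity=3)
print([t["name"] for t in now])
print([t["name"] for t in later])
```

In a real deployment the deferred transactions would be queued and retried once resources free up, and the capacity would come from the observed state of the system rather than a constant.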

Fig 5: Possible self-healing actions in response to latency and increased traffic at the entry level service of an application, especially a website serving requests, API calls, etc.

Troubleshooting Errors
If transactions fail with HTTP error codes, the onerous process of troubleshooting to look for the exact cause of the error starts. For true collaboration to take place during the troubleshooting process, the first step is to increase visibility across silos and applications to gain insight into the users’ most visited application journeys. This helps to clearly identify application bottlenecks and abandonment and to triage efficiently.

For infrastructure teams, improved visibility means having insight into resource constraints impacting application performance, so that teams can correlate the user experience with the underlying infrastructure in real time. Heal does user journey mapping to zero in on problematic steps in journeys and maps them to the corresponding infrastructure elements involved in any ongoing signal(s). It also generates detailed real-time dashboards on journey analytics, including response time for each step in a journey, conversion rates, business/technical errors encountered in the steps and possible root cause walks to the responsible service, whether it is on-premises or in the cloud.

Fig 6: Steps in a user journey for product checkout on an e-commerce platform. The OTP Submission Step is slow and is related to an ongoing problem signal. The services it runs on are highlighted so further deep-dive analysis can be done.

Planning for Service Saturation
Heal provides two modes of healing to facilitate detection of services close to saturation:

Proactive healing via hotspot analysis: Highlighting services which are close to capacity breach based on current workload in the system. Latency increases are often a leading indicator of saturation. Examining the 99th percentile response time using trend analysis can give a hotspot indication of services close to saturation.
Projected healing via capacity forecasting: Projecting workload growth trends in an application and performing a what-if analysis of the corresponding impact on its services. This might reveal services which are under-provisioned (i.e. can handle only 10% more traffic, or even less traffic than they currently receive) or services which are over-provisioned (i.e. can comfortably handle double the traffic they currently receive).
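Both modes can be approximated with a few lines of arithmetic. The sketch below is an illustration under stated assumptions rather than Heal's forecasting model: the hotspot check fits a least-squares slope to recent 99th percentile response times, and the what-if check compares a service's maximum sustainable rate with a projected workload. The function names, the slope threshold and the 10% and 100% headroom cut-offs (taken from the examples in the text) are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (needs >= 2 points)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_hotspot(p99_ms_history, max_slope_ms=5.0):
    """Proactive check: is p99 latency climbing faster than the threshold?"""
    return trend_slope(p99_ms_history) > max_slope_ms


def classify_capacity(current_rps, max_rps, growth=0.0, low=0.10, high=1.00):
    """Projected check: headroom left after a what-if workload growth."""
    projected_rps = current_rps * (1 + growth)
    headroom = max_rps / projected_rps - 1
    if headroom <= low:
        return "under-provisioned"
    if headroom >= high:
        return "over-provisioned"
    return "adequate"


# Hypothetical service: p99 rising by about 10 ms per window.
print(is_hotspot([100, 110, 120, 130]))
# The same service sized for 250 rps, today and after 50% growth.
print(classify_capacity(current_rps=100, max_rps=250))
print(classify_capacity(current_rps=100, max_rps=250, growth=0.5))
```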

Both these modes of healing are explained in detail in our blog on Proactive and Projected App-aware Scaling.

Conclusion
In this blog, we introduced the concept of self-healing as it applies to engineering site reliability for dynamic workloads, framed around the Four Golden Signals described in SRE handbooks. Heal, with its proactive, autonomous and projected healing capabilities, helps mitigate most issues arising from the unpredictability of incoming workload and its effect on the underlying infrastructure, in both the short term and the long term. Our next blog in the series will focus on the unique SRE challenges in E-Commerce and how Heal addresses them to maximize site uptime and customer delight with minimal disruption to business. Meanwhile, do read our other blogs on self-healing, our data architecture and the APIs that are part of our product, and don’t forget to tune in next week to learn more about how Heal helps your E-Commerce setup stay up and running 24×7!