Self-Healing for Site Reliability Engineering

How Heal Monitors the Four Golden Signals and Automates SRE for a Self-Healing Enterprise

In our previous blog, we discussed some of the challenges of engineering site reliability for enterprises with ultra-dynamic workloads. The problems SREs commonly face in such scenarios include deployment complexity, an overwhelming volume of monitoring data and alert storms, ever-changing traffic patterns, and a lack of automation to narrow down root causes and remediate issues autonomously.

Challenges before SREs driving the need for automation in enterprises handling dynamic workloads.

The solution to these problems lies in adopting AIOps and self-healing tools to introduce automation into site reliability engineering. Our blog today goes through some of the ways in which such tools mitigate the above-mentioned issues, with a special focus on self-healing. We also touch upon how Heal monitors the Four Golden Signals for SRE as laid down by site reliability engineers at Google: Latency, Traffic, Errors and Saturation.

How AIOps and Self-Healing Tools help SREs

AIOps tools enable SREs to observe the real-time behaviour of applications, with systems collecting and correlating information from all related components. These solutions shift alerting from reactive to proactive by using time series analysis and AI/ML approaches to continuously identify anomalous patterns (like the repeated checkout failures mentioned in the first blog of the series) and compare them to historical trends, so that SREs are alerted before outages occur and business is impacted. They also use techniques like instrumentation and database deep-dive analysis to point out discrepancies at a code or query level.

The most important business driver for adopting AIOps is the ability to handle ultra-dynamic workloads and improve customer experience through speed, uptime and scalability. To achieve this, AIOps tools provide uninterrupted monitoring of heterogeneous application landscapes, proactive flagging of performance, capacity and configuration issues, event correlation for speedy root cause analysis, and automation of remedial workflows.

Dimensions of automation provided by AIOps and Self-Healing tools.

Self-Healing for SREs

There is an added dimension to consider while monitoring enterprises with dynamic workloads – the effect of such a workload on the underlying infrastructure both in the short-term and long-term, and how that correlates with the most frequently occurring issues observed in the system.

Self-healing tools like Heal base their predictive signalling on the premise of workload-behaviour correlation. Our earlier blogs have described the patented techniques built into our Machine Learning Engine (MLE), which raise signals proactively when our models detect anomalous patterns either in the workload or in the underlying system’s behaviour as a result of the workload. These same algorithms allow us to build capacity forecasting models that can chart out what-if scenarios of potential workload trends and plan scaling via projected healing to handle them.

Application Behaviour Learning (ABL) performed by MLE, in which anomalies in metrics are raised against Normal Operating Ranges (NORs), which are in turn derived from workload.
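To make the idea of a workload-derived operating range concrete, here is a minimal sketch. It is not Heal's patented technique: the bucketing by workload level, the mean plus or minus k standard deviations rule, and the names `learn_nors` and `is_anomalous` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_nors(history, bucket_size=100, k=3.0):
    """Learn a (low, high) operating range per workload bucket.

    history: iterable of (requests_per_min, metric_value) pairs.
    The range is mean +/- k standard deviations (an illustrative rule only).
    """
    buckets = defaultdict(list)
    for workload, value in history:
        buckets[int(workload // bucket_size)].append(value)

    nors = {}
    for bucket, values in buckets.items():
        centre = mean(values)
        spread = pstdev(values)
        nors[bucket] = (centre - k * spread, centre + k * spread)
    return nors


def is_anomalous(nors, workload, value, bucket_size=100):
    """True if value lies outside the range learned for this workload level."""
    bucket = int(workload // bucket_size)
    if bucket not in nors:
        return True  # workload level never seen before: flag it
    low, high = nors[bucket]
    return not (low <= value <= high)


# Synthetic history (made up): CPU% rises with workload, plus a little noise.
history = [
    (w, 10 + 0.05 * w + noise)
    for w in range(0, 1000, 10)
    for noise in (-1.0, 0.0, 1.0)
]
nors = learn_nors(history)

print(is_anomalous(nors, 950, 57.5))  # heavy load, high CPU
print(is_anomalous(nors, 150, 57.5))  # light load, same CPU value
```

The same CPU reading is accepted under heavy load and flagged under light load, which is the point of deriving the range from workload rather than using one static threshold.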

Heal also provides an Action API through which autonomous and proactive healing measures can be put in place, based on the currently observed workload and the correlations with system behaviour derived from it. This helps operations teams act on predictive alerts and prevents end users from facing any perceptible slowness or failures. You can read more about this API and the different types of healing in our blog Heal Action API.

This is not to say that Heal cannot do what other AIOps tools can. Heal collects monitoring data from across disparate silos and deployment environments through its own agents, or via custom connectors and an Ingestion API. It thereby provides all data points required for an in-depth analysis of any issues in your enterprise, including log data, user journeys, code instrumentation and database deep-dive data. You can read more about Heal’s ingestion and processing pipeline in our blog on Heal Data Architecture.

The Four Golden Signals to Monitor

The best-known Site Reliability Engineering handbooks were written by engineers at Google, according to whom effective reliability engineering depends on monitoring Four Golden Signals:

Latency – Measure of the time taken to service a request, for example how long your webpage takes to be served.
Traffic – Measure of the demand being placed on the system, typically expressed as HTTP requests per second.
Errors – Rate of requests that fail, e.g., 3 out of every 100 requests resulting in a 500 Internal Server Error.
Saturation – Measure of how constrained your system is under the current workload. Many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
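As a rough illustration of the four definitions above, the sketch below computes each signal from a window of request records. The `Request` fields, the choice of the 95th percentile for latency, and the 80% CPU utilization target are assumptions for the example, not values taken from Heal or from the Google handbooks.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_utilisation, cpu_target=0.8):
    """Summarise one monitoring window as the four golden signals."""
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        # Latency of successful requests only; failures are tracked separately.
        "latency_p95_ms": percentile(ok_durations, 95) if ok_durations else None,
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Above 1.0 means the utilization target has been exceeded.
        "saturation": cpu_utilisation / cpu_target,
    }


# Made-up window: 97 successful requests and 3 server errors over 10 seconds.
window = [Request(float(ms), 200) for ms in range(1, 98)]
window += [Request(30.0, 500) for _ in range(3)]
signals = golden_signals(window, window_s=10, cpu_utilisation=0.4)
print(signals)
```

Expressing saturation relative to a target rather than to 100% reflects the observation above that many systems degrade well before they are fully utilized.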

How Heal monitors the Four Golden Signals

In this section, we see how Heal addresses the Four Golden Signals via its self-healing capabilities:

Heal capabilities mapped to the Four Golden Signals for SRE.

Monitoring Latency and Traffic

Increased workloads at the entry-level service of an application might cause it to respond slowly because downstream services face a processing overload and the resulting resource contention. The composition of the inbound load at a service is also instrumental in establishing a causal relationship with the underlying system’s behaviour, since every transaction type leaves its own resource consumption footprint on the backend infrastructure. Heal detects these situations predictively and raises signals accordingly. Autonomous and proactive healing measures help mitigate these signals before they turn into issues impacting service availability and transaction SLAs. Such measures include throttling the workload, or optimizing the workload arriving at an application by deferring non-priority transactions, which frees up system resources so that priority transactions can be served without added latency. Dynamically provisioning additional infrastructure in a cloud setup also helps in the timely processing of surge workloads.

Possible self-healing actions in response to latency and increased traffic at the entry-level service of an application, especially a website serving requests, API calls, etc.
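The deferral of non-priority transactions described above can be sketched as a small admission controller. This is an illustrative assumption, not Heal's Action API: the class name `AdmissionController`, the latency SLO value and the transaction types are all hypothetical.

```python
from collections import deque


class AdmissionController:
    """Defer non-priority transactions while the service is under pressure."""

    def __init__(self, latency_slo_ms, priority_types):
        self.latency_slo_ms = latency_slo_ms
        self.priority_types = set(priority_types)
        self.deferred = deque()

    def admit(self, txn_type, payload, current_p95_ms):
        """Return True to process now, False if the transaction was deferred."""
        under_pressure = current_p95_ms > self.latency_slo_ms
        if under_pressure and txn_type not in self.priority_types:
            self.deferred.append((txn_type, payload))
            return False
        return True

    def drain(self, current_p95_ms, batch=10):
        """Release up to `batch` deferred transactions once latency recovers."""
        released = []
        while (
            self.deferred
            and len(released) < batch
            and current_p95_ms <= self.latency_slo_ms
        ):
            released.append(self.deferred.popleft())
        return released


controller = AdmissionController(500, {"checkout", "payment"})

# During a surge (p95 of 800 ms), only priority transactions go through.
admitted_checkout = controller.admit("checkout", {"order": 1}, 800)
admitted_recs = controller.admit("recommendations", {"user": 7}, 800)

# Nothing is released while latency is still above the SLO.
released_during_surge = controller.drain(800)
# Once latency recovers, the deferred work is replayed in arrival order.
released_after = controller.drain(200)
```

A queue keeps deferred work in arrival order, so non-priority transactions are delayed rather than dropped; outright throttling would reject them instead.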

Troubleshooting Errors

When transactions fail with HTTP error codes, the onerous process of troubleshooting to find the exact cause of the error begins. For true collaboration to take place during troubleshooting, the first step is to increase visibility across silos and applications and to gain insight into the users’ most visited application journeys. This helps to clearly identify application bottlenecks and points of abandonment, and to triage efficiently.

For infrastructure teams, improved visibility means having insight into the resource constraints impacting application performance, so that teams can correlate the user experience to the underlying infrastructure in real time. Heal performs user journey mapping to zero in on problematic steps in journeys and maps them to the corresponding infrastructure elements involved in any ongoing signal(s). It also generates detailed real-time dashboards on journey analytics, including the response time for each step in a journey, conversion rates, business and technical errors encountered in the steps, and possible root cause walks to the responsible service, whether it runs on-premises or in the cloud.

Steps in a user journey for product checkout on an e-commerce platform. The OTP Submission Step is slow and is related to an ongoing problem signal. The services it runs on are highlighted so further deep-dive analysis can be done.
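A journey like the one pictured can be summarised per step with a few lines of code. The sketch below is only an illustration of the kind of analytics described: the step names, counts, response times and the 1000 ms slowness threshold are made-up assumptions, and `journey_report` is a hypothetical helper rather than a Heal function.

```python
def journey_report(steps, slow_ms=1000):
    """Per-step conversion rate and a flag for slow steps.

    steps: ordered list of dicts with keys
           name, entered, completed, avg_response_ms.
    """
    report = []
    for step in steps:
        entered = step["entered"]
        conversion = step["completed"] / entered if entered else 0.0
        report.append(
            {
                "step": step["name"],
                "conversion": round(conversion, 3),
                "slow": step["avg_response_ms"] > slow_ms,
            }
        )
    return report


# Made-up numbers for a checkout journey.
checkout = [
    {"name": "Cart", "entered": 1000, "completed": 900, "avg_response_ms": 300},
    {"name": "Address", "entered": 900, "completed": 850, "avg_response_ms": 350},
    {"name": "OTP Submission", "entered": 850, "completed": 510, "avg_response_ms": 2400},
    {"name": "Payment", "entered": 510, "completed": 500, "avg_response_ms": 400},
]

report = journey_report(checkout)
slow_steps = [row["step"] for row in report if row["slow"]]
print(slow_steps)
```

In this example the slow step is also the one with the lowest conversion rate, which is the pattern that makes a step worth mapping to the services it runs on.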

Planning for Service Saturation

Heal provides two modes of healing to facilitate detection of services close to saturation:

Proactive healing via hotspot analysis: Highlighting services which are close to capacity breach based on current workload in the system.
Projected healing via capacity forecasting: Projecting workload growth trends and performing a what-if analysis of the corresponding impact on services.

Both these modes of healing are explained in detail in our blog on Proactive and Projected App-aware Scaling.
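The two modes can be illustrated with a simplified sketch. It assumes a fixed utilization threshold for hotspots and a straight-line trend for forecasting; Heal's actual models are not described here, and the function names, service names and numbers below are hypothetical.

```python
import math


def hotspots(utilisation, threshold=0.8):
    """Services whose current utilization is at or above the threshold."""
    return sorted(name for name, used in utilisation.items() if used >= threshold)


def fit_trend(series):
    """Least-squares line through equally spaced samples (needs >= 2 points).

    Returns (slope, intercept).
    """
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def periods_until_breach(series, capacity, growth_multiplier=1.0):
    """Periods after the last sample until the projected load reaches capacity.

    growth_multiplier scales the fitted growth rate for what-if scenarios.
    Returns None if the projected load never reaches capacity.
    """
    slope, intercept = fit_trend(series)
    fitted_last = intercept + slope * (len(series) - 1)
    if fitted_last >= capacity:
        return 0
    projected_slope = slope * growth_multiplier
    if projected_slope <= 0:
        return None
    return math.ceil((capacity - fitted_last) / projected_slope)


# Proactive: which services are close to capacity right now?
current = {"cart": 0.55, "payment": 0.91, "otp": 0.83}
print(hotspots(current))

# Projected: weekly peak requests per second against a capacity of 200.
weekly_peak_rps = [100, 110, 120, 130, 140]
print(periods_until_breach(weekly_peak_rps, capacity=200))
print(periods_until_breach(weekly_peak_rps, capacity=200, growth_multiplier=2.0))
```

Doubling the growth rate in the what-if call halves the time available before the capacity breach, which is the kind of question projected healing is meant to answer ahead of time.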

Conclusion

In this blog, we have introduced the concept of self-healing as it applies to engineering site reliability for dynamic workloads, by addressing the Four Golden Signals that SRE handbooks recommend monitoring. Heal, with its proactive, autonomous and projected healing capabilities, helps mitigate most issues arising out of the unpredictability of incoming workload and its effect on the underlying infrastructure, in both the short term and the long term. Our next blog in the series will focus on the unique SRE challenges in the E-Commerce milieu and how Heal addresses them, ensuring maximum site uptime and customer delight with minimal disruption to business. Meanwhile, do read our other blogs on self-healing, our data architecture and the APIs that are part of our product, and don’t forget to tune in next week to understand more about how Heal helps your E-Commerce setup stay up and running 24×7!

About HEAL Software

HEAL Software is a renowned provider of AIOps (Artificial Intelligence for IT Operations) solutions. HEAL Software’s unwavering dedication to leveraging AI and automation empowers IT teams to address IT challenges, enhance incident management, reduce downtime, and ensure seamless IT operations. Through the analysis of extensive data, our solutions provide real-time insights, predictive analytics, and automated remediation, thereby enabling proactive monitoring and solution recommendation. Other features include anomaly detection, capacity forecasting, root cause analysis, and event correlation. With the state-of-the-art AIOps solutions, HEAL Software consistently drives digital transformation and delivers significant value to businesses across diverse industries.
