The Five Data Pillars of Effective Root-Cause Analysis

Sep 8, 2021

The most effective way to understand an incident, resolve it, and prevent it from recurring is root-cause analysis. Simply put, root-cause analysis is the investigation performed by ITOps teams or site reliability engineers (SREs) to pinpoint the exact element or error that caused the unexpected behavior. Based on this, they plan remediation.

Accurate and timely root-cause analysis can have a direct impact on the company’s top and bottom line. Good root-cause analysis can:

  • Reduce mean time to resolve (MTTR), limiting revenue losses in the process
  • Identify those anomalies that cause incidents and help IT teams focus on them alone
  • Reduce time and cost of incident remediation

To perform accurate and timely root-cause analysis, enterprises need a robust anomaly detection mechanism that identifies contextual outliers and reduces false positives. 45% of businesses are using AIOps for this. However, accurate, contextual, and relevant anomaly detection can only come from an unshakable data foundation. In this blog post, we discuss the five key datasets that form the foundation of your AIOps practice.

#1 Metric data

Measurements of key performance indicators over time.

Metric data are, in essence, time series of the key performance indicators (KPIs) defined in the service-level agreement (SLA) of the deployed system. Enterprises gather these by monitoring the performance of their IT assets in real time. For instance, if CPU utilization is your metric, you will collect a particular application’s CPU utilization at regular intervals over a period of time. From this, you will set baselines against which anomalies can be identified.

Some of the basic metrics you need for a successful AIOps application are:

  • CPU utilization
  • Memory utilization
  • Run time
  • Response time
  • Wait time
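
To make the baselining idea concrete, here is a minimal Python sketch that computes a rolling baseline over a CPU-utilization series and flags points that deviate sharply from it. The sample values, window size, and 3-sigma threshold are illustrative assumptions, not settings prescribed by any particular monitoring tool.

```python
# Illustrative only: a rolling-baseline check on a CPU-utilization time series.
from statistics import mean, stdev

def find_anomalies(samples, window=12, sigmas=3.0):
    """Flag points that deviate more than `sigmas` standard deviations
    from the baseline computed over the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sd = mean(baseline), stdev(baseline)
        if sd > 0 and abs(samples[i] - mu) > sigmas * sd:
            anomalies.append((i, samples[i]))
    return anomalies

# CPU utilization (%) collected at regular intervals for one application
cpu = [41, 43, 40, 42, 44, 41, 39, 42, 43, 40, 41, 42, 95, 42, 41]
print(find_anomalies(cpu))   # [(12, 95)] -- the spike stands out from the baseline
```

In practice, the same idea is applied per metric and per asset, typically with seasonality-aware baselines rather than a fixed window.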

#2 Logs

Contextual, orthogonal pieces of data that help build early warning systems.

In any IT organization, application and system logs act as the first evidence of an incident. They help you understand what went wrong, when, where, and perhaps even why. A key feature of logs is that they are append-only, which means they retain the full history of events and comments, giving you complete context.

Site reliability engineers typically turn to logs because not all the information is available in metric data. For instance, while performing user impact analysis, an SRE might need the affected entity IDs, which will not be available in the metric data. Moreover, logs offer richer and deeper data points for performing root-cause analysis.
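
As a concrete illustration of why logs matter for impact analysis, here is a small sketch that pulls affected entity IDs out of error-level log lines. The log format and the entity_id field are hypothetical; real services will emit their own structure.

```python
# A minimal sketch of extracting affected entity IDs from application logs
# during impact analysis. The log lines and the "entity_id=" field are made up.
import re

LOG_LINES = [
    "2021-09-08T10:02:11Z ERROR payment-svc entity_id=ord-1842 db timeout after 5000ms",
    "2021-09-08T10:02:12Z INFO  payment-svc entity_id=ord-1843 processed ok",
    "2021-09-08T10:02:14Z ERROR payment-svc entity_id=ord-1844 db timeout after 5000ms",
]

def affected_entities(lines, level="ERROR"):
    """Return the entity IDs mentioned in log lines at the given severity."""
    ids = set()
    for line in lines:
        if level not in line:
            continue
        match = re.search(r"entity_id=(\S+)", line)
        if match:
            ids.add(match.group(1))
    return ids

print(affected_entities(LOG_LINES))   # {'ord-1842', 'ord-1844'}
```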

#3 Topology

Relationships and dependencies among assets in the IT landscape.

Understanding the relationships between the various IT assets is critical to identifying the impact of one on another. For instance, if an application service calls a particular database service, the former will be impacted by an outage in the latter. In a complex IT landscape of infrastructure, applications, and services across multi-cloud or hybrid-cloud environments, such relationships are the cornerstone of good root-cause analysis.

To understand this, AIOps tools use topology data. Topology is the representation of the connections and dependencies among hosts, applications, and services in your environment. Tracing the topology around each incident helps gauge all impacted nodes, the extent of impact, the probability of future incidents, and so on.
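
The sketch below shows, with assumed service names and edges, how a dependency graph can be walked from a failed node to everything that depends on it. Real topology data would come from discovery tooling or a CMDB rather than a hand-written dictionary.

```python
# A toy dependency graph for tracing impact from one failing node.
from collections import deque

# "A depends on B" is stored as B -> [A, ...] so we can walk the impact outward.
DEPENDENTS = {
    "orders-db":    ["order-svc"],
    "order-svc":    ["checkout-api", "reporting-job"],
    "checkout-api": ["web-frontend"],
}

def impacted_by(failed_node, dependents):
    """Breadth-first walk from the failed node to everything that depends on it."""
    impacted, queue = set(), deque([failed_node])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

print(impacted_by("orders-db", DEPENDENTS))
# {'order-svc', 'checkout-api', 'reporting-job', 'web-frontend'}
```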

#4 Past alerts

History of anomalies and incidents.

For a robust anomaly detection system, your AIOps tool needs all the past alerts raised by your IT assets. By correlating new alerts with previously detected anomalies, alerts, and the corresponding incidents, the machine learning engine will be able to predict future outages.

In the simplest case, when you receive an alert, the AIOps tool analyzes past alerts for similarity. If a similar past alert turned out to be critical, it can bump up the severity and perform an impact analysis. If the similar past alert turned out to be a false alarm, it can suppress the new one.

Let’s say a server crashes when its disk is full. Based on past alerts and the corresponding incidents, the SRE will know that disk capacity reaching 90% is an early warning signal. They will be able to predict the incident, a server crash, before it happens.
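
A minimal sketch of that triage logic might look like the following. The alert fields, signatures, and outcomes are made up for illustration and are not the actual matching rules of any AIOps product.

```python
# Illustrative (not production) similarity check against past alerts.
PAST_ALERTS = [
    {"signature": "disk_usage_high", "host": "db-01",  "outcome": "server crash"},
    {"signature": "cpu_spike",       "host": "web-02", "outcome": "false alarm"},
]

def triage(new_alert, history):
    """Bump or suppress a new alert based on what similar past alerts led to."""
    for past in history:
        if past["signature"] == new_alert["signature"]:
            if past["outcome"] == "false alarm":
                return "suppress"
            return f"escalate: a similar alert previously led to {past['outcome']}"
    return "open as normal"

alert = {"signature": "disk_usage_high", "host": "db-03", "metric": "disk_pct", "value": 91}
print(triage(alert, PAST_ALERTS))
# escalate: a similar alert previously led to server crash
```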

#5 Workload data

Performance metrics of each workload.

Most anomaly detection systems flag natural changes in application behavior as anomalies because they do not account for workload volumes. For instance, a simple monitoring tool based on univariate analysis will identify a spike in CPU utilization as an anomaly, even if the spike merely reflects peak-hour traffic. Workload volume is the contextual information that distinguishes the two.
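
To illustrate the idea only (this is not HEAL’s proprietary workload-behavior correlation algorithm), the sketch below judges CPU per request rather than raw CPU, so a utilization spike that tracks a traffic spike is not flagged. The data and the 3-sigma threshold are illustrative assumptions.

```python
# Workload-aware check: judge CPU per unit of workload instead of raw CPU.
from statistics import mean, stdev

cpu_pct  = [30, 32, 31, 62, 33, 31, 30]          # CPU spike at index 3
requests = [300, 310, 305, 620, 320, 300, 295]   # traffic doubled at the same time

cpu_per_request = [c / r for c, r in zip(cpu_pct, requests)]
mu, sd = mean(cpu_per_request), stdev(cpu_per_request)

for i, value in enumerate(cpu_per_request):
    if sd > 0 and abs(value - mu) > 3 * sd:
        print(f"anomaly at sample {i}: CPU out of line with workload")
# Prints nothing: the CPU spike tracks the request spike, so it is normal behavior.
```

A univariate check on the raw CPU series alone would have flagged the spike at index 3, even though it is expected peak-hour behavior.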

HEAL’s proprietary workload-behavior correlation algorithms leverage this contextual data to detect anomalies accurately and effectively. We also use it to perform root-cause analysis and troubleshoot more meaningfully.

If you’re looking to set up an AIOps solution that can perform effective and predictive root-cause analysis, speak to a HEAL consultant today.