Why is Causation Important in AIOps?

by Girish Muckai | Feb 23, 2022

Modern IT environments have become much harder to manage: hybrid infrastructures and comprehensive instrumentation constantly generate metrics, alerts, and event data. ITOps (IT Operations) and SRE (Site Reliability Engineering) teams are tasked with delivering superior performance and user experience across numerous applications while keeping the budget under control.

Observability of applications and infrastructure (compute, storage, networking, databases, and other elements) provides great visibility into their inner workings. However, it also generates a large and often unwieldy volume of alerts and events. It becomes harder and harder for operations teams to tell false alerts from the ones that need attention, so they end up chasing the wrong problems and neglecting what matters. ITOps teams have been emphasizing the need for more sophisticated approaches. Advances in AI (Artificial Intelligence), thankfully, can bridge this gap.

AI has been applied to ITOps problems over the past several years with increasing effectiveness and sophistication. Unsupervised learning, self-supervised learning, and transfer learning models have become very effective.

When the users of an application experience an outage or service-level degradation (an anomaly), there is always an underlying cause. This is one thesis in which diagnostics and solution recommendations are rooted. If the same cause recurs whenever certain anomalies or outages happen, then it is up to the AI model to help ITOps teams unearth it. If we can do this with reasonable certainty, we can not only fix a problem quickly but, more importantly, prevent the issue from happening in the first place.
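To make the idea of a recurring cause concrete, here is a minimal Python sketch that counts which event types show up shortly before each anomaly. The function name `rank_precursors`, the event names, the timestamps, and the 120-second window are all hypothetical and chosen only for illustration. Note that this is a simple precedence count, not a causal model: an event that often precedes anomalies is a candidate to investigate, not a proven cause.

```python
from collections import Counter

def rank_precursors(events, anomalies, window):
    """Rank event types by the fraction of anomalies they preceded.

    events:    list of (timestamp_seconds, event_type) tuples
    anomalies: list of anomaly timestamps in seconds
    window:    how far back (in seconds) to look before each anomaly
    """
    counts = Counter()
    for t_anom in anomalies:
        # Distinct event types seen in the look-back window of this anomaly.
        seen = {etype for t, etype in events if t_anom - window <= t < t_anom}
        counts.update(seen)
    total = len(anomalies)
    if total == 0:
        return []
    return [(etype, n / total) for etype, n in counts.most_common()]

# Hypothetical event log and anomaly times (illustrative data only).
events = [
    (100, "disk_latency_high"), (130, "deploy"),
    (400, "disk_latency_high"), (405, "cache_miss_spike"),
    (900, "disk_latency_high"), (1500, "deploy"),
]
anomalies = [160, 460, 950]

ranking = rank_precursors(events, anomalies, window=120)
for etype, share in ranking:
    print(f"{etype}: preceded {share:.0%} of anomalies")
```

In this made-up data, `disk_latency_high` appears before every anomaly, while the other events appear before only one each. A count like this ignores how often an event occurs without any anomaly following, which is one reason a proper causal analysis is still needed.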

Advances in causal modeling, especially for time series data, are allowing AIOps solutions to help ITOps teams identify and develop causal graphs. These graphs become a reliable, durable basis for operating a sophisticated IT environment.
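As one illustration of what causal analysis on time series can look like, the sketch below implements a basic Granger-style test in plain Python. This is a swapped-in textbook technique, not necessarily what any particular AIOps product uses. It asks whether past values of one metric improve the prediction of another metric beyond what that metric's own past already explains. The function names (`solve`, `rss`, `granger_f`) and the synthetic series are assumptions made for this example. Granger causality measures predictive precedence, so a high score suggests a candidate edge for a causal graph rather than proof of causation.

```python
import random

def solve(a, b):
    """Solve a small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(features, target):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, target))

def granger_f(x, y, lag=1):
    """F statistic for 'past x helps predict y beyond past y alone'."""
    steps = range(lag, len(y))
    target = [y[t] for t in steps]
    own = [[y[t - i] for i in range(1, lag + 1)] for t in steps]
    both = [o + [x[t - i] for i in range(1, lag + 1)] for o, t in zip(own, steps)]
    rss_restricted, rss_unrestricted = rss(own, target), rss(both, target)
    dof = len(target) - (2 * lag + 1)
    return ((rss_restricted - rss_unrestricted) / lag) / (rss_unrestricted / dof)

# Synthetic metrics (illustrative only): y follows x with a one-step delay.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(300)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * random.gauss(0, 1) for t in range(1, 300)]

f_x_to_y = granger_f(x, y)  # expected to be large: x leads y
f_y_to_x = granger_f(y, x)  # expected to be small: y does not lead x
print(f"x -> y: F = {f_x_to_y:.1f}, y -> x: F = {f_y_to_x:.1f}")
```

Running a test like this over every pair of metrics and keeping only the strong, one-directional results is one simple way to propose edges for a causal graph. Real deployments would also need to handle confounders, multiple lags, and non-stationary data.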