At 8:54 pm on November 1, 2020, a customer of HDFC Bank complained on Twitter that the bank’s services, including internet banking and ATMs, were down. Over the next couple of hours, more customers raised similar issues, saying that UPI, credit card, and debit card transactions weren’t working either.
Finally, at 11:55 pm, the bank confirmed that one of its data centers had suffered an outage. “Restoration shouldn’t take long,” it promised. In the end, it took twelve hours to get services back up. The bank’s outages were severe enough that the RBI temporarily restricted it from launching new digital products.
HDFC isn’t the only one. SBI had 27 outages in 2021 alone, ICICI 7, and Indian Bank 15. Even digital-native companies like Paytm have experienced outages. The consequences can be serious. As in HDFC Bank’s case, an outage might attract regulatory restrictions and penalties. Customers who experience regular outages, corporate customers especially, might move to other banks. Negative press can erode brand value and take a toll on the company’s stock price.
In the post-pandemic digitized world, it is critical for banking and financial services organizations to have zero downtime. That’s easier said than done, though. Today, outages are almost inevitable.
Why do outages happen?
A surge in digital interactions: In 2020, India topped the world in digital transactions at 25.5bn, with China a distant second at 15.7bn. Beyond transactions, customers are increasingly performing other tasks online: checking their balances, downloading statements, updating KYC documents, and so on. In fact, after the pandemic, even interactions that once had to be done in person, like opening an account, moved to video. This sudden increase in load on digital banking channels has strained the systems they run on.
Disparate IT systems: “On an average, a bank has 200-400 IT applications,” says Vivek Belgavi, fintech leader at PwC India. Even the simplest transaction or task passes through several systems before it completes. So, when one system fails, every system downstream of it is affected.
Post-facto approach: Most banks are only capable of troubleshooting after an outage occurs. They set goals of mean-time-to-resolve, always focussing on post-facto resolution. Without a thorough monitoring and analysis mechanism, outages are difficult to predict and impossible to prevent.
How can banks predict and prevent outages?
Global banks are already using advanced operations management techniques to predict and prevent outages. An emerging approach is AIOps, applying artificial intelligence (AI) and machine learning (ML) to IT operations management. It leverages continuous monitoring data to identify anomalies, prevent incidents and provide autonomous remediation capabilities to IT systems.
Here’s a checklist to modernize your monitoring landscape and prepare it for the ever-increasing complexity of banking environments.
Set up a robust monitoring system: Given the myriad tools and applications banks use, the first step is to gather data. An average bank already has at least two monitoring tools collecting large volumes of data. However, most of them don’t collect the right data. Banks must begin by collecting raw metrics and event data from multiple sources, in multiple formats, and ingesting them into an AI/ML engine.
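To make the ingestion step concrete, here is a minimal sketch of normalizing records from two monitoring tools into one common shape before handing them to an engine. Everything in it is an assumption for illustration: the `MetricEvent` schema, the source names, and the field names (`name`/`val`/`ts` for the JSON feed, `metric`/`value`/`timestamp` for the CSV export) are hypothetical and do not come from any particular product.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricEvent:
    """Hypothetical common schema shared by all sources."""
    source: str     # which monitoring tool emitted the record
    metric: str     # metric name, e.g. "latency_ms"
    value: float
    timestamp: int  # Unix epoch seconds


def from_json_lines(source: str, payload: str) -> list[MetricEvent]:
    """Parse newline-delimited JSON (assumed fields: name, val, ts)."""
    events = []
    for line in payload.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        events.append(
            MetricEvent(source, rec["name"], float(rec["val"]), int(rec["ts"]))
        )
    return events


def from_csv(source: str, payload: str) -> list[MetricEvent]:
    """Parse a CSV export (assumed header: metric,value,timestamp)."""
    reader = csv.DictReader(io.StringIO(payload))
    return [
        MetricEvent(source, row["metric"], float(row["value"]), int(row["timestamp"]))
        for row in reader
    ]


def merge(*batches: list[MetricEvent]) -> list[MetricEvent]:
    """Combine batches from all sources into one time-ordered stream."""
    merged = [event for batch in batches for event in batch]
    return sorted(merged, key=lambda event: event.timestamp)
```

The point of the common schema is that whatever consumes the stream sees one format, regardless of how many tools feed it.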
Choose and implement the right AIOps tool: Once the data is collected, it’s the job of the AI/ML engine to crunch it, identify anomalies, predict incidents, and run autonomous remediation. There are a ton of AIOps solutions in the market that integrate with monitoring solutions and collect all forms of data. But the success of an AIOps solution isn’t defined by ‘how much data can it ingest?’ but by ‘is it putting the data to good use?’ In other words, choose a solution whose AI/ML models are effective for your needs.
Right now, AIOps solutions barely differ in the data they can ingest, which is a key component of observability: pretty much every solution can ingest all formats of data. The real differentiator is the ML that powers the solution. So, while choosing an AIOps solution, look for the following:
- Integration capabilities to collect data from a wide range of sources
- Ability to accurately crunch data and identify patterns
- Ability to factor in seasonality and workload trends to reduce false positives
- Ability to act autonomously on the insights and fix minor issues before they escalate
- Ability to identify potential chokepoints and optimize resources
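To illustrate why seasonality matters for false positives, here is a small sketch that compares each reading against a baseline for the same hour of day rather than against one global threshold. It is a deliberately simple z-score check, not how any specific AIOps product works; the function names and the threshold of 3 standard deviations are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_baseline(history):
    """Build {hour: (mean, stdev)} from (hour_of_day, value) pairs.

    Hours with fewer than two samples are skipped, because a standard
    deviation cannot be computed for them.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {
        hour: (mean(values), stdev(values))
        for hour, values in by_hour.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, hour, value, z_threshold=3.0):
    """Flag a value that deviates strongly from what is normal for that hour."""
    if hour not in baseline:
        return False  # no history for this hour: do not alert
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu  # history was perfectly flat
    return abs(value - mu) / sigma > z_threshold


# Illustrative transactions-per-minute history: busy at 10:00, quiet at 03:00.
history = [(10, v) for v in (1000, 1100, 900, 1050, 950)]
history += [(3, v) for v in (50, 60, 40, 55, 45)]
baseline = build_hourly_baseline(history)
```

With this baseline, 1,000 transactions per minute at 10:00 is treated as normal, while the same volume at 03:00 is flagged. A single fixed threshold would have to either alert on every morning peak or miss the night-time spike.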
Upgrade processes and metrics: Most ITOps teams consider mean-time-to-resolve (MTTR) a key metric. But MTTR only measures how quickly a team recovers once customers are already affected, and as we’ve seen from the HDFC experience, even a few minutes of outage can have a significant impact on the bank’s topline. Therefore, banks and financial institutions should move from MTTR towards proactive metrics like the number of incidents averted.
They must automate remediation of simple/repetitive issues, saving time and improving the productivity of the ITOps teams. They must also set up their AIOps to autonomously optimize resources, as needed. For instance, based on past data or expected workloads, the AIOps engine should be able to provision extra compute resources autonomously, preventing possible outages.
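As a rough sketch of that provisioning decision, the snippet below turns an expected peak workload into a number of instances to add. The forecasting rule (take the highest peak seen in comparable past periods), the 25% headroom, and all function names are assumptions chosen for illustration; a real engine would use a proper forecast and call the platform's own scaling interface.

```python
import math


def forecast_peak(past_peaks):
    """Naive forecast: expect the highest peak seen in comparable past periods."""
    if not past_peaks:
        raise ValueError("need at least one past peak")
    return max(past_peaks)


def required_instances(expected_tps, capacity_per_instance, headroom=0.25):
    """Instances needed to serve the expected load plus a safety margin."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = expected_tps * (1 + headroom) / capacity_per_instance
    return max(1, math.ceil(needed))


def instances_to_add(current, expected_tps, capacity_per_instance, headroom=0.25):
    """How many instances to provision ahead of the expected load (never negative)."""
    target = required_instances(expected_tps, capacity_per_instance, headroom)
    return max(0, target - current)


# Illustrative numbers: past month-end peaks in transactions per second.
expected = forecast_peak([800, 950, 1000])
extra = instances_to_add(current=5, expected_tps=expected, capacity_per_instance=200)
```

The sketch only scales up. Scaling back down after the peak is a separate decision, and is usually made more conservatively to avoid removing capacity too early.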
The reality of today’s digital technology landscape is that outages have become extremely common. For high-velocity industries like banking, even a few minutes of downtime can halt the everyday activities of millions of customers. Unlike most industries, banks suffer not only revenue losses but also significant regulatory penalties. Zero downtime is the only acceptable goal for banks today.
To achieve zero downtime, banks need thoughtful, scalable, and stable AIOps.