Resolving a Critical Incident in Core Banking: A Deep Dive into Application Patch Malfunction

Feb 9, 2024

Core banking systems must run without interruption, yet unforeseen complications can still cause critical incidents that demand immediate and effective resolution. A recent incident involving an application patch malfunction is a useful case study in diagnosing and resolving system anomalies in real time.

Incident Overview

The incident began when an unusual spike in database event counts was observed in one of our client's Core Banking Databases (DB), which led to high CPU utilization on the DB server nodes. At the same time, the live server count for the Finacle service rose sharply on the Application Nodes. The Finacle service, a critical component of the Finacle banking system, posts transactions to the database. The live server is a child process of this service, and its instance count jumped from the normal 100 to 900 on both Application Nodes.
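A check of this kind can be sketched in a few lines of Python. This is a minimal illustration rather than the monitoring actually used in the incident: the node names, the `check_process_counts` helper, and the 2x alert ratio are assumptions. Only the baseline of 100 and the observed count of 900 come from the incident itself.

```python
from dataclasses import dataclass


@dataclass
class CountAlert:
    node: str
    count: int
    baseline: int
    ratio: float


def check_process_counts(counts, baseline=100, max_ratio=2.0):
    """Return an alert for every node whose count exceeds baseline * max_ratio.

    `counts` maps a node name to its current live server count.
    `max_ratio` is an assumed threshold, not a value from the incident.
    """
    alerts = []
    for node, count in sorted(counts.items()):
        ratio = count / baseline
        if ratio > max_ratio:
            alerts.append(CountAlert(node, count, baseline, ratio))
    return alerts


# Hypothetical node names; the counts mirror the figures in the overview.
observed = {"app-node-1": 900, "app-node-2": 900}
for alert in check_process_counts(observed):
    print(f"{alert.node}: {alert.count} live servers ({alert.ratio:.1f}x baseline)")
```

Comparing against a known baseline, rather than a fixed absolute limit, keeps the check meaningful if the normal count changes after a capacity adjustment.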

Impact Analysis

The immediate impact was profound, leading to slowness, timeouts, and failures in core transactions. These disruptions were primarily attributed to the escalated CPU utilization on DB Nodes and the surge in live server counts on Application Nodes, specifically for the Finacle service.

Observational Insights

A detailed analysis revealed that starting from 9:00 AM, there was a gradual uptick in DB wait events on both DB servers, which swiftly escalated. By 9:03 AM, CPU utilization on the DB servers had soared from 20% to a staggering 100%. Simultaneously, the live server count for the Finacle service began its unprecedented climb.
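A rise this steep is easier to catch with a rate-of-change check than with a static threshold. The sketch below assumes CPU samples arrive as `(timestamp, percent)` pairs in time order. The function name, the five-minute window, and the 50-point rise are illustrative choices. The date and the intermediate sample values are placeholders: only the 20% starting level and the 100% reading at 9:03 AM come from the observations above.

```python
from datetime import datetime, timedelta


def first_rapid_rise(samples, window=timedelta(minutes=5), min_rise=50.0):
    """Return the timestamp of the first sample that sits at least `min_rise`
    points above the lowest sample seen in the preceding `window`, or None.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    for i, (ts, value) in enumerate(samples):
        recent = [v for t, v in samples[:i] if ts - t <= window]
        if recent and value - min(recent) >= min_rise:
            return ts
    return None


day = datetime(2024, 1, 1)  # placeholder date, not the incident date
samples = [
    (day.replace(hour=8, minute=58), 20.0),
    (day.replace(hour=9, minute=0), 22.0),   # illustrative
    (day.replace(hour=9, minute=1), 45.0),   # illustrative
    (day.replace(hour=9, minute=2), 78.0),   # illustrative
    (day.replace(hour=9, minute=3), 100.0),
]

print(first_rapid_rise(samples))
```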

Further analysis identified the top DB Wait Events contributing to the issue:

  • GC Buffer Busy Acquire: This event signals that a session is waiting for a block that another session on the same instance has already requested through the global cache from another instance.
  • Read by Other Session: This event signals that a session is waiting for a block that another session is already reading from disk into the Oracle buffer cache. A high count means many sessions want the same blocks at the same time, which often points to inefficient or heavily concurrent queries.
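Ranking wait events by the number of sessions waiting on them is a simple aggregation. The sketch below assumes the session rows have already been fetched as dictionaries with an `event` key, for example from Oracle's `gv$session` view. The session counts in the sample data are invented for illustration and are not figures from the incident.

```python
from collections import Counter


def top_wait_events(session_rows, n=5):
    """Rank wait events by the number of sessions currently waiting on them."""
    counts = Counter(row["event"] for row in session_rows if row.get("event"))
    return counts.most_common(n)


# Invented sample data: the split between events is not from the incident.
rows = (
    [{"sid": i, "event": "gc buffer busy acquire"} for i in range(500)]
    + [{"sid": 500 + i, "event": "read by other session"} for i in range(300)]
    + [{"sid": 800 + i, "event": "db file sequential read"} for i in range(20)]
)

for event, sessions in top_wait_events(rows, n=3):
    print(f"{sessions:5d}  {event}")
```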

Root Cause Analysis

The investigation pointed to block contention, where multiple concurrent sessions attempted to read the same blocks, as a pivotal factor. Delving deeper, we found that over 800 sessions originated from a single menu invocation and were generating these wait events.

By analyzing the “pstack” outputs of the Process IDs (PIDs) on the application server, we discovered that all live server PIDs were linked to the HPBP Menu. This crucial finding suggested a direct correlation between the HPBP Menu invocation and the heightened live server counts.
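When hundreds of PIDs are involved, counting how many stacks share a given frame makes the common code path stand out. The sketch below assumes Linux-style `pstack` output, with frames such as `#1  0x... in func () from lib`. The frame name `hpbp_menu_handler` is hypothetical, since the actual symbol names in the Finacle binaries are not given here.

```python
import re
from collections import Counter

# Matches gdb-style frames: "#1  0x0000... in func_name (" or "#0  func_name ("
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def functions_in_stack(pstack_text):
    """Return the set of function names that appear in one pstack dump."""
    funcs = set()
    for line in pstack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            funcs.add(match.group(1))
    return funcs


def shared_frames(stacks_by_pid):
    """Count, for each function name, how many PIDs have it in their stack."""
    counts = Counter()
    for text in stacks_by_pid.values():
        counts.update(functions_in_stack(text))
    return counts


# Hypothetical dumps; "hpbp_menu_handler" is an invented symbol name.
sample = (
    "#0  0x00007f0000001000 in read () from /lib64/libc.so.6\n"
    "#1  0x0000000000401000 in hpbp_menu_handler ()\n"
    "#2  0x0000000000400500 in main ()\n"
)
stacks = {1001: sample, 1002: sample, 1003: sample}

counts = shared_frames(stacks)
total = len(stacks)
for func, pids in counts.most_common():
    print(f"{func}: present in {pids}/{total} stacks")
```

A frame that appears in every stack, other than generic entry points such as `main`, is a strong candidate for the shared code path.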

Solution and Resolution

The breakthrough came when we established that a customization patch deployed for the HPBP menu the previous evening was the catalyst for the incident. By reverting this patch, we successfully mitigated the issue, restoring normalcy to our core banking operations.
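Linking an incident to a recent deployment is faster when the change log can be filtered by time. The sketch below assumes change records are available as dictionaries with `name` and `deployed_at` fields. The record names, the timestamps, and the 24-hour lookback are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def changes_before(incident_start, changes, lookback=timedelta(hours=24)):
    """Return changes deployed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [
        c for c in changes
        if window_start <= c["deployed_at"] <= incident_start
    ]
    return sorted(recent, key=lambda c: c["deployed_at"], reverse=True)


# Hypothetical change log with placeholder dates.
incident_start = datetime(2024, 1, 2, 9, 0)
change_log = [
    {"name": "HPBP menu customization patch", "deployed_at": datetime(2024, 1, 1, 20, 0)},
    {"name": "OS security update", "deployed_at": datetime(2023, 12, 20, 22, 0)},
]

for change in changes_before(incident_start, change_log):
    print(change["deployed_at"], change["name"])
```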

This incident underscores the critical importance of rigorous testing and monitoring, especially post-patch deployment, in core banking systems. It highlights the intricate relationship between application services and database performance, emphasizing the need for a holistic approach to system maintenance and incident resolution. Through collaborative analysis and swift action, we were able to rectify a potentially debilitating issue, reinforcing the resilience and reliability of our banking services.

About HEAL Software

HEAL Software is a provider of AIOps (Artificial Intelligence for IT Operations) solutions. Its use of AI and automation helps IT teams address IT challenges, improve incident management, reduce downtime, and keep IT operations running smoothly. By analyzing extensive data, our solutions provide real-time insights, predictive analytics, and automated remediation, enabling proactive monitoring and solution recommendation. Other features include anomaly detection, capacity forecasting, root cause analysis, and event correlation. With its AIOps solutions, HEAL Software drives digital transformation and delivers value to businesses across diverse industries.