How We Fixed a Big Memory Problem on an App Server Written in C++

Feb 2, 2024

In server management, high memory utilization is more than just a metric; it’s like a lighthouse signaling potential performance degradation, service disruption, and, in severe cases, complete system downtime.

Here we delve into a recent incident involving an App Server for one of our customers, which underscores the criticality of proactive monitoring, swift incident response, and strategic problem resolution.

Incident Overview

The heart of the problem was a sudden and significant spike in memory utilization on an App Server, a fundamental component hosting core, channel, and referral services for our customer. Memory utilization soared above 80%, triggering a series of Medium and High alerts and raising the prospect of transactional slowness, timeouts, and outright failures. Given the server’s integral role, the stakes were high: a server hang could have paralyzed operations.

The Unfolding Scenario

On an exceptionally transaction-heavy day, the IT Operations (ITOPs) team observed an alarming trend: memory utilization on the App Server, typically stable, began to climb abruptly. Within a mere 10 minutes, it rocketed from 38% to over 70%, triggering Medium alerts, and then swiftly breached the 80% mark, escalating to High alerts. This rapid escalation was out of the ordinary and demanded immediate attention.
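The alert escalation described above can be sketched in a few lines of C++. This is only an illustration: the 70% and 80% cut-offs are read off the narrative, and the names `AlertLevel`, `Thresholds`, `classify`, and `climb_rate` are hypothetical, not part of any monitoring product's API.

```cpp
#include <cassert>

// Alert levels named in the incident write-up.
enum class AlertLevel { None, Medium, High };

// Assumed thresholds: Medium above 70%, High above 80% memory utilization.
// The real monitoring configuration may differ.
struct Thresholds {
    double medium_pct = 70.0;
    double high_pct = 80.0;
};

// Maps a memory-utilization sample (in percent) to an alert level.
AlertLevel classify(double used_pct, const Thresholds& t = Thresholds{}) {
    assert(t.medium_pct < t.high_pct);
    if (used_pct > t.high_pct) return AlertLevel::High;
    if (used_pct > t.medium_pct) return AlertLevel::Medium;
    return AlertLevel::None;
}

// Rate of change between two samples, in percentage points per minute.
// A steep rate can be worth flagging before any fixed threshold is crossed.
double climb_rate(double earlier_pct, double later_pct, double minutes) {
    assert(minutes > 0.0);
    return (later_pct - earlier_pct) / minutes;
}
```

With the figures from this incident, a climb from 38% to 70% in 10 minutes works out to roughly 3.2 percentage points per minute, which is the kind of slope that distinguishes an abrupt spike from ordinary load growth.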

HEAL Team Intervention

The HEAL team, our dedicated troubleshooting squad tasked with dissecting and diagnosing the root causes of such anomalies, stepped in. Our analysis revealed that specific processes and services were consuming a disproportionate share of memory, between 28% and 42% of it. A deeper dive unearthed a correlation: the memory spikes were tied to the invocation of a particular application from the server’s menu, and they coincided with database lock incidents.

Investigation of Root Cause

The investigative lens narrowed on the application in question, revealing a critical flaw: each time branch users accessed this application for account inquiries, it fetched an overwhelming volume of records, more than 2,500 at a time, straining the server’s memory resources to their limits.

Crafting a Solution Recommendation

The path to resolution formed around a strategic adjustment: the introduction of a custom parameter within the menu options, designed to temper the data fetching frenzy. By configuring this parameter to split data retrieval into manageable batches, both in the service start script and the menu options, the solution promised to alleviate the memory strain. This batching approach, applicable across all inquiry menus, was poised to not only address the immediate issue but also strengthen the system’s resilience against similar challenges in the future.
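To make the batching idea concrete, here is a minimal C++ sketch of fetching records in fixed-size batches rather than all at once. It is not the customer's actual code: `AccountRecord`, `FetchPage`, `HandleBatch`, and `fetch_in_batches` are hypothetical names, the data-access callback stands in for whatever query layer the application uses, and `batch_size` plays the role of the custom parameter described above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record shape; the real account-inquiry record is not described.
struct AccountRecord {
    long id;
    std::string summary;
};

// Hypothetical data-access callback: returns at most `limit` records,
// starting at position `offset` in the result set.
using FetchPage =
    std::function<std::vector<AccountRecord>(std::size_t offset, std::size_t limit)>;

// Callback that consumes one batch (for example, renders or forwards it).
using HandleBatch = std::function<void(const std::vector<AccountRecord>&)>;

// Pulls records in batches of `batch_size` and hands each batch to `handle`.
// Only one batch is held in memory at a time. Returns the total record count.
std::size_t fetch_in_batches(const FetchPage& fetch,
                             const HandleBatch& handle,
                             std::size_t batch_size) {
    assert(batch_size > 0);
    std::size_t offset = 0;
    for (;;) {
        std::vector<AccountRecord> batch = fetch(offset, batch_size);
        if (batch.empty()) break;              // nothing left to read
        handle(batch);
        offset += batch.size();
        if (batch.size() < batch_size) break;  // short batch: end of data
    }
    return offset;
}
```

Under these assumptions, an inquiry that previously loaded 2,500 records in one go would, with a batch size of 500, be served in five batches, so peak memory for the result set is bounded by the batch size rather than by the total number of matching records.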

Lessons Learned and Paths Forward

This incident serves as a reminder of the complex interplay between software functionality and system resources. The swift identification and resolution of the memory utilization issue underscore the importance of alert monitoring, rapid response, and, crucially, a deep understanding of the interconnectedness of system components. The recommended solution not only restored operational stability but also illuminated a path toward more sustainable system management practices.

As we reflect on this case, the broader implications for IT operations and server management become clear.

Proactive measures, continuous monitoring, and a readiness to delve deep into the system’s workings are not just recommended practices; they are indispensable pillars supporting the integrity and reliability of our digital infrastructures.

About HEAL Software

HEAL Software is a renowned provider of AIOps (Artificial Intelligence for IT Operations) solutions. HEAL Software’s unwavering dedication to leveraging AI and automation empowers IT teams to address IT challenges, enhance incident management, reduce downtime, and ensure seamless IT operations. Through the analysis of extensive data, our solutions provide real-time insights, predictive analytics, and automated remediation, thereby enabling proactive monitoring and solution recommendation. Other features include anomaly detection, capacity forecasting, root cause analysis, and event correlation. With its state-of-the-art AIOps solutions, HEAL Software consistently drives digital transformation and delivers significant value to businesses across diverse industries.