HEAL Software – Understanding the Unknown Unknowns

by | Aug 16, 2024

Challenges Organizations Face in Identifying Unknown Unknowns

The term “unknown unknowns” refers to problems or vulnerabilities that have not yet been identified or anticipated. Unlike known issues, which can be addressed with existing knowledge and tools, unknown unknowns require a different approach to detection and resolution. These hidden issues are often beneath the surface, only becoming apparent when they cause significant disruption. The ability to uncover these unknowns is critical for maintaining system stability and preventing unexpected outages or performance degradation.

Organizations face numerous challenges when identifying unknown unknowns within their IT infrastructure. Modern environments are complex, with numerous interconnected systems, applications, and networks. This web of dependencies makes it difficult to trace the origins of issues, often resulting in prolonged downtime. The volume of data generated by applications, logs, and performance metrics can also be overwhelming. In addition, IT systems are dynamic: frequent updates, new integrations, and configuration changes introduce new variables that can lead to unforeseen issues.

Understanding and addressing unknown unknowns helps prevent downtime by identifying and resolving issues before they escalate into major incidents, ensuring that systems run smoothly and efficiently. Security is another critical aspect; unknown vulnerabilities can be exploited by malicious agents, leading to potentially severe breaches. Proactively addressing these issues can result in significant cost savings by avoiding the expenses associated with downtime, lost productivity, and emergency fixes.

Unanticipated Transaction Processing Delays

A common scenario is a bank that maintains a complex IT infrastructure to support its online banking services, which include real-time transaction processing, account management, and customer service functions. The bank’s IT infrastructure includes a mix of legacy mainframe systems, modern microservices, distributed databases, and multiple third-party service integrations.

One such bank started receiving intermittent customer complaints about significant delays in processing transactions. These delays were not consistent and did not occur at predictable times, making them difficult to diagnose. Traditional monitoring and logging tools did not reveal any clear patterns or obvious causes for the delays.

The transaction delays were intermittent, sometimes occurring during peak hours and sometimes during off-peak times. Different types of transactions (e.g., fund transfers, bill payments) experienced varying levels of delay. Initial checks covered server load, network latency, and database performance.

Observations

Intermittent high response times from external services caused cascading delays in transaction processing. Certain transaction types induced higher load on specific database tables, leading to locking issues. Inefficient retry mechanisms in the microservices handling external API calls added to the processing delays.

How HEAL Software Can Identify – Mean Time to Identify (MTTI)

Continuous Monitoring and Data Collection

Based on the observations, HEAL Software is designed to continuously monitor the bank’s entire IT environment. It gathers comprehensive data from a wide array of sources, including:

  • Transaction Data: From the core banking system, capturing detailed information about each transaction.
  • Database, Application, and Network Data: From the database, application servers, and network components.
  • Interactions Data: From interactions with third-party services, detailing request and response times.

HEAL Software uses the data from each dataset to correlate and trace the flow of issues through different components of the IT environment.

  1. Identifying Anomalies:
    • Transaction Data (TXN123457): Intermittent spikes in transaction processing times at 10:01:45 due to network timeouts.
    • Database Data (QRY1236): A failed query at 10:01:10 due to a timeout, indicating high lock contention or other database performance issues.
    • Interactions Data (INT123457): A failed connection to the Credit Scoring service at 10:01:45, suggesting that third-party API interactions are unreliable.
    • Application Servers (APP03): High CPU usage and error rates at 10:01:10, contributing to increased response times and potential processing delays.
    • Network Components (NET03): Slight packet loss at 10:01:10, suggesting network instability which could impact both database queries and API calls.
  2. Causal Analysis:
    • Database Performance Issues: The failed query (QRY1236) at 10:01:10 due to a timeout indicates high lock contention or inefficient query execution, affecting overall transaction processing times.
    • Third-Party API Reliability: The failed API call (INT123457) to the Credit Scoring service at 10:01:45 suggests that third-party services are unreliable and contribute to processing delays.
    • High CPU Usage on Application Server: The high CPU usage and error rates (APP03) at 10:01:10 indicate that the application server is under heavy load, further slowing down transaction processing.
    • Network Instability: The slight packet loss (NET03) at 10:01:10 indicates network issues, which can cause delays in both database queries and API calls.
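
The time-based correlation behind the two lists above can be illustrated with a minimal Python sketch. It is not HEAL Software's actual implementation: the event IDs and timestamps come from the anomaly list, while the field names, the `correlate_by_time` function, and the 5-second window are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Events from the anomaly list above; the dictionary field names are hypothetical.
EVENTS = [
    {"id": "TXN123457", "source": "transaction", "time": "10:01:45"},
    {"id": "QRY1236", "source": "database", "time": "10:01:10"},
    {"id": "INT123457", "source": "interaction", "time": "10:01:45"},
    {"id": "APP03", "source": "app_server", "time": "10:01:10"},
    {"id": "NET03", "source": "network", "time": "10:01:10"},
]


def parse_time(value):
    """Parse an HH:MM:SS string into a datetime (the date part is ignored)."""
    return datetime.strptime(value, "%H:%M:%S")


def correlate_by_time(events, window_seconds=5):
    """Group events whose timestamps lie within window_seconds of the
    previous event in the same group (assumed window, tune as needed)."""
    ordered = sorted(events, key=lambda e: parse_time(e["time"]))
    groups = []
    for event in ordered:
        ts = parse_time(event["time"])
        if groups and ts - groups[-1]["end"] <= timedelta(seconds=window_seconds):
            groups[-1]["events"].append(event)
            groups[-1]["end"] = ts
        else:
            groups.append({"start": ts, "end": ts, "events": [event]})
    return groups


def group_ids(groups):
    """Return the sorted event IDs of each group, earliest group first."""
    return [sorted(e["id"] for e in g["events"]) for g in groups]


if __name__ == "__main__":
    for ids in group_ids(correlate_by_time(EVENTS)):
        print(ids)
```

With a 5-second window, the events fall into two clusters: the database, application server, and network events at 10:01:10, followed by the transaction and third-party interaction events at 10:01:45. The earlier cluster is a natural place to start looking for causes, although ordering in time alone does not prove causation.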

How HEAL Software Can Resolve – Mean Time to Resolve (MTTR)

Automated Remediation Suggestions:

    • Optimize Database Queries: HEAL suggested optimizing the query (QRY1236) that timed out and caused high lock contention.
    • Improve Third-Party API Handling: HEAL recommended improving the timeout settings and retry logic for the Credit Scoring API (INT123457).
    • Enhance Server Performance: HEAL advised addressing high CPU usage on the application server (APP03) to prevent high error rates and improve response times.
    • Network Stability: HEAL recommended investigating and mitigating the slight packet loss (NET03) to ensure network stability.
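
As an illustration of what improved retry logic for a third-party call might look like, the following Python sketch retries with capped exponential backoff and jitter. It is one common pattern, not the remediation HEAL actually generated: `TransientError`, `call_with_retry`, `flaky_credit_score`, and all delay values are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts."""


def call_with_retry(func, max_attempts=3, base_delay=0.2, max_delay=2.0,
                    sleep=time.sleep):
    """Call func(); on TransientError, wait with capped exponential backoff
    plus random jitter, then retry up to max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up so the caller can fail fast or fall back
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Stand-in for a Credit Scoring call that times out twice, then succeeds.
attempts = {"count": 0}


def flaky_credit_score():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("timed out")
    return "ok"


if __name__ == "__main__":
    print(call_with_retry(flaky_credit_score))
```

Bounding both the number of attempts and the maximum delay keeps a slow external service from holding transactions open indefinitely, which is the kind of cascading delay described in the observations.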

The correlation across transaction data, database data, interaction data, application server data, and network components allows HEAL Software to:

  • Detect anomalies in real time by correlating events that occur close together in time.
  • Perform causal analysis to trace issues back to their root causes.
  • Provide automated remediation suggestions to resolve the identified issues and prevent future occurrences.
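
To make the anomaly detection step more concrete, here is a small Python sketch that flags spikes in transaction processing times using a modified z-score (median and median absolute deviation). This is a generic statistical technique chosen for illustration, not a description of HEAL's detection models. The latency values are made up, and `find_spikes` and the 3.5 threshold are assumptions.

```python
from statistics import median

# Illustrative processing times in milliseconds (made-up values, not real data).
LATENCIES_MS = [120, 118, 125, 122, 119, 900, 121, 123]


def find_spikes(samples, threshold=3.5):
    """Return the indices of samples whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single large spike does not distort the baseline.
    """
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        return []  # no spread in the data, nothing to compare against
    return [i for i, x in enumerate(samples)
            if 0.6745 * (x - med) / mad > threshold]


if __name__ == "__main__":
    print(find_spikes(LATENCIES_MS))
```

On the sample values, only the 900 ms reading (index 5) is flagged. A production system would typically apply this kind of check over a sliding window per transaction type rather than over one fixed batch.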

The root cause of the issue is a combination of high lock contention in the database, unreliable third-party API interactions, high CPU usage on the application server, and network instability. These factors interact and compound each other, leading to intermittent spikes in transaction processing times.

By addressing these root causes, HEAL minimizes system downtime and its associated costs, ensuring business continuity. The platform’s proactive approach ensures that systems run efficiently, enhancing overall performance and user satisfaction.

Our advanced AI and machine learning capabilities ensure that unknown unknowns are identified and addressed swiftly and effectively. With a focus on proactive issue resolution and automated remediation, HEAL Software ensures that your IT systems remain robust, reliable, and efficient.

About HEAL Software

HEAL Software is a renowned provider of AIOps (Artificial Intelligence for IT Operations) solutions. HEAL Software’s unwavering dedication to leveraging AI and automation empowers IT teams to address IT challenges, enhance incident management, reduce downtime, and ensure seamless IT operations. Through the analysis of extensive data, our solutions provide real-time insights, predictive analytics, and automated remediation, thereby enabling proactive monitoring and solution recommendations. Other features include anomaly detection, capacity forecasting, root cause analysis, and event correlation. With its state-of-the-art AIOps solutions, HEAL Software consistently drives digital transformation and delivers significant value to businesses across diverse industries.