The Microsoft-CrowdStrike Outage: An In-Depth Analysis

Approximately 70% of Fortune 100 companies were impacted, and industries such as airlines, banking, healthcare, and media were particularly hard-hit. The outage resulted in substantial financial losses across various sectors. The downtime not only led to direct revenue loss but also incurred costs related to system recovery and manual intervention. Estimates of the economic impact were in the billions, with long-term repercussions on customer trust and operational efficiency.

The Sequence of the Incident

The root cause was traced to a defective update to CrowdStrike’s Falcon Sensor. Because the sensor requires low-level access to the system, a malfunction in it can crash the entire machine.

The Technical Details

The first crucial step was identifying the root cause. This took approximately 30 to 45 minutes. During that time no one could pinpoint the exact problem, not even infrastructure and application monitoring tools such as Dynatrace, New Relic, SolarWinds, Splunk, BigPanda, or HEAL Software, because the affected systems could not boot and register logs.

All these monitoring tools rely on the ability to collect and analyze logs, metrics, and traces from running systems. These tools create a comprehensive view of the IT ecosystem by continuously ingesting data from various sources. This data is then used to identify anomalies, performance issues, and potential threats in real-time. For these tools to function effectively, the systems they monitor need to be operational. They rely on the systems being able to boot up, run processes, and generate logs.
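As a rough illustration of the anomaly detection step described above, the sketch below flags metric samples that sit far from the mean of a series. It is a simplified z-score check written for this article, not the method used by any of the tools named here; the function name `find_anomalies`, the threshold of 3.0, and the sample values are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(samples, threshold=3.0):
    """Return indices of samples more than `threshold` standard
    deviations away from the mean of the series (hypothetical helper)."""
    if len(samples) < 2:
        return []  # not enough data to estimate spread
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, x in enumerate(samples) if abs(x - mu) / sigma > threshold]


# Made-up CPU readings: twenty steady samples followed by one spike.
cpu_percent = [50.0] * 20 + [500.0]
spikes = find_anomalies(cpu_percent)
```

A check like this only works while samples keep arriving, which is the dependency the paragraph above describes.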

When a system experiences a severe failure, it disrupts this process. The systems crash and fail to generate new logs, leaving monitoring tools without the necessary data to analyze. These crashes were immediate and widespread, affecting millions of workstations globally. As the systems were unable to boot, they could not log new events or provide telemetry data to monitoring tools.
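The practical consequence can be sketched in a few lines. When hosts cannot boot, the only signal left on the collector side is silence, so the most a monitoring backend can report is which machines have stopped checking in. The example below is a minimal sketch under that assumption; `silent_hosts`, the five-minute window, the host names, and the timestamps are hypothetical and do not reflect any vendor's implementation or data from the incident.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the hosts whose most recent heartbeat is older than
    `max_silence`, sorted by name (hypothetical helper)."""
    return sorted(
        host for host, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Made-up collector state: one healthy host, two that stopped reporting.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "ws-001": now - timedelta(minutes=1),
    "ws-002": now - timedelta(minutes=42),
    "ws-003": now - timedelta(minutes=30),
}
missing = silent_hosts(last_seen, now)
```

A check of this kind shows that a host has gone quiet, but not why. The logs that would explain the crash are never written, which is consistent with the difficulty in identifying the root cause described above.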

Can these kinds of outages be avoided?

Even large organizations like CrowdStrike, which adhere to DevOps/SDLC best practices, can cause significant outages when rolling out updates. Complex systems and the dynamic nature of cybersecurity threats mean that unforeseen interactions and rare bugs can still slip through. In this case, despite rigorous procedures, a logic error in the update led to a widespread outage. This incident underscores the need for continuous improvement and adaptation in even the most robust processes.

The Microsoft-CrowdStrike outage of July 2024 was a wake-up call for the tech industry. It highlighted vulnerabilities in even the most robust security systems and underscored the importance of continuous improvement in quality assurance processes, real-time monitoring, and incident response strategies.

About HEAL Software

HEAL Software is a renowned provider of AIOps (Artificial Intelligence for IT Operations) solutions. HEAL Software’s unwavering dedication to leveraging AI and automation empowers IT teams to address IT challenges, enhance incident management, reduce downtime, and ensure seamless IT operations. By analyzing extensive data, its solutions provide real-time insights, predictive analytics, and automated remediation, enabling proactive monitoring and solution recommendations. Other features include anomaly detection, capacity forecasting, root cause analysis, and event correlation. With its state-of-the-art AIOps solutions, HEAL Software consistently drives digital transformation and delivers significant value to businesses across diverse industries.
