How CrowdStrike Caused Global Microsoft Outages

Introduction

In July 2024, Windows systems around the world suffered significant outages that disrupted services globally, with a pronounced impact in the Philippines. From banks to airlines like Cebu Pacific, the ripple effect of these outages was felt far and wide. This blog post delves into the root cause of these disruptions, examining the role of CrowdStrike, the intricacies of memory management in C++, and how a seemingly small error can lead to major system failures.

The Culprit: A NULL Pointer in C++

Memory management is a critical aspect of software development, especially in languages like C++ that offer direct control over memory allocation and deallocation. In this case, the issue appears to have stemmed from a NULL pointer, that is, a pointer that does not point to any valid memory location. Let’s break down what this means and why it’s significant.

Crowdstrike Analysts Photo by Zach Perpetualmaniac’s Status

What is a NULL Pointer?

In C++, a pointer is a variable that holds the memory address of another object. NULL is a special pointer value that means the pointer is not pointing to any valid memory location. It’s like having an address that says “nowhere.” Because NULL is distinguishable from every valid address, a program can compare a pointer against it before attempting to access the memory it points to.
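
As a minimal sketch of that check, the snippet below uses two hypothetical helpers (find_value and value_or are names made up for this post, not part of any library). One returns "nowhere" when it has no result, and the other only dereferences a pointer after testing it. It uses nullptr, the modern C++ spelling of NULL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration: returns a pointer to the first
// element equal to `target`, or nullptr ("nowhere") when there is none.
const int* find_value(const int* data, std::size_t count, int target) {
    assert(data != nullptr || count == 0);  // precondition on the input array
    for (std::size_t i = 0; i < count; ++i) {
        if (data[i] == target) {
            return &data[i];  // a valid address inside the array
        }
    }
    return nullptr;  // signals "no result" to the caller
}

// Dereferences the pointer only after confirming it is not null.
int value_or(const int* ptr, int fallback) {
    return ptr != nullptr ? *ptr : fallback;
}
```

A caller can then write value_or(find_value(values, count, 7), -1) and get -1 back instead of a crash when 7 is not present.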

Why is NULL Dangerous?

When a program tries to read or write through a NULL pointer, the C++ standard calls the result undefined behavior. In practice, the processor faults because nothing is mapped at that address. In an ordinary Windows application, this terminates the offending process with an access violation. In code that runs inside the kernel, such as a driver, there is no process to terminate safely, so Windows halts the whole system with the dreaded Blue Screen of Death (BSOD) to prevent further damage.

The Specific Issue: Address 0x9c

In the case of the Microsoft outages, the problematic memory address was 0x9c (or 156 in decimal). This address sits in the lowest part of the address space, just above zero, which Windows never maps to usable memory, so it is invalid for any program. When the system attempted to read from this address, it resulted in a crash.

Understanding the Programmer’s Error

To understand why this happened, let’s consider how pointers and memory addresses work in C++. Here’s a simplified example:

struct Obj {
  int a;
  int b;
};

Obj* obj = new Obj(); // obj now points to a valid memory location

If obj were to be set to NULL:

Obj* obj = NULL; // obj now points to "nowhere"

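Attempting to access obj->a when obj is NULL would try to read from address 0, and obj->b would read a few bytes past it, because a member’s address is the pointer’s value plus the member’s offset within the struct. Both are invalid memory locations, so either access leads to a crash.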
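
To see how a small address such as 0x9c can come out of a NULL pointer, the sketch below computes the address a member read would touch, without actually dereferencing anything. The Record layout and the address_of_flags helper are hypothetical, chosen only so that one member lands at offset 0x9c. They are not the layout of any real driver structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout, chosen only so that `flags` lands at offset 0x9c.
struct Record {
    unsigned char header[0x9c];  // 156 bytes of leading data
    std::uint32_t flags;         // starts at byte offset 0x9c
};

static_assert(offsetof(Record, flags) == 0x9c, "flags should sit at offset 0x9c");
static_assert(0x9c == 156, "0x9c is 156 in decimal");

// Computes the address that reading `flags` would touch for a Record located
// at `base`: the base address plus the member's offset. Nothing is dereferenced.
std::uintptr_t address_of_flags(std::uintptr_t base) {
    return base + offsetof(Record, flags);
}
```

With a base of 0, which is what a NULL pointer typically holds, address_of_flags(0) is exactly 0x9c. Under this assumed layout, a read of record->flags through a NULL pointer would fault at that address.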

The Stack Dump

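A stack dump is a snapshot of the memory and the state of the program at the time of a crash. In this case, the stack dump showed that the program tried to read memory address 0x9c. An address that small is what you get when code reads a member through a NULL pointer: the base address is 0, so the address being read is simply the member’s offset within its structure. The most likely explanation is that the programmer did not check whether the pointer was NULL before accessing its members. The result was an invalid memory access, causing the system to crash.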
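
The missing check is small. Here is a minimal sketch of what it could look like for the Obj struct from earlier, assuming C++17 for std::optional. The read_b function is a hypothetical accessor written for this post, not code from the actual driver.

```cpp
#include <cassert>
#include <optional>

struct Obj {
    int a;
    int b;
};

// Hypothetical accessor: checks the pointer before touching its members and
// reports "no value" instead of reading through a null pointer.
std::optional<int> read_b(const Obj* obj) {
    if (obj == nullptr) {
        return std::nullopt;  // caller decides how to handle the missing object
    }
    return obj->b;
}
```

Returning an empty optional forces the caller to handle the "no object" case explicitly, rather than letting an invalid read happen deep inside the code.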

The Role of System Drivers

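System drivers operate with privileged access to the computer’s hardware and core functions. Because a driver runs inside the kernel, the operating system cannot simply terminate it the way it would a misbehaving application. When a driver hits an error like the one described, Windows has to assume its own state may be corrupted and stops the entire system. This is why most BSODs are caused by issues in system drivers.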

Preventive Measures

To prevent such issues in the future, Microsoft and CrowdStrike can implement several measures:

Better Rollback Policies: Microsoft should improve its policies so that defective drivers can be rolled back quickly, minimizing the impact on users.
Code Sanitization Tools: CrowdStrike should use modern tools, such as static analyzers and sanitizers, that automatically flag possible NULL pointer accesses and other common errors during development, so that such issues are caught before release.

Conclusion

The recent Microsoft outages underscore the importance of robust error checking and memory management practices in software development. By understanding the root causes and implementing preventive measures, we can work towards more stable and reliable systems. This incident serves as a reminder of the delicate balance in software engineering, where even small errors can have widespread consequences.