CrowdStrike: What happened, why did it happen, and how can we stop it from happening again?

(Attention conservation notice: long explanation of the causes of the July 19, 2024 CrowdStrike outage, including outsider-friendly explanations of basic low-level programming and OS development concepts that are at least partially skimmable if you already have a relevant programming background. If you don't feel like reading lots of technical details, then this may not be the article for you, though I think the details are important for the public and, especially, policymakers to understand.)

I. Introduction

On Friday, July 19th, at 04:09 UTC, just after midnight New York time, millions of computers around the world crashed almost simultaneously. The cause was an update to CrowdStrike Falcon, an Endpoint Detection and Response (EDR) program developed by the cybersecurity firm CrowdStrike, which monitors and logs computer operations in order to thwart cyberattacks. The consequences were severe: air traffic in much of the world came to a near halt, with some ticket agents resorting to handwritten boarding passes. Even more direly, emergency dispatch systems were also crippled, as were several nations' health care systems. Microsoft estimated that 8.5 million computers were affected.

The tech industry already had a reputation in some circles for never letting safety come before profit or expedience, and CrowdStrike's recklessness (precautions such as sending the update only to a few clients who had opted in to bleeding-edge updates, before pushing it to everyone, could have mitigated the damage) will doubtless further convince Big Tech's critics that the industry needs a shorter regulatory leash. But the outage is not a simple story of underregulation: if anything, regulations had a heavy hand in making CrowdStrike, whose products are intrinsically difficult if not impossible to make safe, a single point of failure.

II. CPU workings, and dangerous programming languages

The cause of the CrowdStrike Falcon outage was a configuration file update that caused an invalid pointer dereference: the program tried to access nonexistent data. The term may sound intimidating, but the underlying ideas are simple, and understanding the technical details is worthwhile.

Most of what a computer program does is an interaction between random-access memory or RAM, which stores the program's working data, and the central processing unit or CPU, which manipulates the data. RAM can be understood abstractly as a row of boxes called words, each of which is labeled with an integer address and contains some number of bits: today, usually 64. These bits, in turn, could represent a value of several different types: for instance, an integer, a fractional number (typically represented in a particular format called floating point), or even the address of another word; a word whose value is a memory address is often called a pointer. But as far as the hardware is concerned, a word is just a bit pattern, and programmers have to be careful not to give, say, an integer value to a CPU instruction that will interpret the same bit pattern as a floating-point number.
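
To make "same bits, different readings" concrete, here is a small, purely illustrative C++ program (the particular bit pattern is arbitrary) that copies one 64-bit pattern into variables of two different types and prints both interpretations:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        std::uint64_t bits = 0x4059000000000000;   // one 64-bit pattern
        double as_double;
        // Copy the raw bits into a double without converting the value.
        std::memcpy(&as_double, &bits, sizeof bits);
        // The same word read as an integer and as a floating-point number:
        std::printf("as integer: %llu\n", (unsigned long long)bits);
        std::printf("as double:  %f\n", as_double);   // prints 100.000000
    }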

Pointers let programs build data structures with interlinked components that refer to each other by address. For instance, in the linked list structure for storing an ordered list, every list item has a separate record in RAM with two entries: the item itself, and the address of the next record. (The last record has a placeholder null pointer, which is not a valid RAM address; this is usually address zero on modern computers.)
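
For readers who like to see code, here is a minimal C++ sketch of such a linked list; the names are chosen purely for illustration:

    #include <cstdio>

    // One record of a linked list: the item itself plus a pointer to the
    // next record. A null pointer marks the end of the list.
    struct Node {
        int value;
        Node* next;
    };

    int main() {
        // Build the three-item list 3 -> 1 -> 4 by hand.
        Node third  = {4, nullptr};   // last record: null "next" pointer
        Node second = {1, &third};    // points at the record holding 4
        Node first  = {3, &second};   // points at the record holding 1

        // Walk the list by following each record's "next" pointer.
        for (Node* p = &first; p != nullptr; p = p->next)
            std::printf("%d\n", p->value);
    }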

The CPU, meanwhile, consists essentially of a small set of word-long registers plus logical circuits for executing a set of instructions on registers and RAM. The most important classes of CPU instructions are load and store instructions which move data from a specified RAM address to a register or vice versa; arithmetic operations on registers, such as adding one register to another; and special instructions for moving data to and from peripheral I/O devices such as keyboards and hard disks. A computer program is basically a list of CPU instructions executed, for the most part, in sequence, though there are branch instructions that can jump to another location in the code if some logical condition is satisfied. Branches enable constructs such as loops, if–then statements, and routines: fragments of code that can be executed (or called) from anywhere else, even other programs.
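
As a rough illustration (written in C++, a high-level language of the kind introduced in the next paragraph), here is a short function whose comments note which classes of machine instructions a compiler would typically emit for each line; the exact instruction sequence depends on the CPU and the compiler:

    #include <cstdio>

    // Sum the first n words of an array.
    long sum(const long* data, long n) {
        long total = 0;                  // put 0 in a register
        for (long i = 0; i < n; ++i) {   // compare and branch: leave the loop once i >= n
            total += data[i];            // load data[i] from RAM, then add it to total
        }
        return total;                    // the caller reads the result from a register
    }

    int main() {
        long values[4] = {10, 20, 30, 40};
        std::printf("%ld\n", sum(values, 4));   // prints 100
    }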

Writing CPU instructions (or "machine code") directly is tedious and error-prone, so most programmers work in more concise high-level languages that can be translated into machine code with a special program called a compiler. High-level languages are not just faster to work in but also safer, because compilers can check for certain types of errors. One common safeguard is a type system: in a language such as C++, the compiled language that CrowdStrike Falcon was almost certainly written in, a programmer must give every variable ("variable" basically means "name attached to a location in RAM") a type such as integer, text character, or pointer, and variables can only be used in operations compatible with their type. For example, only variables with pointer types can be dereferenced: that is, have their contents interpreted as a RAM address for the sake of reading or modifying that address's contents.
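
Here is a small C++ illustration of a type system doing its job (the variable names are invented): the compiler accepts dereferences of pointer-typed variables and rejects dereferences of anything else.

    int main() {
        int count = 42;         // an integer variable
        int* where = &count;    // a pointer variable holding count's address

        int copy = *where;      // legal: where has a pointer type, so it can be dereferenced
        // int bad = *count;    // illegal: count is an integer, not a pointer;
                                // the compiler rejects this line if uncommented
        (void)copy;             // silence "unused variable" warnings
        return 0;
    }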

But a type system alone can't stop all errors, and C++'s safety leaves much to be desired. C++ is an old language: it was released in 1985 as an extension of an older, lower-level language, C, which provides only a thin wrapper over machine code. In C and C++, it's easy to construct and dereference pointers to unintended locations, such as by dereferencing a null pointer; such a mistake usually crashes the program. Many modern languages, but not C++, eliminate or confine null pointer errors by augmenting the type system with distinct possibly-null and guaranteed-not-null pointer types. Before dereferencing a possibly-null pointer, the programmer must either check that it is not null (a check that is easy to forget in C++) or use an explicit, clearly marked dangerous dereference operator.
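
Here is a small C++ sketch of this failure mode; the function and names are invented for illustration, not taken from Falcon:

    #include <cstdio>
    #include <map>
    #include <string>

    // Returns a pointer to the stored value, or a null pointer if the key is
    // absent. Nothing in C++'s type system forces callers to check for null.
    const int* find_port(const std::map<std::string, int>& config,
                         const std::string& key) {
        auto it = config.find(key);
        return (it == config.end()) ? nullptr : &it->second;
    }

    int main() {
        std::map<std::string, int> config = {{"http", 80}};

        const int* port = find_port(config, "ssh");   // key absent: null returned
        // Writing *port without the check below would dereference a null
        // pointer and, on a typical OS, crash the program.
        if (port != nullptr)
            std::printf("port = %d\n", *port);
        else
            std::printf("no such key\n");
    }

In a language with distinct possibly-null and not-null pointer types, the equivalent of the unchecked *port simply would not compile.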

There are other ways in C++ to create invalid pointers. For instance, C++ variables can be inadvertently corrupted by careless use of arrays, which are lists whose elements are laid out end-to-end in memory: C++ code to read or modify an array element compiles into machine code that doesn't check that the element actually exists. If a program tries to change the sixth element of a five-element array, the resulting machine code will blithely modify the location outside the array where the sixth element would have been located, possibly overwriting other data. It is also possible in C++ to declare a variable without assigning it an initial value. In this case, the variable's value will be whatever detritus was in the memory location that the compiler assigns to the variable, creating seemingly random program behavior. Languages more modern than C++ have safeguards against both errors: for instance, bounds-checked array access operators that signal an error rather than silently corrupting data if a program tries to modify a nonexistent array element, and stricter requirements that all data be initialized before it's used. (C++ has also added a bounds-checked array type, but many programmers avoid it because the dangerous legacy array type can be slightly faster.)
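
A short C++ sketch of both pitfalls and their safer counterparts; the dangerous lines are commented out so that the program runs cleanly:

    #include <array>
    #include <cstdio>

    int main() {
        int legacy[5] = {1, 2, 3, 4, 5};
        // legacy[5] = 99;      // compiles, but writes past the end of the array,
                                // silently corrupting whatever is stored there

        std::array<int, 5> checked = {1, 2, 3, 4, 5};
        // checked.at(5) = 99;  // the bounds-checked type signals an error
                                // (throws an exception) instead of corrupting memory

        int uninitialized;      // no initial value: contents are whatever bits
                                // happened to be left in that memory location
        int zeroed = 0;         // explicit initialization avoids the problem

        (void)uninitialized;    // reading it would be undefined behavior
        std::printf("%d %d %d\n", legacy[0], checked.at(0), zeroed);
    }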

One or more of these issues, to judge from third-party analysis and CrowdStrike's own reporting, was at fault in the Falcon outage. Because of corruption of a configuration file, Falcon likely incorporated a list of pointers with wrong values into a data structure, and a dereference of one of these pointers caused Falcon to crash. The value of the invalid pointer varied from one instance of the bug to another, suggesting that uninitialized data may have been involved; CrowdStrike itself has said that the fatal pointer dereference was a read from out-of-bounds memory. This analysis is vague, but it's hard to say more without Falcon's C++ source code.
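
For concreteness, here is a deliberately simplified and entirely hypothetical C++ sketch of the kind of bug described above; the structures and names are invented, since Falcon's source code is not public:

    #include <cstdio>

    struct Rule { int id; };

    struct RuleTable {
        const Rule* rules[20];   // pointer table filled in from a configuration file
        int count;               // how many entries were actually filled in
    };

    int read_rule_id(const RuleTable& table, int index) {
        // No check that index < table.count: a bad index reads a slot that was
        // never filled in, yielding a garbage pointer whose dereference crashes.
        const Rule* rule = table.rules[index];
        return rule->id;
    }

    int main() {
        Rule only_rule = {7};
        RuleTable table = {};          // zero-initializes every slot
        table.rules[0] = &only_rule;
        table.count = 1;

        std::printf("%d\n", read_rule_id(table, 0));   // fine: prints 7
        // read_rule_id(table, 5) would dereference a null pointer here, because
        // the table happens to be zeroed; with uninitialized memory, the bogus
        // pointer's value (and the resulting behavior) could be anything.
    }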

The safety problems of C and C++ have been understood for decades, and many projects are phasing out the languages as a result: for instance, the Linux operating system is replacing portions of its C code with Rust, a newer, more safety-conscious language. (CrowdStrike itself has started using Rust for some newer features, though legacy programs such as Falcon are almost certainly still mostly C++.) Modern languages also have richer type systems that can be used, in one phrase, to "make illegal states unrepresentable": data structures can have detailed enough type information that their contents must be valid if the code that constructs them passes the compiler's type checker. It's at least possible that this could have prevented the "bug in the Content Validator" that, according to CrowdStrike's reporting, let the malformed configuration file go undetected.
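
Here is one illustrative C++ sketch of the "illegal states unrepresentable" idea, with names invented for the example: instead of passing around a raw buffer plus an "is it valid?" flag that downstream code might forget to check, the parser returns a type that can only hold either a fully validated structure or an error.

    #include <cstdio>
    #include <string>
    #include <variant>
    #include <vector>

    struct ValidatedConfig {
        std::vector<int> rule_ids;   // only ever built by the parser below
    };
    struct ParseError {
        std::string message;
    };

    // The return type makes it impossible to obtain a ValidatedConfig that did
    // not pass the checks: there is no "maybe valid" state to forget about.
    std::variant<ValidatedConfig, ParseError> parse(const std::string& raw) {
        if (raw.empty())
            return ParseError{"empty configuration file"};
        return ValidatedConfig{{1, 2, 3}};   // stand-in for real parsing
    }

    int main() {
        auto result = parse("");
        if (auto* ok = std::get_if<ValidatedConfig>(&result))
            std::printf("loaded %zu rules\n", ok->rule_ids.size());
        else
            std::printf("error: %s\n",
                        std::get<ParseError>(result).message.c_str());
    }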

More recent versions of C++, similarly, have introduced several opt-in safety features. But the language does not enforce their use, and as long as C++ survives, so too will its pitfalls.

III. Operating systems and the intrinsic danger of EDR software

Everyone reading this has likely seen a computer program crash. Usually, though, a crash just closes the affected program; it doesn't wreck the whole computer. It’s worth understanding why the CrowdStrike Falcon crash was so severe, because it has an important implication: Endpoint Detection and Response software may guard against security threats, but it is unavoidably a security threat itself. This fact emerges from fundamental considerations of operating system design.

An operating system has a tough job. First, it has to run many processes simultaneously, which it does by a bit of illusionism: a CPU can only execute one process at a time, so the OS switches the CPU between processes dozens or hundreds of times a second to create the appearance of simultaneity. Second, it needs to stop buggy or malicious processes from breaking security rules, such as by corrupting other processes’ memory or reading other users’ files. (In programmers' argot, a program is a file of executable code; a process is one instance of a running program.)

Operating systems safeguard data and processes from each other by claiming exclusive rights to many dangerous or abusable actions, such as interacting directly with most hardware components. User processes ("user process" just means "not the OS") can carry out a dangerous action only indirectly, by calling an OS-provided routine called a system call that surrounds the dangerous action with security checks.

One example that illustrates the general problem is the file system. From a user's perspective, data on hard disks is organized into files that are grouped into nested directories (or "folders", to use the Windows term). Hard disks themselves, however, don't have a concept of files and folders: at bottom, a hard disk is just a long sequence of numbered sectors that can each hold a fixed amount of data (usually 512 bytes), and CPU–hard disk interaction uses sector-level operations such as reading or overwriting a specified sector.

An OS can provide the abstraction of files and directories by reserving part of the hard disk for metadata that tell how to interpret the one-dimensional structure of the remainder of the hard disk as a hierarchy of files and directories. In most file systems, the metadata are a list of index nodes or inodes, each of which contains information about one file: for example, the file's creation and last edit times, which disk sectors contain the file contents, and information about permissions, such as which users can read the file. Directories are special files that contain lists of the names and inode numbers of the files and subdirectories that they contain.
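
A drastically simplified C++ sketch of the metadata involved (real file systems such as ext4 or NTFS store considerably more, but the shape is similar):

    #include <cstdint>

    // A stripped-down index node ("inode").
    struct Inode {
        std::uint64_t size_in_bytes;
        std::uint64_t created_time;      // seconds since some agreed-upon epoch
        std::uint64_t modified_time;
        std::uint32_t owner_id;
        std::uint16_t permission_bits;   // who may read, write, or execute the file
        std::uint64_t sectors[12];       // which disk sectors hold the file's contents
    };

    // A directory is a special file whose contents are entries like this:
    // a name paired with the inode number it refers to.
    struct DirectoryEntry {
        char          name[256];
        std::uint64_t inode_number;
    };

    int main() {
        // Nothing to run: the definitions above just sketch the metadata layout.
        return 0;
    }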

File system integrity and security, though, would be threatened if user processes could ask the hard drive to read or modify specific numbered sectors: such capabilities would let a buggy or malicious process corrupt inodes, for example, or read data from files it shouldn't be able to read. So the OS claims exclusive rights to give direct commands to the hard disk: if a user process wants to read or modify a file, it has to ask the OS to issue commands to the hard disk on its behalf, via a system call that checks that the process has permission to access the file it wants to access and that updates the file system metadata alongside the file itself.
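
On a Unix-like system, for example, reading a file looks roughly like the sketch below (the Windows equivalents are the CreateFile and ReadFile calls); each call traps into the kernel, which performs the permission checks and talks to the disk on the process's behalf. The file name is just a convenient example of a world-readable file.

    #include <fcntl.h>    // open
    #include <unistd.h>   // read, close
    #include <cstdio>

    int main() {
        // Ask the OS to open the file; the kernel checks that this process is
        // allowed to read it before handing back a file descriptor.
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0) {
            std::perror("open");
            return 1;
        }

        char buffer[256];
        // Ask the OS to read up to 255 bytes; the kernel finds the right disk
        // sectors via the file's inode and copies the data into our buffer.
        ssize_t n = read(fd, buffer, sizeof buffer - 1);
        if (n > 0) {
            buffer[n] = '\0';
            std::printf("%s", buffer);
        }

        close(fd);   // tell the OS we are done with the file
        return 0;
    }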

File systems are one example of a more general problem: how can a computer run untrusted processes while limiting the damage they can do to each other or to data security? The solution is to build a permission mechanism into the CPU and coordinate it with the OS: CPU instructions that could be dangerous if misused are designated as privileged, and most user processes can only access privileged instructions indirectly.

On the x86 CPU architecture used on most consumer PCs, the CPU can switch between four rings numbered 0 through 3 (though most operating systems only use 0 and 3). When the CPU is in ring 0 or kernel mode, all instructions are enabled, but in ring 3 or user mode, privileged instructions such as direct interaction with most hardware components are disabled. A user-mode process can only enter kernel mode by executing a special instruction that toggles the CPU to kernel mode and runs a specified system call; the system call, in turn, finishes by switching back to user mode. If a process tries to run a privileged instruction in user mode, it triggers a CPU interrupt: an event that switches the CPU to kernel mode and, instead of letting the process continue, jumps to an OS routine called an interrupt handler that (usually) kills the offending process. The same interrupt mechanism is triggered, with (usually) the same results, by other illegal operations such as dereferencing a null pointer. After handling the interrupt, the OS restores the CPU to user mode and transfers control to another process.
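
Here is a minimal sketch of the user-mode case, assuming an x86-64 machine running Linux and a compiler that accepts GCC-style inline assembly; running it kills the process, not the machine:

    #include <cstdio>

    int main() {
        std::printf("about to execute a privileged instruction in user mode\n");
        // HLT (halt the CPU) is a privileged instruction. In user mode the CPU
        // refuses to run it and raises an exception; the OS's interrupt handler
        // then kills this process (on Linux, with the SIGSEGV signal), leaving
        // the rest of the system untouched.
        asm volatile("hlt");
        std::printf("never reached\n");   // the process dies before this line
    }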

What happens if the CPU tries to run an illegal instruction in kernel mode? Kernel mode, after all, allows privileged instructions, but not logically impossible operations such as reading from a nonexistent memory address. In this case, the CPU still triggers an interrupt and jumps to the interrupt handler. But this time, the interrupt handler knows that the illegal instruction came from a process running in kernel mode: that is, either from the OS itself, or from a process with unrestricted hardware access that might as well be part of the OS. And letting a buggy OS keep running could have disastrous consequences such as corrupting system data beyond recovery. So usually, the interrupt handler for an illegal instruction in kernel mode hits the scram button and starts a kernel panic (on Windows, another term is bug check) that shuts the system down posthaste. On Windows, the kernel panic routine creates the infamous blue screen of death: an error message on a blue background.

And the responsibilities of EDR software require it to run in kernel mode. EDR goes far beyond traditional antivirus programs that merely scan files to match them against known viruses; instead, it maintains constant surveillance of everything that is going on in every process on a computer, to the point of being essentially an extension of the OS itself.

CrowdStrike itself has a good explanation of the capabilities of EDR. EDR, according to CrowdStrike, "acts like a DVR on the endpoint, recording relevant activity to catch incidents that evaded prevention." As a result, "CrowdStrike tracks hundreds of different security-related events, such as process creation, drivers loading, registry modifications, disk access, memory access or network connections." Many of these actions, such as accessing a disk or a network, involve privileged CPU instructions that user-mode software can only reach via system call: a program that monitors system calls, therefore, must insert code into the system calls themselves. Tracking processes' memory accesses, similarly, requires CrowdStrike to work beneath operating system protections: all modern computers have hardware and OS mechanisms that check every memory access to ensure that user processes cannot corrupt each other's memory.

The component of CrowdStrike Falcon that caused the outage, for instance, could not have been implemented in user mode: it monitors named pipes, pseudo-files that are stored in RAM rather than on disk and that can send data between processes. If one process writes some data to a named pipe, then other processes can later read this data out, using the same system calls that are used to interact with files. Named pipes on Windows have had a number of security vulnerabilities (for example) that CrowdStrike Falcon is configured to catch. OS protections prevent one user-mode process from directly accessing another process's RAM, so the contents of named pipes must be stored in the section of RAM reserved for the OS itself, and a program that accesses the data in named pipes outside of the ordinary system calls must have kernel-level permissions.
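
Here is a minimal sketch, assuming a Windows machine (the pipe name and buffer sizes are arbitrary, and error handling is pared down), of the point that named pipes are written and read with the same calls used for ordinary files; a monitoring tool like Falcon, of course, has to sit on the kernel side of these calls rather than the user side:

    #include <windows.h>
    #include <cstdio>
    #include <thread>

    int main() {
        const char* name = R"(\\.\pipe\example_pipe)";

        // Server side: create the pipe. Its contents live in kernel memory,
        // not in any file on disk.
        HANDLE server = CreateNamedPipeA(
            name, PIPE_ACCESS_INBOUND,
            PIPE_TYPE_BYTE | PIPE_READMODE_BYTE | PIPE_WAIT,
            1,            // at most one client
            4096, 4096,   // buffer sizes
            0, nullptr);
        if (server == INVALID_HANDLE_VALUE) {
            std::printf("CreateNamedPipe failed: %lu\n", GetLastError());
            return 1;
        }

        // Client side (here, just another thread): open the pipe and write to
        // it with the same calls used for ordinary files.
        std::thread client([name] {
            HANDLE h = CreateFileA(name, GENERIC_WRITE, 0, nullptr,
                                   OPEN_EXISTING, 0, nullptr);
            if (h != INVALID_HANDLE_VALUE) {
                const char msg[] = "hello through the pipe";
                DWORD written = 0;
                WriteFile(h, msg, sizeof msg, &written, nullptr);
                CloseHandle(h);
            }
        });

        // Server waits for the client, then reads what it wrote.
        ConnectNamedPipe(server, nullptr);
        char buffer[128] = {};
        DWORD bytes_read = 0;
        if (ReadFile(server, buffer, sizeof buffer - 1, &bytes_read, nullptr))
            std::printf("received: %s\n", buffer);

        client.join();
        CloseHandle(server);
        return 0;
    }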

In summary, EDR software is essentially an extension of the OS and requires kernel-level permissions. As a consequence, EDR bugs can have the same consequences as OS bugs: a complete system crash or, worse still, compromise of system security. If CrowdStrike Falcon had been compromised not merely by inadvertence but, say, by a disloyal employee who sneaked malicious code into an update, it could have completely subverted OS security protections, for example by exfiltrating the entire contents of customers' hard disks over the Internet to a foreign intelligence service.

IV. The contributions of antitrust law

Microsoft and other OS developers have not been blind to the dangers of buggy EDR software, but Microsoft's ability to mitigate them has been constrained by, of all things, European antitrust regulators. The story begins in 2006, when Microsoft announced that Windows Vista would have a component called PatchGuard, which restricts the ability of third-party programs to modify key kernel data structures; if PatchGuard detects an alteration, it assumes the OS has been compromised and triggers a kernel panic. PatchGuard would limit third-party software vendors' ability to make OS modifications, and Symantec and McAfee complained that it would cripple their cybersecurity products and give special privileges to Microsoft's own offerings: something, as Symantec complained to European Union antitrust regulators, that would let Microsoft leverage its monopoly on operating systems into a monopoly in another field. (Just a few years before that, Microsoft had been involved in protracted antitrust litigation with the United States government, which argued that its bundling of Internet Explorer with Windows was an attempt to monopolize the Internet browser market.)

Ultimately, though Microsoft didn't get rid of PatchGuard wholesale (and more recent versions of Windows still include it), it did partially back down, allowing third-party cybersecurity vendors to install OS extensions or patches that bypassed certain security measures. Microsoft has more recently blamed the agreement it reached with the EU, which, it says, prevents it from protecting Windows to the same extent that Apple has locked down its own operating system. After reading the technical background above, you should be able to appreciate that this excuse is, at the very least, plausible and not an exercise in buck-passing. Microsoft's skepticism of third-party cybersecurity vendors was vindicated only a few years later, when an event with essentially the same causes as the CrowdStrike outage happened in 2010: a defective McAfee update crashed Windows XP PCs around the world.

V. The contributions of regulation

If security experts know that EDR software is an intrinsic security threat, why is it in such widespread use? One answer: because regulations all but demand it. Veterans of many industries can tell stories of regulation and liability triumphing over common sense, and the prevalence of EDR software owes a lot to regulations. United States federal agencies, for instance, are required to use EDR software by Executive Order 14028, issued in 2021. The California state legislature is currently considering AB 749, which would impose similar requirements on state agencies. Similar regulatory pressures exist in many private industries: for instance, the Federal Financial Institutions Examination Council, the interagency body that sets examination standards for United States banks, has a Cybersecurity Assessment Tool that spells out expectations for cybersecurity, including several provisions that require EDR-like monitoring. Though use of the Cybersecurity Assessment Tool is nominally voluntary, federal auditors are increasingly demanding compliance.

Many organizations, meanwhile, are led by people who view cybersecurity as a cost to be minimized so that the organization can get on with its real work, breeding an attitude, which one startup founder called "security by checkbox compliance", that values out-of-the-box solutions. CrowdStrike's marketing and press releases brag about its compliance with FFIEC requirements as well as Executive Order 14028 (the latter of which, it notes proudly, mandates several security features that CrowdStrike itself was the first to bring to market), and the corporation spends hundreds of thousands of dollars per year on federal lobbying. But even organizations willing to build custom cybersecurity platforms may find auditors uncooperative: the path of least resistance is to use what auditors expect to see. As Mark Atwood, a senior software engineer at Amazon, noted on Twitter, "If you are in a regulated industry, you are required to install something like CrowdStrike on all your machines. If you use CrowdStrike, your auditor checks a single line and moves on. If you use anything else, your auditor opens up an expensive new chapter of his book."

VI. What can be done?

Ultimately, EDR software is a valiant but doomed attempt at a basically impossible task, securing general-purpose operating systems against any program that might run on them, and it opens up new security holes even as it closes old ones. Almost every computer in the world runs either on Windows, or on a descendant or reimplementation of the Unix operating system (including macOS, which has been a Unix variant under the hood ever since OS X was released in 2001). These are phenomenally complicated programs with millions of lines of code that could never be completely bug-free: as of this writing, the bug tracker for the Linux kernel, an open-source Unix reimplementation that is one of the most heavily scrutinized pieces of software in the world, lists 33 bugs reported in the last week.

Much academic research on secure operating systems has focused instead on microkernels: operating systems with radically reduced scope that turn most components of traditional "monolithic" operating systems, such as the file system, into user-mode processes. Microkernels can be much more secure because when only a few thousand lines of code run in kernel mode, rather than several million, those lines can be checked much more carefully. Other components of a traditional kernel can run as services with restricted permissions and, correspondingly, a smaller blast radius when things go wrong. One team of researchers estimated in 2018 that a microkernel design would either eliminate or substantially mitigate the majority of a sample of bugs in the Linux kernel. One particularly interesting project is the seL4 microkernel, which is small enough to have a complete, line-by-line formal proof of several security guarantees, written in a computer-checkable proof language. Microkernels can have some performance costs, as they turn a single system call on a monolithic kernel into an interaction between multiple processes, though this is hardly relevant for most corporate IT programs, which do little beyond database lookups and minimal processing of user input: the CPU on most consumer computers spends most of its time waiting on I/O requests and seldom approaches full utilization.

The bigger problem with microkernels is that adopting a new operating system means rewriting most of the software that sits on top of it. The seL4 security model, for instance, differs so drastically from that of Linux and Windows that running a typical corporate IT program on it would require rewriting not just the program itself but large chunks of the vast body of underlying shared software, such as networking and cryptography libraries. Such a task would be far beyond a typical corporate IT contractor.

Making existing microkernels into user-friendly complete operating systems would be a worthwhile project, but it's one that few organizations by themselves would have the resources to carry out; funding from a government agency or industry consortium would likely be necessary. In the meantime, though, a few policies could encourage safer computing and a shift away from intrinsically dangerous third-party security software; these could be accomplished by updates to government cybersecurity auditing rubrics such as the FFIEC Cybersecurity Assessment Tool, or by state and federal executive orders regarding executive branch procurement.

  1. Encourage all software, especially security software that requires elevated privileges, to be written in memory-safe languages such as Rust. A report issued in February 2024 by the Office of the National Cyber Director included a recommendation for precisely this.
  2. Encourage corporate IT systems to switch from Windows to Unix-like operating systems, which, though they also use a difficult-to-secure monolithic design, at least have a justified reputation for being more secure than Windows. Ease of use for nontechnical users, historically one of the biggest reasons for choosing Windows, is much less of a concern now than it was 20 years ago: several distributions of Linux are just as easy to use as Windows today. At the very least, auditing rubrics such as the FFIEC CAT should no longer designate open-source software an intrinsic security risk. If this prejudice was ever justified, it certainly no longer is: the largest open-source projects now form load-bearing portions of the infrastructure of the most sophisticated technology firms in the world, which have every incentive to discover and fix bugs.
  3. Increase the weighting of blunter but more foolproof methods of securing networks: air-gapping important systems (that is, keeping them disconnected from the Internet, with all data transfers done manually by USB device or similar) and, on remote devices that must have Internet access, configuring firewalls to block all access except to internal websites and databases.
  4. Require cybersecurity software to allow users to disable automatic configuration file updates, or at least delay them by a few days. For cases when a potentially disastrous vulnerability is discovered, open-source projects have developed a good mechanism for getting urgent updates adopted: announce the existence of a bug without giving details that would help attackers exploit it, and give a time when a fix can be expected.

These conclusions are individually unsatisfactory half-measures, and I'm sorry I couldn't round off the essay with something more inspiring. But enough half-measures can add up to, if not quite a whole measure, then at least something close enough for most practical purposes. And the more important takeaways go beyond any technical specifics: organizations with complicated IT systems must learn that the buck for cybersecurity incidents ultimately stops with them, and regulators should examine their policies' role in creating a security monoculture and realize that replacing one security problem with a worse problem does not count as solving it.


Comments welcome: connorh94 at-sign gmail dot com or Twitter DMs @cmhrrs.