On Friday, July 19, 2024, tens of thousands of workers in hospitals, airlines, banks and other industries around the world stared at the “Blue Screen of Death” (BSOD) while trying to complete essential daily tasks on their Windows computers.
The widespread outage affected Windows systems running CrowdStrike's Falcon Sensor.
Let’s explore what happened and the lessons we can learn from the incident.
CrowdStrike and Falcon Sensor
CrowdStrike is a US-headquartered cybersecurity technology company. It offers a suite of cybersecurity software products, used across dozens of industries, including airlines, hospitals, banks, and retailers, to prevent hacks and data breaches.
Falcon Sensor is one of CrowdStrike's software products. It protects systems from cyber-attacks by monitoring computers for signs of malicious activity and helping to lock down threats.
What Happened on July 19
CrowdStrike pushes updates to Falcon Sensor automatically and silently, every Friday. In an update on July 24, the firm revealed the incident was caused by a Rapid Response Content update containing an undetected error. The update crashed machines running the Microsoft Windows operating system and caused worldwide chaos.
The outages were not the result of a security incident or cyber-attack.
The incident had significant real-world impacts. Flights were canceled, broadcasters went off air, trains didn’t run and medical procedures were delayed worldwide. Frustrated workers were confronted with blue computer screens, with no available workaround or solution to get back online. Customers and consumers were left hanging and stranded.
At 5:29 EST on July 20, CrowdStrike put out a statement saying "the issue has been identified, isolated and a fix has been deployed." Organizations using Falcon Sensor were urged to manually deploy the fix to get their service back online.
Learnings from the CrowdStrike Incident
Software Developers
- Testing: Code can have flaws. This is precisely why the testing phase is vital. Unit testing, automated testing, and regression testing are non-negotiable parts of the software development lifecycle (SDLC). This is not even a secure development problem, but SDLC 101. While we wait for more information from CrowdStrike on the root cause, it is time to go back and review testing plans, procedures and environments.
- Automated silent full updates: Cybersecurity organizations are fighting a daily battle with cybercrime. They look at pushing out updates as soon as possible to consumers of their services, so that the software is always protected against the latest threats. However, the flip side to that is the risk of outages. One option is to look into a balanced update policy, which considers staggered updates.
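To make the idea of staggered updates concrete, here is a minimal Python sketch of a ring-by-ring rollout that halts when too many hosts in a ring fail. It is an illustration only and does not describe how CrowdStrike or any other vendor actually ships updates. The names `staged_rollout`, `deploy` and `max_failure_rate` are hypothetical, and `deploy` stands in for whatever mechanism installs the update and reports whether the host is still healthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> dict:
    """Deploy ring by ring and halt once a ring exceeds the failure threshold.

    `rings` is ordered from the smallest (canary) group to the largest.
    `deploy` is a hypothetical callback: it updates one host and returns
    True if the host is healthy afterwards.
    """
    updated = []
    failed = []
    for index, ring in enumerate(rings):
        # Call deploy exactly once per host and remember which ones failed.
        ring_failures = [host for host in ring if not deploy(host)]
        failed.extend(ring_failures)
        updated.extend(host for host in ring if host not in ring_failures)
        # Stop before touching later rings if this ring looks unhealthy.
        if ring and len(ring_failures) / len(ring) > max_failure_rate:
            return {"halted_at_ring": index, "updated": updated, "failed": failed}
    return {"halted_at_ring": None, "updated": updated, "failed": failed}
```

With a small canary ring first, a faulty update would stop after a handful of machines instead of reaching every host at once. The trade-off is that later rings receive protection against new threats somewhat later.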
Software End Users
- Kernel-level access: The reason Falcon Sensor could take down the entire Windows system, along with all other non-CrowdStrike software on it, is that Falcon Sensor has access to the system’s kernel. While cybersecurity vendors may tell you that kernel access is essential to protect the system, allowing third-party software access to a system's kernel is essentially surrendering all control. Exploring non-invasive cybersecurity options should be part of the process before deciding on a cybersecurity vendor.
- Testing: Consumers of updates should also be carrying out testing before rolling out updates to their systems.
- Staggered updates: Consumers of software updates should have a staggered roll-out plan, which can help limit the damage.
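On the consumer side, one simple way to build a staggered roll-out plan is to assign every machine to a fixed update ring. The sketch below is a hypothetical illustration, not a feature of any particular product: the function name `assign_ring` and the example ring shares are assumptions. It hashes a host identifier so that the same machine always lands in the same ring.

```python
import hashlib


def assign_ring(host_id: str, ring_shares: list[float]) -> int:
    """Map a host to a ring index deterministically from a hash of its ID.

    `ring_shares` lists the fraction of the fleet in each ring, for example
    [0.05, 0.25, 0.70] for a pilot group, an early group and everyone else.
    """
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for index, share in enumerate(ring_shares):
        cumulative += share
        if position < cumulative:
            return index
    # Guard against shares that sum to slightly less than 1.
    return len(ring_shares) - 1
```

An organization could update ring 0 first, test there, and only then move on to the larger rings, which combines the testing and staggering points above.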
The latest CrowdStrike update extensively notes the types of testing the company performs on its software. However, there was clearly a flaw in the testing plan, as two of the newly developed template files were allowed to be packaged and deployed on the assumption that all would be OK, given that everything else had previously been tested. CrowdStrike, and the rest of the world, learned that it was not.
While the CrowdStrike CEO was emphatic about not calling this a cyber issue, and while it is primarily a development and SDLC-process issue, we often forget that the basic tenets of security (confidentiality, integrity and availability) include the “A” for availability. Any availability issue, then, becomes a cyber issue, especially when “availability” is at risk because of a cyber update.
It is vital that both software developers and end users learn lessons from this incident, to prevent such a widespread outage from occurring in the future.