In a root cause analysis posted Aug. 6, CrowdStrike said the massive outage last month, which caused delays at airports worldwide and derailed hundreds of businesses, was triggered by an out-of-bounds memory read past the end of an input data array.
In plain English, security pros say the outage, which led to system crashes on numerous Windows devices, came down to a faulty content update delivered to the sensor in CrowdStrike’s Falcon software.
“Essentially, a coding mistake slipped through the testing process and caused widespread disruption,” said Sarah Jones, cyber threat analyst at Critical Start.
An out-of-bounds memory error is not a rare programming mistake; it is the kind of error that should be caught in QA, and it can cause failures when it is not, explained John Gallagher, vice president of Viakoo Labs.
“The key issue here remains that this was a kernel-level interaction and therefore caused an overall system crash,” said Gallagher. “If this was contained to the user space it would cause an application crash, with limited damage to the user. Any vendor requiring kernel-level access should be either held to a much higher QA standard, or find ways to perform their function without it.”
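To make Gallagher’s point concrete, the snippet below shows the general class of defect in isolation; it is an illustrative sketch with invented sizes, not CrowdStrike’s code. Built with a bounds-checked accessor or an address sanitizer, the same mistake gets flagged during testing rather than surfacing as a crash in the field.

```cpp
#include <array>
#include <cstdio>
#include <stdexcept>

int main() {
    std::array<int, 20> inputs{};   // only 20 values are actually supplied

    // Raw indexing past the end, e.g. inputs[20], is undefined behavior: it
    // may appear to work, return garbage, or crash. Compiled with
    // -fsanitize=address, the stray read is reported immediately in QA.

    // A bounds-checked accessor turns the same mistake into a testable error.
    try {
        int value = inputs.at(20);  // requesting the 21st element of a 20-element array
        std::printf("%d\n", value);
    } catch (const std::out_of_range& e) {
        std::printf("caught in testing: %s\n", e.what());
    }
    return 0;
}
```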
According to the root cause analysis, Rapid Response Content (behavioral heuristics) is delivered through Channel Files and interpreted by the sensor’s Content Interpreter, which uses a regular-expression-based engine. Each Rapid Response Content channel file is associated with a specific Template Type built into a sensor release. The Template Type provides the Content Interpreter with activity data and graph context that is matched against the Rapid Response Content.
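As a rough mental model only (the type and function names below are invented, not CrowdStrike’s), a Template Type can be pictured as a fixed list of named fields, and the Content Interpreter as code that matches one regex pattern per field, shipped in the channel file, against the activity value the sensor observed for that field.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch: the Template Type defines which fields exist; the
// channel file supplies one regex pattern per field; the interpreter matches
// each pattern against the corresponding activity value from the sensor.
struct TemplateType {
    std::vector<std::string> field_names;   // fields built into the sensor release
};

struct RapidResponseContent {
    std::vector<std::string> patterns;      // one regex per field, delivered via channel file
};

bool matches(const TemplateType& type,
             const RapidResponseContent& content,
             const std::vector<std::string>& activity_values) {
    for (std::size_t i = 0; i < type.field_names.size(); ++i) {
        // Assumes one supplied activity value per defined field -- the
        // assumption that broke down with Channel File 291.
        if (!std::regex_match(activity_values[i], std::regex(content.patterns[i]))) {
            return false;
        }
    }
    return true;
}
```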
CrowdStrike said the new interprocess communication (IPC) Template Type for Channel File 291 defined 21 input parameter fields, but the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against.
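In terms of that sketch, the defect is a simple count mismatch: the template declares 21 fields, the caller hands over 20 values, and iterating over all 21 walks one slot past the end of the supplied data. The code below is illustrative only; the field counts mirror the root cause analysis, the names do not.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only -- not CrowdStrike's code. Running this is itself
// undefined behavior, which is exactly the defect being described.
void evaluate_template_instance(const std::vector<std::string>& supplied_values) {
    const std::size_t kTemplateFieldCount = 21;   // fields defined by the IPC Template Type

    for (std::size_t i = 0; i < kTemplateFieldCount; ++i) {
        // With only 20 supplied values, i == 20 reads the 21st element of a
        // 20-element vector: an out-of-bounds read. In a kernel-mode driver,
        // that read can take down the whole system.
        const std::string& value = supplied_values[i];
        (void)value;   // ... would be matched against pattern i here ...
    }
}

int main() {
    evaluate_template_instance(std::vector<std::string>(20));  // only 20 inputs supplied
}
```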
So a major cause of the outage, said CrowdStrike, was the mismatch between the 21 inputs validated by the Content Validator and the 20 provided to the Content Interpreter, which exposed the latent out-of-bounds read in the Content Interpreter. CrowdStrike acknowledged that another contributing factor was the lack of a specific test for “non-wildcard matching criteria” in the 21st field.
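The wildcard detail explains why the bug stayed latent: a match-anything criterion in the 21st field can be satisfied without ever touching the 21st input value, so the missing value was never read until content arrived with a concrete, non-wildcard criterion in that field. Below is a hedged sketch of that short-circuit, again with invented names.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch of why a wildcard in the 21st field masked the bug:
// a match-anything pattern can be short-circuited without reading the
// corresponding input value, so the out-of-bounds slot is never touched.
bool field_matches(const std::vector<std::string>& supplied_values,
                   const std::vector<std::string>& patterns,
                   std::size_t i) {
    if (patterns[i] == ".*") {
        return true;   // wildcard: the input value is never accessed
    }
    // A non-wildcard criterion forces the access -- with only 20 values
    // supplied, i == 20 is the out-of-bounds read that triggered the crashes.
    return std::regex_match(supplied_values[i], std::regex(patterns[i]));
}
```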
“While this scenario with Channel File 291 is now incapable of recurring, it also informs process improvements and mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience,” the root cause analysis states.
Critical Start’s Jones added that moving forward, it's reasonable to expect that CrowdStrike will implement more rigorous testing procedures. Jones said major incidents like this often serve as catalysts for significant improvements in quality assurance processes. Additionally, CrowdStrike likely has invested in or will invest in advanced software development methodologies and tools to identify potential issues earlier in the development cycle.
“However, it's important to note that completely eliminating errors from complex software systems is nearly impossible,” said Jones. “The goal is to minimize their occurrence and impact through robust testing and incident response plans.”
Nick France, chief technology officer at Sectigo, said the scale and severity of the error were compounded by the level of “trust” conferred on CrowdStrike’s software.
“This wasn’t some software utility you’d just run as a user like Excel or a game, but a system-level software kernel driver, which means that it has escalated privileges and as such, crashes and bugs have a more severe impact,” said France.
France added that while bugs are inevitable in any software, even privileged software, it’s critical to have solid development and deployment processes.
“QA, testing, partial rollouts, rollback ability – all these things could have helped reduce the impact if they were used, or better used, if they existed in some form,” said France.