CrowdStrike Outage Puts Heightened Focus on New Release Testing and QA
By Scott Shinn
The recent CrowdStrike-caused blue screen of death (BSOD) outage on Microsoft Windows systems has put renewed focus on an ordinary but critical process: extensive testing before rolling out software widely.
What happened in the CrowdStrike outage?
A bug in a CrowdStrike update resulted in a global tech disruption for much of the airline industry as well as businesses, banks and federal agencies running CrowdStrike Falcon on Windows systems. Travel was disrupted and flights delayed; businesses, hospitals and doctors’ offices were affected, delaying appointments and treatment, including this author’s. The impact was felt Thursday, July 18, through Saturday, July 20, 2024, and has continued to hobble some travel plans up to the time of this article’s publication. The outage’s final outrage: there have also been reports and warnings of targeted phishing schemes offering “support” to those affected in the wake of the outage.
After the backdoor attack on SolarWinds and its customers, the integrity of the global software supply chain has been a growing concern, and it was a strong driver behind a 2021 U.S. Executive Order pressuring government organizations, infrastructure providers, and other enterprises managing sensitive data or industrial control systems to monitor what’s in the software they consume. A related, emerging software security audit practice, the software bill of materials (SBOM), enables organizations to inventory the software components they are using and to more holistically and rapidly extend file and system integrity monitoring and malware and vulnerability detection to all endpoints.
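As a rough illustration of that inventory idea (a generic sketch, not a specific product feature), a short script could walk a CycloneDX-format SBOM, here a hypothetical file named sbom.json, and list each component and version so the inventory can be cross-checked against vulnerability advisories or an internal allow list:

```python
import json

# Minimal sketch: read a CycloneDX-style SBOM (hypothetical file name "sbom.json")
# and print each component's name and version so it can be checked against
# vulnerability advisories or an internal allow list.
def list_components(sbom_path: str) -> list[tuple[str, str]]:
    with open(sbom_path) as f:
        sbom = json.load(f)
    components = []
    for comp in sbom.get("components", []):
        components.append((comp.get("name", "unknown"), comp.get("version", "unknown")))
    return components

if __name__ == "__main__":
    for name, version in list_components("sbom.json"):
        print(f"{name} {version}")
```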
But the CrowdStrike outage was a fairly simple mistake, not an attack or an intentional malicious insertion of a backdoor into the supply chain. Still, the defect in the release was bad enough that simply pushing out a corrected update wasn’t a complete fix for affected Microsoft Windows systems: engineers had to physically shut down each machine, boot into safe mode, manually remove a set of files, and bring the system back up. It wasn’t as if the vendor could turn the security software off and back on again and everything would work.
I was at a doctor’s office at a local hospital, and every machine, from the waiting room and reception desk back to the MRI machine, was knocked out. It turned what should have been a one- or two-hour appointment into an eight-hour day for me.
The scope of the event wasn’t like the edge cases that sometimes surface in a very specific configuration, a particular version using a particular feature. Instead, this was every version of Windows running CrowdStrike’s Falcon sensor software, unless you were fortunate enough not to have taken the update yet.
I’m a little shocked that something with this big an impact was missed in testing. Running the update through an internal test lab or test environment pipeline should have caught the flaw, and there are plenty of tools that can automate that process to confirm the software is going to behave before it ships.
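As a sketch of what such an automated gate might look like (the host names, install command, health check, and package name below are placeholders, not CrowdStrike’s or any vendor’s actual tooling), a pipeline could install a candidate release on a small canary group, verify that each machine comes back healthy, and only then allow promotion to the wider fleet:

```python
import subprocess

# Hypothetical canary gate: host names, install command, and health check
# are placeholders, not any vendor's real interfaces.
CANARY_HOSTS = ["test-vm-01", "test-vm-02"]

def install_update(host: str, package: str) -> bool:
    """Push the candidate release to one canary host (placeholder command)."""
    result = subprocess.run(
        ["ssh", host, "install-update", package],
        capture_output=True,
    )
    return result.returncode == 0

def host_is_healthy(host: str) -> bool:
    """Confirm the host still boots and responds after the update (placeholder check)."""
    result = subprocess.run(
        ["ssh", host, "systemctl", "is-system-running"],
        capture_output=True,
    )
    return result.returncode == 0

def canary_gate(package: str) -> bool:
    """Return True only if every canary host takes the update and stays healthy."""
    for host in CANARY_HOSTS:
        if not install_update(host, package) or not host_is_healthy(host):
            print(f"Canary failed on {host}; holding the release.")
            return False
    print("All canaries healthy; release can be promoted to the wider fleet.")
    return True

if __name__ == "__main__":
    canary_gate("candidate-update.pkg")
```

The point isn’t the specific commands; it’s that the release only moves forward when a machine that actually took the update proves it still works.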
Focus on Fundamentals: Release Testing and Quality Assurance
This wasn’t the first time a software defect or BSOD has caused an outage and disruption, and it won’t be the last.
In general, this kind of damaging defect could have been prevented through more careful pre-release testing by the security vendor. Defective, disruptive releases won’t be a thing of the past anytime soon, and the event should serve as a wake-up call to strengthen QA testing and mitigate supply chain, customer, and consumer disruption. And if QA testing slows your software releases down a bit, remember that it can be what saves you and your customers from a lot of discomfort and disruption.
Organizations that had an internal test lab or internal test process are in many cases saying, “Yeah, this didn’t affect us.” It didn’t affect them because they checked for bugs themselves before deploying. If you’re trying to justify your funding to management, this is the time. And if you tested the CrowdStrike release and decided not to push the defective new version, kudos to you. That absolutely justified your existence, especially if you’re a QA tester.
For everyone else, I hope you use this event to your advantage and make a strong case for your own internal QA processes. It’s the right answer to have ready for management, because this won’t be the last time a defect gets released into the wild. You need a QA testing process of your own to validate everything that goes into your shop.
Learn more about Atomic OSSEC endpoint detection and response or schedule a demonstration.
Learn more about Atomicorp vulnerability detection.
Check out Atomicorp’s leading file integrity monitoring (FIM) for pinpointing what changed in your environment and how.