19/07/2024
Whoops. Thousands of Windows-based computers were rendered inoperable today by an update to CrowdStrike's Falcon Sensor software. This is a cybersecurity tool that works, as they all do, by installing an agent on each computer subscribed to the service, in order to monitor the internal operations of that computer.
Organisations that have subscribed to the CrowdStrike service will see their subscribed computers crash as soon as the update is installed, and each one will require manual intervention to rectify the problem. Automated tools for doing this only work if the computer starts up normally, and the CrowdStrike issue prevents this from happening. So each broken desktop, laptop, and server will need personal attention to remove the problematic update!
The fact that it affects servers is why the issue has been so widespread, and why apparently unconnected internet services are all offline: their services run on Windows servers, either on their own premises or in cloud-based data centers, that have the CrowdStrike agent installed on them. Cloud-based servers are easy to fix remotely, but there are likely to be dozens, if not hundreds, running for each service. On-premises servers can be easy to fix, if the remote access tools are sufficiently useful, but ultimately may require someone to traipse into a data center with a keyboard, monitor and mouse, and fix each one by hand.
Currently, most University services are unaffected, except, ironically, a couple of cloud-based services that IT Services use to provide their Helpdesk service.
At the moment, I'm leaning towards this being more Cockup than Conspiracy, as causing this amount of trouble so quickly requires a fat-fingered programmer who is permitted access to the update process. That CrowdStrike were so quickly able to identify the problem tells me they know pretty accurately where the problem was introduced, and in which particular bit of code.
This does, however, raise very uncomfortable questions about how such a calamitous piece of code got into the product. There should have been multiple stages of review and testing to catch exactly this type of snafu, and these seem to have been ineffective. There's also the current industry Agile development fad to contend with ("Move fast and break things"), which seems to prioritise a speedy release schedule over code quality.
There is the possibility that this was something nefarious: a supply-chain attack, where black hats infiltrate a product's programming process to introduce bad code, to attack either the infiltrated product directly or (as we saw with SolarWinds a few years ago) the products or systems built with the surreptitiously compromised tools. But given the speed of the incident, and the response by the vendor, I see this as less likely than a coding or process cockup.
The really disturbing thing is that even though it's less likely to have been a malware event, it has now given the producers of malware more ideas about how to create havoc on demand. The more I think about it the more I'm at risk of concussion from facepalming so hard. Pushing to Production on a Friday! FFS!
As usual, The Register has a good roundup of the issues here, https://www.theregister.com/2024/07/19/crowdstrike_falcon_sensor_bsod_incident/?td=rt-3b
And there's the obligatory hilarious Reddit thread here - https://www.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_error_in_latest_crowdstrike_update/?share_id=igx-cOyqO-EV4VasDEn-Z&utm_name=androidcss