What Should Apple Users Take Away from the CrowdStrike Debacle?
In the New York Times coverage of the CrowdStrike update bug that wreaked havoc starting last Friday, there’s a lovely deadpan line eleven paragraphs in:
Apple and Linux machines were not affected by the CrowdStrike software update.
Even while sympathizing with those directly and indirectly affected, it’s hard not to be a little smug. The larger question is, could a similar kind of problem affect Macs? That would be bad for us Mac users but less so for the world, given that Macs are used in fewer mission-critical situations than Windows-based PCs. Macs may not even be as relied upon as iPads for vertical market tasks like point-of-sale applications, medical record tracking, and education management. What about iPhones? I have less of a sense of how mission-critical they are to businesses and other organizations, but there are certainly millions of individuals whose lives would be upended if their iPhones were suddenly bricked. They would have trouble communicating with others, making purchases, navigating to unfamiliar destinations, taking public transit, and much more.
At The Eclectic Light Company blog, Howard Oakley examines the possibility of Macs being affected by something similar. He concludes that such a failure is increasingly unlikely overall and is no longer a significant risk for Apple silicon Macs. On Windows, CrowdStrike’s Falcon sensor code almost certainly runs as a kernel-mode driver with elevated privileges, which is why its bug can prevent a PC from booting successfully. On the Mac, the equivalent approach would require a kernel extension (kext), but Apple deprecated kexts starting in macOS 10.15 Catalina in 2019, pushing developers to use System Extensions instead. Kernel extensions can run on Apple silicon Macs only if the user drops system security to Reduced Security and explicitly allows third-party kexts to load. Don’t do that unless you have a really good reason.
In fact, the Mac version of CrowdStrike’s Falcon sensor reportedly used a kext on Intel-based Macs prior to macOS 11 Big Sur but has since switched to an EndpointSecurity System Extension. System Extensions run in user space rather than in the kernel, so even if one suffered from a critical bug, it shouldn’t be able to cause a kernel panic.
What about iOS and iPadOS? They’re even more secure than macOS because they have never allowed kernel extensions and don’t support anything like macOS’s System Extensions. All iOS and iPadOS apps are sandboxed, so they can’t affect the system or any other app. That’s not to say that iOS and iPadOS are perfectly secure or reliable, but they certainly rank highly among consumer-grade operating systems.
Apple devices may not be as vulnerable to a bug in an update to third-party software like CrowdStrike, but that doesn’t mean we can be complacent. Apple itself regularly releases updates, and while it’s essential to install them to patch security vulnerabilities, Apple’s engineers could make a mistake that would cause problems for millions. Howard Oakley’s article reminded me of when an Apple update inadvertently disabled Ethernet (see “El Capitan System Integrity Protection Update Breaks Ethernet,” 29 February 2016). Apple quickly addressed the problem, but the lack of Ethernet prevented some Macs from getting the revised update, requiring manual intervention.
What could happen to reduce the chances of an outage like this happening again?
- Punish CrowdStrike? It’s tempting to say that CrowdStrike should somehow be held liable for the potential costs, estimated to exceed $1 billion. However, in most cases, CrowdStrike’s standard terms limit the company’s liability to a refund for fees paid. There may still be shareholder lawsuits—CrowdStrike’s stock fell nearly 20%—and SEC scrutiny. Overall, it was a terrible, horrible, no good, very bad day for CrowdStrike, but it almost certainly doesn’t mean the end of the company. Other firms will probably be more careful for a while, but if the mistake doesn’t prove hugely expensive for CrowdStrike, everyone may return to old bad habits.
- Write better code? The easy answer is that the team in charge of developing the update should never have made the mistake in the first place. CrowdStrike’s environment and policies are unknown, but there are programming practices that reduce the likelihood of such errors. In an ideal world, more attention would be paid to code quality, but it can be difficult for management to prioritize code quality over shipping more quickly.
- Do better testing? Even if we give CrowdStrike the benefit of the doubt and say that the bug was a subtle mistake that could have slipped by any developer, I can’t see any excuse for why it wasn’t caught in testing. Either CrowdStrike wasn’t doing real-world testing—the company constantly releases patches like this—or someone messed up big time. As with writing better code, better testing is something everyone can agree should happen, but test teams may not be given the time or resources they need to do a good job.
- Use a staged rollout? Companies that release updates to numerous users don’t usually do so all at once. Instead, they release to small groups before expanding to the entire user base. That way, even if a bug has been introduced in development and slipped through testing, it won’t affect too many people before being discovered. Either CrowdStrike didn’t do this, or the problem affected only a subset of CrowdStrike users, so it could have been much worse. The only reason I can see why a company wouldn’t use staged rollouts is if it was patching a zero-day security vulnerability and felt that it was crucial to distribute to everyone as quickly as possible.
- Switch to Macs, iPads, and iPhones? It’s good to have an active fantasy life. The kinds of inexpensive workhorse computers and servers affected by the CrowdStrike bug are exactly what Apple isn’t interested in building.
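The staged rollout idea from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how CrowdStrike or any other vendor actually ships updates: the function name, the stage sizes (1%, 10%, 50%, 100%), the 2% failure threshold, and the health-check callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    updated: int   # machines that received the update
    failed: int    # machines that failed the health check after updating
    halted: bool   # True if the rollout stopped before reaching everyone


def staged_rollout(
    machines: Sequence[str],
    is_healthy_after_update: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # hypothetical stages
    max_failure_rate: float = 0.02,                              # hypothetical threshold
) -> RolloutResult:
    """Update machines in growing stages, halting if a stage fails too often."""
    updated = 0
    failed = 0
    for fraction in stage_fractions:
        # Cumulative number of machines that should be updated by the end of this stage.
        target = min(len(machines), max(1, round(len(machines) * fraction)))
        batch = machines[updated:target]
        if not batch:
            continue
        # Stand-in for "push the update, then check whether the machine still works."
        batch_failures = sum(1 for m in batch if not is_healthy_after_update(m))
        updated += len(batch)
        failed += batch_failures
        if batch_failures / len(batch) > max_failure_rate:
            # Stop here so the remaining machines never receive the bad update.
            return RolloutResult(updated, failed, halted=True)
    return RolloutResult(updated, failed, halted=False)


if __name__ == "__main__":
    fleet = [f"pc-{i}" for i in range(1000)]
    # Simulate an update that breaks every machine it touches.
    print(staged_rollout(fleet, lambda machine: False))
```

With a simulated fleet of 1,000 machines and an update that fails everywhere, the rollout halts after the first stage, so only 10 machines are affected rather than all 1,000. The trade-off is the one noted above: each stage adds delay, which matters when the update patches a vulnerability that is already being exploited.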
Plenty of other lessons could be taken away from the CrowdStrike debacle, but I worry that it will fall out of the headlines too soon for other companies to learn from CrowdStrike’s mistakes.
CrowdStrike better hope that its lawyers wrote the clause protecting it really, really carefully, because otherwise it’s gonna get sued into oblivion.
The reason, in my opinion, that software quality is so poor is that we the consumers allow vendors to disclaim all responsibility for anything that goes wrong, including liability for any resulting harms. The software doesn’t even have to be “fit for the purpose” for which it is sold!
For example, I quote from CrowdStrike’s Terms and Conditions:
Can you think of any other product that is allowed to say that it isn’t even guaranteed to perform its own purpose? For example, I’d be pretty upset if I bought a car that doesn’t drive. Or if I bought a car seat and it caused the car’s wheels to come off.
But language like the above is standard in the software industry. macOS probably says the same thing.
The lesson the world should take away is that it is time to disallow such exclusions.
Not a great example, given how many used cars are sold with an “as is” warranty…which means, that yes, you could buy a car that doesn’t drive.
(And don’t get me started on real estate, which has two entire industries — home inspectors and title search companies — whose entire reason for existence is to prevent customers from buying a house that will fall down from a person who doesn’t own the house).
It is, as said in the article, somewhat unrealistic to imagine that Macs could be utilised in place of Windows boxes, but surely there is another alternative? That would be for a version of Unix/Linux to be developed which is designed to be much more robust than Windows, which is, let’s face it, a decades-old general-purpose operating system which is somewhat archaic in its inner structures.
It is pretty crazy that Windows, with all its architectural shortcomings, is employed for such mission-critical computer infrastructure across the western world.
I’m amazed that things were NOT worse. If it’s true that “the world runs on Windows,” and CrowdStrike is the 900-pound gorilla in its space, why didn’t the electric grid suffer, or the Social Security system go down, or the Postal Service? I asked at my local post office, and I was told they don’t use Windows, but rather an ancient OS. If CrowdStrike (the company) had to offer merchantability guarantees, that might be in exchange for staged rollouts, rather than middle-of-the-night installations when many moderate-sized enterprises might not have anyone around to see that all the computers were entering doom-loop reboots. Or CrowdStrike could mandate that all customers have some “first tier machines” that were updated before wholesale installation on all of a company’s machines.
But what really amazed me was what DIDN’T break: power grids, gas utilities, water utilities, government operations, etc. The “why” in THAT bears careful investigation as a clue to how to make the next iteration less malignant rather than much worse.
Clicking on the section: “the user drops system security to Reduced Security”.
What would be the reasoning or applicable situations where one would do that?
Anybody have an answer to that? Thanks
In my era, a large company’s IT department’s major responsibility was testing new vendor software (or updates to existing software) thoroughly before implementing it live company-wide. It’s easy to blame the vendor and say they wrote the code and bug(s) or blame Microsoft for offering kernel access but, as this event shows, the end-user company must be aware of its own computer vulnerabilities (i.e., we’re running Windows and updates could take our machines down to BSOD) and take all appropriate measures to prevent a catastrophe. Updates to critical software packages (say SAP and similar) were viewed skeptically until they proved themselves safe in a fenced playground.
In defence of CrowdStrike, the company has to balance putting out a security update immediately to meet a new and lethal threat or holding back the update until exhaustively tested with the chance of the threat running amok.
With security threats impacting on global computer and communications systems every second, it would appear that CrowdStrike does a pretty good job in countering these security threats. The better approach might be for companies and organisations to spend money on building in redundancy into their systems.
Not the same, as CrowdStrike is the software OEM equivalent of GM, Ford, etc., while used car dealers are NOT the OEM of what they sell.
I think it’s entirely comparable.
Microsoft should explain why Windows allows kernel-mode drivers.
The travel, aviation, and medical sectors have been most visibly affected by the CrowdStrike/Microsoft mishaps. Millions of travelers, aircrews, and support personnel were stranded (many are still stuck in airports). There are mountains of mishandled luggage in multiple airports from Auckland, New Zealand, the long way around the world to Hawaii. It will take days to get the tightly wound systems back to normal.
The US Department of Transportation is holding airlines to their rigid compensation policies (same in the EU) for passengers’ flight cancellation and delay expenses. This will end up costing millions.
In the UK, thousands of medical appointments and operations were delayed or had to be rescheduled.
Some regulation is necessary to avoid a recurrence. If the industry does not propose reform on its own, it will be forced upon CrowdStrike and Microsoft. And yes, it is time to revise the disclaimers for software products to be in line with other forms of commerce.
It’s not only crazy, but downright dangerous for the world, AFAIAC. I’m talking about worldwide airlines that run Windows. They carry thousands of people every day. And our own military, which itself has had numerous failures because of its dependency on Windows. Maybe those failures so far haven’t caused a catastrophic event, but who is to say what will happen in the future? To me, saying that Macs would not be an infinitely better choice for mission-critical work is folly, and feeds into the falsehood that Macs are inferior.
The entire Social Security system did not go down but parts of it were definitely affected, including some of their website features.
