Adam Engst 22 July 2024 51 comments

What Should Apple Users Take Away from the CrowdStrike Debacle?

In the New York Times coverage of the CrowdStrike update bug that wreaked havoc starting last Friday, there’s a lovely deadpan line eleven paragraphs in:

Apple and Linux machines were not affected by the CrowdStrike software update.

Even while sympathizing with those directly and indirectly affected, it’s hard not to be a little smug. The larger question is, could a similar kind of problem affect Macs? That would be bad for us Mac users but less so for the world, given that Macs are used in fewer mission-critical situations than Windows-based PCs. Macs may not even be as relied upon as iPads for vertical market tasks like point-of-sale applications, medical record tracking, and education management. What about iPhones? I have less of a sense of how mission-critical they are to businesses and other organizations, but there are certainly millions of individuals whose lives would be upended if their iPhones were suddenly bricked. They would have trouble communicating with others, making purchases, navigating to unfamiliar destinations, taking public transit, and much more.

At The Eclectic Light Company blog, Howard Oakley examines the possibility of Macs being affected by something similar. He concludes that the likelihood is quite small overall and no longer significant for Apple silicon Macs. On Windows, CrowdStrike’s Falcon sensor code runs as a kernel-mode driver with elevated privileges, which is why its bug can prevent a PC from booting successfully. On the Mac, the equivalent approach would require a kernel extension (kext), but Apple deprecated kexts starting in macOS 10.15 Catalina in 2019, pushing developers to use System Extensions instead. Kernel extensions can run on Apple silicon Macs only if the user drops system security to Reduced Security and explicitly allows third-party kexts to load. Don’t do that unless you have a really good reason.

In fact, the Mac version of CrowdStrike’s Falcon sensor reportedly used a kext on Intel-based Macs prior to macOS 11 Big Sur but has since switched to an EndpointSecurity System Extension. System Extensions run with standard user privileges, so even if one suffered from a critical bug, it shouldn’t be able to cause a kernel panic.

What about iOS and iPadOS? They’re even more secure than macOS because they have never allowed kernel extensions and don’t support anything like macOS System Extensions. All iOS and iPadOS apps are sandboxed, so they generally can’t affect the system or any other app. That’s not to say that iOS and iPadOS are perfectly secure or reliable, but they certainly rank highly among consumer-grade operating systems.

Apple devices may not be as vulnerable to a bug in an update to third-party software like CrowdStrike, but that doesn’t mean we can be complacent. Apple itself regularly releases updates, and while it’s essential to install them to patch security vulnerabilities, Apple’s engineers could make a mistake that would cause problems for millions. Howard Oakley’s article reminded me of when an Apple update inadvertently disabled Ethernet (see “El Capitan System Integrity Protection Update Breaks Ethernet,” 29 February 2016). Apple quickly addressed the problem, but the lack of Ethernet prevented some Macs from getting the revised update, requiring manual intervention.

What could happen to reduce the chances of an outage like this happening again?

Punish CrowdStrike? It’s tempting to say that CrowdStrike should somehow be held liable for the potential costs, estimated to exceed $1 billion. However, in most cases, CrowdStrike’s standard terms limit the company’s liability to a refund for fees paid. There may still be shareholder lawsuits—CrowdStrike’s stock fell nearly 20%—and SEC scrutiny. Overall, it was a terrible, horrible, no good, very bad day for CrowdStrike, but it almost certainly doesn’t mean the end of the company. Other firms will probably be more careful for a while, but if the mistake doesn’t prove hugely expensive for CrowdStrike, everyone may stick with current bad behaviors.
Write better code? The easy answer is that the team in charge of developing the update should never have made the mistake in the first place. CrowdStrike’s environment and policies are unknown, but there are programming practices that reduce the likelihood of such errors. In an ideal world, more attention would be paid to code quality, but it can be difficult for management to prioritize code quality over shipping more quickly.
Do better testing? Even if we give CrowdStrike the benefit of the doubt and say that the bug was a subtle mistake that could have slipped by any developer, I can’t see any excuse for why it wasn’t caught in testing. Either CrowdStrike wasn’t doing real-world testing—the company constantly releases updates like this—or someone messed up big time. As with writing better code, better testing is something everyone can agree should happen, but test teams may not be given the time or resources they need to do a good job.
Use a staged rollout? Companies that release updates to very large numbers of users don’t usually do so all at once. Instead, they release to small groups before expanding to the entire user base. That way, even if a bug has been introduced in development and slipped through testing, it won’t affect too many people before being discovered. Either CrowdStrike didn’t do this, or the problem affected only a subset of CrowdStrike users, so it could have been much worse. The only reason I can see why a company wouldn’t use staged rollouts is if it was patching a zero-day security vulnerability and felt that it was crucial to distribute to everyone as quickly as possible.
Switch to Macs, iPads, and iPhones? It’s good to have an active fantasy life. The kinds of inexpensive workhorse computers and servers affected by the CrowdStrike bug are exactly what Apple isn’t interested in building.

Plenty of other lessons could be taken away from the CrowdStrike debacle, but I worry that it will fall out of the headlines too soon for other companies to learn from CrowdStrike’s mistakes.

Comments About What Should Apple Users Take Away from the CrowdStrike Debacle?

Notable Replies

silbey

22 July 2024

Crowdstrike better hope that its lawyers wrote the clause protecting it really really carefully because otherwise they’re gonna get sued into oblivion.
Michael Schmitt

22 July 2024

The reason, in my opinion, that software quality is so poor is that we the consumers allow them to disclaim all responsibility for anything that goes wrong, including absence of liability for any resulting harms. The software doesn’t even have to be “fit for the purpose” of which it is sold!

For example, I quote from CrowdStrike’s Terms and Conditions:

CROWDSTRIKE AND ITS AFFILIATES DISCLAIM ALL OTHER WARRANTIES, WHETHER EXPRESS, IMPLIED, STATUTORY OR OTHERWISE. TO THE MAXIMUM EXTENT PERMITTED UNDER APPLICABLE LAW, CROWDSTRIKE AND ITS AFFILIATES AND SUPPLIERS SPECIFICALLY DISCLAIM ALL IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND NON-INFRINGEMENT WITH RESPECT TO THE OFFERINGS AND CROWDSTRIKE TOOLS. THERE IS NO WARRANTY THAT THE OFFERINGS OR CROWDSTRIKE TOOLS WILL BE ERROR FREE, OR THAT THEY WILL OPERATE WITHOUT INTERRUPTION OR WILL FULFILL ANY OF CUSTOMER’S PARTICULAR PURPOSES OR NEEDS.

Can you think of any other product that is allowed to say that it isn’t even guaranteed to perform its own purpose? For example, I’d be pretty upset if I bought a car that doesn’t drive. Or that if I bought a car seat and it caused the car’s wheels to come off.

But language like the above is standard in the software industry. MacOS probably says the same thing.

The lesson the world should take away is that it is time to disallow such exclusions.
silbey

22 July 2024

Michael Schmitt:

For example, I’d be pretty upset if I bought a car that doesn’t drive

Not a great example, given how many used cars are sold with an “as is” warranty…which means, that yes, you could buy a car that doesn’t drive.

(And don’t get me started on real estate, which has two entire industries — home inspectors and title search companies — whose entire reason for existence is to prevent customers from buying a house that will fall down from a person who doesn’t own the house).
CheckMeter

22 July 2024

It is, as said in the article, somewhat unrealistic to imagine that Macs could be utilised in place of Windows boxes, but surely there is another alternative? That would be for a version of Unix/Linux to be developed which is designed to be much more robust than Windows, which is, let’s face it, a decades old general-purpose operating system which is somewhat archaic in its inner structures.

It is pretty crazy that Windows, with all its architectural shortcomings, is employed for such mission-critical computer infrastructure across the western world.
jsrnephdoc

22 July 2024

I’m amazed that things were NOT worse. If it’s true that “the world runs on Windows,” and Crowdstrike is the 900 pound gorilla in its space, why didn’t the electric grid suffer, or the Social Security System go down, or the Postal Service? I asked at my local post office, and I was told they don’t use Windows, but rather an ancient OS. If Crowdstrike (the company) had to offer merchantability guarnatees, that might be in exchange for staged rollouts, rather than middle of the night installations when many moderate sized enterprises might not have anyone around to see that all the computers were entering doom loop reboots (or Crowdstrike could mandate that all customers have some “first tier machines” that were updated before wholesale installation on all of a company’s machines.

But what really amazed me was what DIDN’T break: power grids, gas utilities, water utilities, government operations, etc. The “why” in THAT bears careful investigation as a clue to how to make the next iteration less malignant rather than much worse.
Paul Albert

22 July 2024

Clicking on the section: “the user drops system security to Reduced Security”.
What would be the reasoning or applicable situations where one would do that?
Anybody have an answer to that? Thanks
Brian S

22 July 2024

In my era, a large company’s IT department major responsibility was testing new( or updates to existing ) vendor software thoroughly before implementing it live company-wide. It’s easy to blame the vendor and say they wrote the code and bug(s) or blame Microsoft for offering kernel access but, as this event shows, the end-user company must be aware of their own computer vulnerabilities( i.e. we’re running Windows and updates could take our machines down to BSOD ) and take all appropriate measures to prevent a catastrophe. Updates to critical software packages ( say SAP and similar ) were viewed skeptically until they proved themselves safe in a fenced playground.
Michel Hedley

22 July 2024

In defence of CrowdStrike, the company has to balance putting out a security update immediately to meet a new and lethal threat or holding back the update until exhaustively tested with the chance of the threat running amok.

With security threats impacting on global computer and communications systems every second, it would appear that CrowdStrike does a pretty good job in countering these security threats. The better approach might be for companies and organisations to spend money on building in redundancy into their systems.
Dennis Swaney

22 July 2024

silbey:

Not a great example, given how many used cars are sold with an “as is” warranty…which means, that yes, you could buy a car that doesn’t drive.

Not the same as CrowdStrike is the software OEM equivalent of GM, Ford, etc. while used car dealers are NOT the OEM of what they sell.
silbey

22 July 2024

I think it’s entirely comparable.
Jonathan Dodds

22 July 2024

Microsoft should explain why Windows allows kernel-mode drivers.
Bob Guy

22 July 2024

The travel, aviation, and medical sectors have been most visibly affected by the Crowdstrike/Microsoft mishaps. Millions of travelers, aircrews, and support personnel were (many are still stuck in airports). There are mountains of mishandled luggage in multiple airports from Auckland New Zealand, the long way around the World to Hawaii. This will take days to get the tightly wound systems back to normal.

The US Department of Transportation is holding airlines to comply with their rigid compensation policies f(same in the EU) for flight cancellation and delay expenses for their passengers. This will end up costing millions.

In the UK, thousands of medical appointments and operations were delayed or had to be rescheduled.

Some regulation is necessary to avoid a recurrence. If the industry does not propose reform on its own, it will be forced upon Crowdstrike and Microsoft. And yes, it is time to revise the disclaimers for software products to be in line with other forms of commerce.
[email protected]

22 July 2024

It’s not only crazy, but downright dangerous for the world, AFAIAC. I’m talking about world wide airlines that run Windows. They carry thousands of people every day. And our own military, which itself has had numerous failures because of their dependency on Windows. Maybe those failures so far haven’t caused a catastrophic event–but who is to say what will happen in the future? To say that Macs would not be an infinitely better choice for mission-critical work to me is folly, and feeds into the falsehood that Macs are inferior.
[email protected]

22 July 2024

The entire Social Security system did not go down but parts of it were definitely affected, including some of their website features.
Frans

23 July 2024

jsrnephdoc:

I’m amazed that things were NOT worse. If it’s true that “the world runs on Windows,” and Crowdstrike is the 900 pound gorilla in its space, why didn’t the electric grid suffer, or the Social Security System go down, or the Postal Service?

Good question. Part of an answer to that is that it’s not true that the whole world runs on Windows, fortunately there are many alternative OS-es in use. Furthermore, not every company using Windows uses CrowdStrike to protect their systems. Also, as I understood, only Windows 10 and 11 were affected, not older versions which can still be in use.
Frans

23 July 2024

Michel Hedley:

In defence of CrowdStrike, the company has to balance putting out a security update immediately to meet a new and lethal threat or holding back the update until exhaustively tested with the chance of the threat running amok.

I agree they have to balance speed to meet a threat with the time it costs to test their code. Having said that, I have not read anything about any lethal threat that necessitated an immediate update, but I could be wrong (I’m not a security expert, I’m a software engineer). But even then, at least some basic testing should have been performed. The fact that so many systems were affected by this bug and the impact of it raises the question why this bug was not caught? Simply installing it on a few different systems would have revealed that at least some of them would no longer boot up, no intensive testing required. IMHO CrowdStrike has acted irresponsibly and needs to review their procedures ASAP.
Nello Lucchesi

23 July 2024

Jonathan Dodds:

Microsoft should explain why Windows allows kernel-mode drivers.

According to the Wall Street Journal, a 2009 agreement with the European Union requires Microsoft to allow kernel extensions:

A Microsoft spokesman said it cannot legally wall off its operating system in the same way Apple does because of an understanding it reached with the European Commission following a complaint. In 2009, Microsoft agreed it would give makers of security software the same level of access to Windows that Microsoft gets.

Source (Gift Link): https://www.wsj.com/tech/cybersecurity/microsoft-tech-outage-role-crowdstrike-50917b90?st=ddhtag1onr5oqck&reflink=desktopwebshare_permalink

Mashable says that this is the EU agreement referred to in that quote:

Stories – 16 Dec 09

Microsoft Statement on European Commission Decision - Stories

Brad Smith, Senior Vice President and General Counsel, on the European Commission’s decision in favor of formal adoption of measures Microsoft has offered to address competition law issues.

Est. reading time: 3 minutes

Source:
Mashable – 23 Jul 24

Microsoft says EU rules made CrowdStrike outage possible

Macs do not have this problem.
silbey

23 July 2024

And here I thought Apple was doing the “locked down” thing for its operating systems to sell more product.
Nello Lucchesi

23 July 2024

Frans:

jsrnephdoc:

… why didn’t the electric grid suffer, or the Social Security System go down, or the Postal Service?

Also, as I understood, only Windows 10 and 11 were affected, not older versions which can still be in use.

Yes, apparently running on Windows 3.1 is what saved Southwest Airlines

But Southwest reported that its operations were completely unaffected.

That’s because major portions of the airline’s computer systems are still using Windows 3.1, a 32-year-old version of Microsoft’s computer operating software. It’s so old that the CrowdStrike issue doesn’t affect it so Southwest is still operating as normal. It’s typically not a good idea to wait so long to update, but in this one instance Southwest has done itself a favor.

Source:
GovTech – 19 Jul 24

Why isn’t Southwest affected by the CrowdStrike/Microsoft outage?

Answer: Because it’s using an outdated version of Windows.
silbey

23 July 2024

Nello Lucchesi:

Yes, apparently running on Windows 3.1 is what saved Southwest Airlines

Interesting watching fake news get created in real time. The source of the “Southwest uses Win 3.1” was a random guy on twitter making a joke, which Yahoo then picked up, and then the GovTech site got it from Yahoo. If you follow the GovTech links to the Yahoo page, you find this one:

twitter.com

Artem Russakovskii (@ArtemR) on X

@ArtemR

Delta, United, American Airlines flights are all grounded right now. The reason Southwest is not affected is because they still run on Windows 3.1. https://t.co/ezFubvKVNA

About which he later notes: “To be clear, I was trolling last night…Yahoo News is quoting me as a source. This is getting out of control.”
Halfsmoke

23 July 2024

jsrnephdoc:

But what really amazed me was what DIDN’T break: power grids, gas utilities, water utilities, government operations, etc.

A major factor is that many government agencies and service providers do not run systems modern enough to use Crowdstrike software or OS’s that would be affected by the Crowdstrike update crisis.

One recent example, among many:
Ars Technica – 10 Apr 24

5.25-inch floppy disks expected to help run San Francisco trains until 2030

"We have a technical debt that stretches back many decades."

and, also in Northern California:
The Mercury News – 17 Sep 22

How clever mechanics keep 50-year-old BART trains running: Windows 98, eBay,...

Keeping BART running is far from easy. Mechanics rely on frankensteined laptops operating Windows 98, train yard scraps, and vintage computer chips.
nextstep

23 July 2024

Exactly. Crowdstrike was able to do this because Microsoft permits developers to modify the Windows kernel; Apple no longer permits access to the kernel.
Jonathan Dodds

23 July 2024

Thanks for that information and for the links.

To summarize, Microsoft is explaining that the European Commission is culpable for the CrowdStrike outage.
Jaimes

23 July 2024

In February 2024, Crowdstrike announced layoffs in the USA and moved most tech jobs to low-cost India. They proudly announced this via a press release. In a stark reminder that history doesn’t repeat, but it does echo, George Kurtz who is CEO of Crowdstrike was the CTO of McAfee when it famously launched The Great McAfee XP Bricking Fiasco of 2010. Coincidence or simply confirmation that dross floats to the top?
Nello Lucchesi

23 July 2024

Apparently prior CrowdStrike updates brought Linux systems down.

Neowin

CrowdStrike broke Debian and Rocky Linux months ago, but no one noticed

CrowdStrike recently caused a widespread Blue Screen of Death (BSOD) issue on Windows PCs, disrupting various sectors. However, this was not an isolated incident, CrowdStrike affected Linux PCs also.
Halfsmoke

24 July 2024

Bugs and mistakes have no connection to nationality or countries. Anytime humans are involved with something, no matter the geographic location, errors will occur.
John Man

24 July 2024

What I saw mentioned somewhere but doesn’t seem to be mentioned much is that Crowdstrike have also in the past done good for not only Windows users but also Mac users in (as I understand it) discovering and reporting security vulnerabilities in macOS etc. For example:

Apple Support

About the security content of macOS Sonoma 14.5

This document describes the security content of macOS Sonoma 14.5.

" AppleVA

Available for: macOS Sonoma

Impact: Processing a file may lead to unexpected app termination or arbitrary code execution

Description: The issue was addressed with improved memory handling.

CVE-2024-27829: Amir Bazine and Karsten König of CrowdStrike Counter Adversary Operations, and Pwn2car working with Trend Micro’s Zero Day Initiative"

etc
Adam Engst

24 July 2024

Here’s CrowdStrike’s public explanation of what happened and how they intend to prevent it in the future with more testing (duh!) and staged rollouts (double-duh!) with customer control and release notes.

crowdstrike.com

Falcon Content Update Remediation and Guidance Hub | CrowdStrike

Access consolidated remediation and guidance resources for the CrowdStrike Falcon content update affecting Windows hosts.

Est. reading time: 1 minute
Michael Schmitt

24 July 2024
My summary of the long posting…

CrowdStrike posted a Preliminary Post Incident Review.

It basically says that they don’t test Rapid Response Content (the channel file update that was pushed). What’s supposed to happen is:
1. The Content Validator (in the cloud) is supposed to perform validation checks on the file before it is published
2. The Content Interpreter (on the machine) is supposed to “gracefully handle exceptions from potentially problematic content”
What actually happened was:
1. Due to a bug in the Content Validator, the bad content data passed validation
2. The bad content data caused an out-of-bounds memory read in the Content Interpreter
3. The out-of-bounds memory triggered an exception
4. The “unexpected exception could not be gracefully handled”…
5. …resulting in a BSOD
The way they plan to prevent this from happening again is to do the things that they should have been doing all along, such as:
- Test Rapid Response Content before deployment
- Validate harder
- Improve error handling
- Stagger deployment instead of everywhere all at once
- Start with a “canary deployment” (to a machine that’s sole purpose is to see if it goes wrong)
- Monitor whether the deployment causes problems
- Allow customers to control when the Rapid Response Content is deployed
- Document what they’re releasing
A full Root Cause Analysis is forthcoming.
MacGuyver

24 July 2024

So the usual “who/what is to blame” in big events, followed by analyses and promises to do better… then relaxing over time due to the need for reducing costs and increasing speed/output, or perhaps the simple human response to periods of non-crisis.

This was a big public-facing issue that got a lot of press… and it didn’t surprise me in the slightest. However, blaming the EU or thinking macOS is immune to this or any other angle is missing the real point: Single-point-of-failure design requires greater testing and controls to avoid critical / widespread crises.

Beyond that, regulatory systems often (if at all) levy meager fines at any corporation that commits a grievous error which inflicts damages to other parties in the form of lost time, money and resources. Pay and move on. Nothing learned. Penalty is the cost of doing business.

Anyone remember the February AT&T wireless failure from a botched update?

FCC Public Safety and Homeland Security Bureau analyzed network outage reports and written responses submitted by AT&T and interviewed AT&T employees. The bureau’s report said:

The Bureau finds that the extensive scope and duration of this outage was the result of several factors, all attributable to AT&T Mobility, including a configuration error, a lack of adherence to AT&T Mobility's internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls to ensure approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a variety of system issues that prolonged the outage once the configuration error had been remedied.

Ars Technica – 23 Jul 24

AT&T failed to test disastrous update that kicked all devices off network

AT&T caused outage that blocked 92 million calls, 25,000 attempts to reach 911.

This was more than just a bad patch. It was a systematic failure. The Ars Technica article also mentions a similar Verizon outage from December that only lasted a couple hours in certain states due to a similar lack of process compliance.

I have no love for AT&T, having witnessed firsthand their lumbering (dis)organization made up of too many parts that often do not communicate or work with each other on important things like updates. Two examples:

A large regional AT&T team that manages fiber backbone & Enterprise connections has described to me how they periodically have a day where hundreds of new tickets (work orders) appear and waste hours of their time when the tickets turn out to be re-opened issues from the past that had been closed. This sometimes occurs after updates pushed from another AT&T division. Their manager has explained this to the higher ups and the source departments, even begging them to at least inform the team when an update is pushed so they can more quickly determine if it is going to be one of those days… And for years AT&T has changed almost nothing about the process, and never giving them notice of an update.

I could relate more of these stories, (like spending 2 days on the phone talking to 12-14 departments/divisions to get a client’s yahoo email saved when they cancelled an old “Business” DSL/phone account… all it took was someone un-checking a box on their screen), but I think it really comes down to:

Does the threat of penalty dissuade a business from doing or NOT doing something?

Has anyone followed the recent Boeing saga?
David C.

24 July 2024

Dave Plummer (veteran Microsoft system software developer) shared videos explaining what happened from a technical standpoint. For the benefit of those who may find this of interest:
Nello Lucchesi

24 July 2024

MacGuyver:

… blaming the EU or thinking macOS is immune to this or any other angle is missing the real point …

I think this is a pretty exhaustive list of who’s/what’s been blamed and why the blame is misplaced:

Stratechery by Ben Thompson – 22 Jul 24

Crashes and Competition

The recent Windows crashes are downstream from trying to encourage competition in areas Microsoft should have never made open to begin with, highlighting the challenge for regulators.
Simon

25 July 2024

Children are taught from an early age that it’s risky to put all your eggs in the same basket. Yet corporate IT around the globe has done that for years, if not decades. They all love Windows and MS Office and they always end up flocking to the same corporate security solutions. Then when that same basket they’re all holding develops a crack and all the eggs fall out the stunned gasping and pearl clutching starts. But why? It’s the entirely expected outcome from making the same simple mistake over and over again. The same mistake every child is taught not to make. You’d think these billion $ departments full of supposedly smart people wouldn’t trip over such a mundane obstacle. Yet here we are. The same crowd that otherwise misses no opportunity to display virtue by calling for diversity, is stunned when the complete lack of diversity leads to meltdown. Shocker.
ssteiff

25 July 2024

IMHO the question isn’t what should Apple users take away from the CrowdStrike debacle, but rather, what should every user and corporate IT/Security manager take away from it?
Yeah, The CrowdStrike bug did not affect Mac/Linux users. But it also didn’t affect Windows users who did not install CrowndStrike. It affected primarily corporate users whose corporate IT/Security officers selected the CrowdStrike solution and enabled automatic updates. So another list of users not affected by the debacle are corporate users who had CrowdStrike installed, but whose IT/Security officers have disabled auto-update.
The problem lies with over-centralization of the corporate security. Corporate Mac users were saved this time but if a Mac user hands over control of their Mac to corporate IT - sooner or later disaster will strike. A disaster that affects Macs too. It’s just a matter of time. Be that CrowdStrike, bitlocker, NetSkope, or another centrally managed security solution. IT/Security officers love the feeling of being able to control everything and fight off naughty users who dare plug a USB stick into their laptops. Obviously, they all have good intentions in mind. They mean good. They want to protect corporate assets. Even if that leads to getting all Windows machines off the grid for hours. A friend of mine shared with me, in real-time, how the CrowdStrike share price dropped during the debacle and I replied: “Forget about THEIR share price. Look at OUR sales. Look at the airlines that could not get a plane off the ground. Look at the banks that could not transact. We’re talking billions. Not just one company!”.

I use a Mac for my corporate work, but it’s a BYOD so I did not install CrownStrike or bitlocker or whatever our IT wanted me to install. I was exempt. I am participating in our NetSkope SASE pilot and hate this product, not because it does a bad job, but because it blocks too many sites, the AI engine isn’t that intelligent, and I have to manually ask our IT to manually configure the NetSkope gateways to allow those sites through. The good thing is that I can disable it when I want, and revert to our good-old VPN when I need to connect from away into our corporate intranet. Am I really happy with it? Not at all! NetSkope is invasive. It installed itself on my Mac in such a manner that it affects all users. I do keep a personal user on my Mac for non-corporate stuff and I find that NetSkope item at the top of my screen even if I boot into that non-corporate user. I can turn it off, but it’s on by default. Annoying. Invasive. Unwanted.

Where am I taking that? I leave it up to IT and Security officers to do their homework and rethink their security strategies. But for us, mortals, the simple users - my advice is - stay away from those corporate-controlled security solutions and keep your devices secure using user-end products.

I told that friend of mine from the CrowdStrike share story (who also offered quite a few useful tips that helped a lot of my colleagues restore their machines, but is also a die-hard fan of tight centralized IT control of our lives) that the day they force me to surrender admin of my Mac to corporate IT will be the day I announce my early retirement.
Nello Lucchesi

25 July 2024

Apparently CrowdStrike is offering its partners a $10 Uber Eats gift card as an apology.

TechCrunch – 24 Jul 24

CrowdStrike offers a $10 apology gift card to say sorry for outage | TechCrunch

Several people who received the CrowdStrike offer found that the gift card didn't work, while others got an error saying the voucher had been canceled.

Est. reading time: 4 minutes
John

26 July 2024

Interesting. In the UK goods sold must be ‘fit for purpose’, regardless of what the manufacturer’s lawyers might say. This is one of the things I’ve always thought explains why products are more expensive in Europe than the US.
John

26 July 2024

My takeaway is that the concept of auto-updates is fundamentally flawed. The idea that someone is able to change my machine at their convenience, without me checking what they are going to do or finding the right time, is nuts, IMHO. I hope this incident means I don’t have to explain why, and this wasn’t even a supply chain malicious attack.

I and a number of others have had an ongoing discussion on github with Brave about that, which is grand fun. Blah blah blah ‘but security and urgency’ is their reasoning for jumping with both feet and eyes closed.
26 July 2024

CheckMeter:

It is pretty crazy that Windows, with all its architectural shortcomings, is employed for such mission-critical computer infrastructure across the western world

Not only does Windows have many, many millions more users than Apple are, most of Windows’ hardware and software products usually cost a lot less than Apple’s.

This CrowdStrike disaster is one example that cheaper is not necessarily better. Their services have not had any problems with their Apple users.
Nello Lucchesi

28 July 2024

Microsoft’s Incidence Response is a very technical explanation of the CrowdStrike failure.

Microsoft Security Blog – 27 Jul 24

Windows Security best practices for integrating and managing security tools |...

We examine the recent CrowdStrike outage and provide a technical overview of the root cause.

Est. reading time: 16 minutes
Will M

28 July 2024

Nello Lucchesi:

Apparently CrowdStrike is offering its partners a $10 Uber Eats gift card as an apology.

Several people who received the CrowdStrike offer found that the gift card didn’t work, while others got an error saying the voucher had been canceled.

Priceless.
David C.

28 July 2024
Dave Plummer’s videos came to the same conclusion. The CS kernel driver tried to dereference a bogus memory pointer, which caused the crash. And it seems directly related to the fact that one of CS’s data files was all-zeros - almost certainly a mistake.

So there’s some shared blame to go around here:
- CS’s kernel driver should be validating data files as a part of loading them, before using their content. Especially because (it seems) that the content of these files affects the behavior of the (audited, approved and signed) kernel driver software.
- CS should be testing their releases. If they had tested this update on an in-house computer, they would almost certainly have seen the crash and fixed it before it got out to the rest of the world. It takes less than an hour to do a test like this. It is going to take weeks for the rest of the world to clean up their mess.
- Microsoft should probably update the WHQL program to require similar auditing and signing of data files that affect kernel drivers in this way. Even if doing so will delay the release of a security update. CS has proven that you can’t trust vendors like this.
- The laws in the EU need to change. The Mac version of CS doesn’t have this problem because it uses Apple’s security framework instead of a kernel driver.
  
  Microsoft allegedly developed something similar for Windows but European lawmakers decided that they couldn’t use it because it was going to be restricted to only approved security companies, which would be illegal anti-competitive behavior. Maybe this mess will convince them to reconsider their decision? Probably not, but one can always hope.
  
  Sadly, I think that the EU may actually respond by outlawing Apple’s API, in the name of “fairness”. They seem to enjoy carefully analyzing every problem and mandating the dumbest possible of all “solutions”.
deemery

29 July 2024

I have been calling for liability for bad software and licensing for software engineers for -decades- (in part to establish and in part to limit liability - the same way it works for civil engineers). The computer professional societies have opposed this, even though they would be instrumental in establishing licensing terms and procedures. I remember one exchange with someone who in most other matters I really respected. “You mean licensing, like barbers?” “No. I mean licensing like doctors and civil engineers.”

But CrowdStrike blaming their -test software- for what was clearly failures in design, in coding, and in release management done by developers is truly appalling. Microsoft, though, has trained us all to believe “software failure is inevitable.”

We do know ways to produce better software. Better, less error-prone programming languages is one approach. So is the use of formal methods and assertions, etc. You don’t necessarily have to prove the entire program correct to gain substantial benefits from adopting these techniques.
stottm

29 July 2024

ALL of the suggested improvements CrowdStrike is promising to make should have been in place long ago. This is not rocket science, these best practices were not rigidly followed and that’s why the outage occurred.

In June, CrowdStrike sent a different faulty channel update to Enterprise Linux RHEL, Debian, Ubuntu all had many kernel panics. That was almost 30 days later when they did it to Windows! It’s not supposed to be possible to crash the Linux kernel when writing a kernel module. So I am not sure how or why it did manage to cause Linux kernel panics.

One could make a legal argument based on negligence and that would supersede any EULA legal junk. If the courts accept such a lawsuit, CrowdStrike is going to be in serious financial trouble. The CEO likely will be replaced.

The Falcon software receives rapid response updates that are like virus definitions on steroids. This faulty update was one of that style of update. It’s a major selling point feature of Falcon. The majority of instructions to the Falcon sensor are included with software upgrades of the Falcon sensor. The rapid response updates are intended to provide protection against an in-the-wild and actively being exploited attack. This particular update was to protect against bad actors abusing a Named Pipes developer feature in Windows. The CrowdStrike issue was not a software update but it does impact the software. When Falcon reads this faulty content at boot, it crashes the Windows kernel causing the BSOD and an automatic reboot. Removing the faulty content fixed the problem but in order to do so on an enterprise laptop means you need the BitLocker Recovery key and possibly the local administrator password if you have to use Safe Mode. If you can get into WinRE Recovery Environment then you can get to Troubleshooting → Advanced Options → Command Prompt which runs as the SYSTEM god-mode user account. That meant manual intervention by IT staff who had to either walk each user through the fix remotely on the phone. Or hands-on and possibly using a bootable flash drive. In our case, half our PC’s were installing Windows Updates when CrowdStrike crashed Windows half-way through. Most were recovered but we had around a half-dozen laptops get bricked to the point of needing to be swapped to get the user back online and working. Servers were easier to fix due to no BitLocker. VM’s had to have their virtual disk disconnected and mounted on a different VM then you deleted the offending CrowdStrike Falcon C-00000291*.sys and dismount and remount to the original VM. For hard servers we could access via BMC (lightsout / ipmi / etc) then we could get it to boot into WinRE and we deleted the file and rebooted. We were able to automate both server fixes and all the servers were working early Friday morning. But yeah… We conscripted any IT staff with a pulse that could follow directions and help someone else do the same over the phone. We didn’t even log tickets for the end user fixes. We just shared a spreadsheet and went as fast as possible. Had our offshore IT workers covering after hours. Prioritized critical staff and fixed them first.

Having spent late Thursday last week through Monday fixing tens of thousands of servers and laptops; I can say that we are not pleased with CrowdStrike. Despite the mess, CrowdStrike is normally really good software and it does a far better job than any other EDR - security tool. I would expect a license discount of significant proportions for the next year or two. We were fortunate we were able to get out of this mess with only 2 business days of downtime and the weekend. We brought up customer facing solutions first then the VDI (virtual servers / workstations). Only then did we tackle the end user workstations, mostly laptops. I was certainly glad to have a Mac for work as my own PC started experiencing the BSOD. It enabled me to fix it myself as I have access to the BitLocker Recovery Keys and Microsoft LAPS to look up the randomized local administrator password.

The EU mandated Microsoft not block access to the kernel because it means Microsoft has access to private API’s and the low level kernel which gives them an unfair advantage competing with 3rd party developers. However, Apple did revoke kernel access while adding alternative API’s and system extensions. Microsoft needs to do the same thing. They also need to explain that fairness shouldn’t apply to kernel ring level 0 and that only Microsoft should have kernel level access the same way Apple only has kernel level access. There’s a big difference between providing developers full unrestricted kernel access vs granting them only the access they need and to sandbox the 3rd party code from being capable of causing a kernel panic.
stottm

29 July 2024

The July Microsoft Windows Update may result in an unexpected BitLocker Recovery Key prompt and this has nothing to do with CrowdStrike causing Windows kernel panics (BSOD).

Not all but some people had CrowdStrike fixed only to reboot and get another BitLocker Recovery Key prompt. We rotate the recovery keys as they are single-use. Once you use a recovery key it boots up and an MBAM client phones home and then rotates the key to a new one and escrows it back into Azure AD / MBAM.

I saw this more than a few times after fixing CrowdStrike, but thankfully not everyone. Some PC’s were half-way through the Win Updates when Falcon started the BSOD boot loop. Once we got them into Safe Mode, the updates will uninstall, reboot and uninstall some more. This might mean entering the BitLocker Recovery Key 2-3 times. Once you delete the C-00000291*.sys Falcon content update and reboot it would be fine. But the next time the user reboots, those Win Updates are re-installed and it may still come back with a BitLocker Recovery Key prompt. Calling the Help Desk and putting in the newest recovery code will fix as it won’t do it again.

Microsoft isn’t free and clear: 1) they should block 3rd party developer access to the operating system kernel. 2) They should provide a sandboxed safe API for security related apps to utilize. Unsafe actions such as writing to a protected memory space should not be allowed. There should be enough warnings about the revocation of kernel access and hopefully by Win 12 it will be all worked out.

Ever since Apple revoked access to kernel extensions, I’ve not witnessed a single kernel panic in macOS across many thousands of Macs.
stottm

29 July 2024

Only CrowdStrike customers were impacted. That includes many of the top corporations in the world. The Falcon enterprise license is $184.99 per computer and another tier that is negotiable depending on how many computers you have.

CrowdStrike is a security research think-tank and they consult with entities who have been compromised by malware and hackers. They investigated major breaches like Columbia Sony Pictures. If a new novel attack is discovered, they can create a rapid response channel update and push it out to all their customers in a very short period of time. Thus providing proactive protection against an in-the-wild threat that is actively being exploited.

The 291 channel update was intended to defend against hackers abusing a developer interface known as Named Pipes. Somehow that configuration was malformed and it wasn’t caught by their normal testing methodologies. They are adding several checks to their pipeline before it is published to Falcon sensor clients. They are also going to update the client to do it’s own sanity checks when reading a bad channel update. The hex zeros written to the file appear to be due to Windows crashing and the file being reset to all zeros. However, there was originally different content for the first crash. Subsequent crashes came about because Falcon read that faulty file with zeros that based on the amount of data in the file, Falcon jumped to a portion of memory it wasn’t supposed to and attempted to alter what was in that memory location. Doing so, caused the operating system kernel to panic triggering the BSOD at every boot. Windows then reboots after writing the BSOD details to the logs and it may keep some DMP files for debugging. It immediately crashes on boot every time the Falcon sensor loads that bad content update.
Jolin

29 July 2024

David C.:

The laws in the EU need to change.

[…]

Microsoft allegedly developed something similar for Windows but European lawmakers decided that they couldn’t use it because it was going to be restricted to only approved security companies, which would be illegal anti-competitive behavior.

Microsoft has blamed the EU but it’s not clear this is the case. Doesn’t seem hard to believe that Microsoft is blaming ‘regulation’ when it’s actually Microsoft’s fault that Windows has a poor security architecture.

x.com

Ian Brown ð®ð¨ ð¦£ ð¦ (@1Br0wn) on X

@1Br0wn

ANOTHER opinion piece [https://t.co/ndJwPOWTFv] repeating Microsoft’s claim the EU is responsible for the #CrowdStrike debacle. You can read the interoperability undertaking Microsoft made in 2009 yourself… no, it does NOT require kernel access for Windows competitors
Nello Lucchesi

30 July 2024

silbey:

About which he later notes: “To be clear, I was trolling last night…Yahoo News is quoting me as a source. This is getting out of control.”

More confirmation that the rumor about Southwest still using Windows 3.1 is a hoax:
pxlnv.com

Southwest Airlines Did Not Dodge the CrowdStrike-Caused Outage Thanks to...

Thom Holwerda: A story that’s been persistently making the rounds since the CrowdStrike event is that while several airline companies were affected in one way or another, Southwest Airlines escaped the mayhem because they were still using Windows...

UPDATE July 31, 2024 12:14 PM

Additional sources confirm that stories about Southwest using Windows 3.1 are misinformation.

An OS News article links to the tweet by Artem Russakovskii that was the hoax’s origin. The article also links to an The Dallas Morning News article, titled “What’s the problem with Southwest Airlines scheduling system?”, that claims that Southwest used obsolete versions of Windows.

https://www.osnews.com/story/140301/no-southwest-airlines-is-not-still-using-windows-3-1/

ABC reported that neither Southwest nor Alaska use CrowdStrike

ABC News

Most airlines except one are recovering from the CrowdStrike tech outage. The...

Delta Air Lines is struggling for a fourth straight day to recover from the tech outage, even as other airlines are returning to nearly normal levels of service
David C.

30 July 2024

Nello Lucchesi:

More confirmation that the rumor about Southwest still using Windows 3.1 is a hoax:

Joking aside, if you ever glance behind the counter at all kinds of businesses (including airlines and auto repair), you’ll find that quite a lot are still running an IBM 3270 terminal emulator that is (presumably) connecting to an app running on a mainframe somewhere.

For apps like this, there is no need whatsoever for a modern computer. An 8088 PC running MS-DOS can run a 3270 terminal emulator just as well as a brand new PC or Mac running Windows or macOS.

Which brings to mind an interesting thought. For all those businesses that are still using 3270 emulation, why are they bothering to run full-featured operating systems like Windows (or macOS)? Why not instead get the simplest hardware that your IT department can maintain (which can be the cheapest PC made or a Raspberry Pi or other similarly small systems) running an operating system stripped-down to the minimum necessary features for running the 3270 emulator.

Linux can be stripped-down that far, to the point where you can boot it from a small read-only microSD card, which would be very secure, simply because there isn’t very much software to attack and a reboot would reset everything.
silbey

31 July 2024

David C.:

Why not instead get the simplest hardware that your IT department can maintain (which can be the cheapest PC made or a Raspberry Pi or other similarly small systems) running an operating stripped-down to the minimum necessary features for running the 3270 emulator.

It’s a great question – my suspicion is that a lot of businesses don’t want to go through the annoyance of trying to run something even vaguely custom. They’d much rather buy something “off the shelf” (so to speak) with lots of promises from salespeople.

The allied question is why upgrade when you have a solution that works? We hear stories about organizations still running off floppy disks etc, but if it works why take the risk and go through the effort of upgrading it? If it becomes unrepairable or the supplies aren’t available, then sure. But otherwise? Especially with something that’s not connected to the Internet and thus without serious security risks…

But really, I’m just posting to get to the point where I can add one of my favorite news stories of the last two decades – the business in Texas still doing (as of 2010!) their payroll on an IBM 402, a machine released in 1948!

kottke.org

Don’t mess with Texas’s old computers

As recently as last year, a liquid filtration company in Texas was still using a computer built in 1948 to run all of its accounti
Adam Engst

31 July 2024

David C.:

Why not instead get the simplest hardware that your IT department can maintain

I can only imagine what it’s like to be a young person getting a job in an IT department for a business that still relies on a mainframe app. “Kid, pull up your VT100 and let me show you some COBOL.”
Michael Schmitt

31 July 2024

A fun trick to play on the newbies is to switch their TN3270 client into APL mode when they’re away from their desk.

What Should Apple Users Take Away from the CrowdStrike Debacle?

Subscribe today so you don’t miss any TidBITS articles!

Comments About What Should Apple Users Take Away from the CrowdStrike Debacle?

Notable Replies

Join the discussion in the TidBITS Discourse forum

Participants

Share

Subscribe today so you don’t miss any TidBITS articles!

Comments About What Should Apple Users Take Away from the CrowdStrike Debacle?

Notable Replies

Join the discussion in the TidBITS Discourse forum

Participants