Apple Network Failure Destroys an Afternoon of Worldwide Mac Productivity
Somewhere around 3:25 PM Eastern Standard Time on 12 November 2020, my 27-inch iMac running macOS 10.15.7 Catalina started to behave oddly, displaying the dreaded “spinning pizza of death” wait cursor when trying to perform operations that are typically lightning fast. I decided to reboot, as one does. Interestingly, Josh Centers had just told me that he was rebooting his iMac as well because, as he said in the TidBITS Slack, “Mojave has gotten a little wonky.”
Rebooting didn’t fix anything and, in fact, made things worse because then we couldn’t launch any non-Apple apps on our Macs. Mail and Safari launched fine, but other apps did not. Clicking an app icon in the Dock did nothing other than cause our Macs to make an unfriendly “ding” noise.
We hadn’t yet discussed our mutual Mac headaches, and Josh had become convinced that something had gone wrong with his iMac’s SSD, so he booted into Recovery Mode and began running diagnostics. Then he switched to his MacBook Pro and found that it wouldn’t launch apps either.
Josh’s next message in Slack (from his iPhone) was:
This is extremely weird. I can’t launch Slack or Firefox on either of my Macs. Is anyone else seeing something like this?
To which I replied (from my iPhone):
I just rebooted due to my iMac being a little weird, and none of my login items launched. I was able to launch the App Store app, and some updates are downloading. Preview launches, but neither Firefox nor Slack do.
Josh checked Twitter and found a post from developer Jeff Johnson, the guy behind StopTheMadness (which improves the Web browser experience), Link Unshortener (which reveals the destination of shortened links), and Underpass (for peer-to-peer file transfer and chat with end-to-end encryption). Johnson’s tweet, which went viral, explained what was happening: the macOS trustd process was trying and failing to connect to a server called ocsp.apple.com.
Non-Apple apps actually were launching, but only after their attempts to connect to ocsp.apple.com timed out. A successful connection to ocsp.apple.com is not required for apps to launch, which is why you can launch apps while entirely offline. That’s why Johnson suggested blocking ocsp.apple.com using Little Snitch or another firewall, or just disconnecting from the Internet whenever you wanted to launch an app.
Shortly after that, others offered the more straightforward solution of adding a line to the /etc/hosts file that maps hostnames to IP addresses in a way that overrides DNS. If you pointed ocsp.apple.com to 127.0.0.1 or 0.0.0.0 in /etc/hosts, connections to ocsp.apple.com failed instantly, returning the Mac to normal operating status. I’m not providing those instructions here because they’re no longer necessary and, in general, messing with /etc/hosts isn’t something you should do unless you already understand how it works. If you did edit /etc/hosts in this way, you should remove that line; Brian Matthews provided a command-line recipe for that in TidBITS Talk.
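For readers who did add such an override, the cleanup amounts to dropping any hosts-file line that names ocsp.apple.com. The following is a minimal Python sketch of that filtering step, not Brian Matthews’s recipe. The function name strip_host_override is hypothetical, and it works on text passed in rather than on /etc/hosts itself, so it changes nothing on disk.

```python
def strip_host_override(hosts_text: str, hostname: str = "ocsp.apple.com") -> str:
    """Return hosts_text without any active entry that maps `hostname`."""
    kept = []
    for line in hosts_text.splitlines():
        # Anything after '#' is a comment in a hosts file, so ignore it.
        active = line.split("#", 1)[0]
        fields = active.split()
        # fields[0] is the IP address; the remaining fields are hostnames.
        if len(fields) >= 2 and hostname in fields[1:]:
            continue  # drop the override line
        kept.append(line)
    return "\n".join(kept) + ("\n" if hosts_text.endswith("\n") else "")


# Illustrative sample, not a real hosts file.
sample = (
    "127.0.0.1 localhost\n"
    "0.0.0.0 ocsp.apple.com\n"
    "# 127.0.0.1 ocsp.apple.com (already commented out)\n"
)
print(strip_host_override(sample), end="")
```

Lines that are already commented out are left alone, since they have no effect on name resolution.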
After an hour or so, Apple fixed the problem, and everything returned to normal.
How Did This Happen?
So what was going on? As I understand it, at app launch, Apple’s Gatekeeper technology checks the certificates that Apple assigns to developers to sign their code. The name of the Apple server in question, ocsp.apple.com, points to Apple using OCSP (Online Certificate Status Protocol) to determine if an app’s certificate has been revoked. If it has, macOS prevents the app from launching; it’s Apple’s way of ensuring that it can prevent an app discovered to be malicious from causing more damage. (You may remember that HP just suffered from self-inflicted problems after it unintentionally revoked a certificate; see “Code-Signing Snafu Breaks Many HP Printers,” 26 October 2020.)
What prevented ocsp.apple.com from responding? I doubt Apple will ever share details, and heads may already have rolled, but my understanding is that the massive load from releasing macOS 11 Big Sur resulted in the failure of a CDN—a content delivery network—that Apple uses to handle such situations (this particular one appears to be run by Akamai Technologies, which is not unusual). Since Big Sur weighs in at 12 GB, compared to 8 GB for Catalina, it’s not entirely surprising that the load would be much higher. Plus, of course, Apple has sold millions more Macs in the last year.
Support for this theory comes from the fact that other Apple services were down that day as well. Apple’s System Status page showed problems with Apple Card, Apple Pay, iMessage, macOS Software Update (those Big Sur downloads), and Maps.
Apple Caused a Massive Waste of Time
It’s hard to overstate the effect this problem had on the Mac world. Although Josh and I were able to get our iMacs working properly again reasonably quickly, the rest of our afternoon disappeared into trying to figure out what was happening. In the MacAdmins Slack, IT admins and consultants were doing the same, not just because of their personal Macs but also because they were being deluged with calls, email messages, and trouble tickets from their users and clients. Developers received bug reports demanding fixes, and the problem disrupted many online presentations, meetings, and conferences taking place during that time. A Hacker News thread about the problem garnered over 1150 comments, including some from Mac users who, like Josh, wasted significant time with troubleshooting, worried that their Macs had suffered a hardware failure.
Apple may not have actually taken every Mac in the world offline, but this network failure wasted several hours of time for what must have been millions of Mac users. (I suspect that people who weren’t attempting to launch apps during this time might not have noticed.) Nothing will give us that time back, but an acknowledgment and apology would be welcome.
This debacle also threw a spotlight on what seems like a weak point in macOS. It’s clear that Apple designed trustd to fail silently and gracefully when a Mac is offline, but why is there such a long timeout in the event of a network failure? Are there other components of macOS that make similar checks in everyday usage that could hurt the user experience in error conditions?
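The difference between being offline and talking to a server that never answers can be sketched in a few lines of Python. This is only an illustration of a “soft-fail” policy with a timeout, not Apple’s trustd code; every name here (may_launch, offline_check, hung_check) and the 0.2-second timeout are assumptions made for the example.

```python
import time
from typing import Callable


def may_launch(is_revoked: Callable[[float], bool], timeout_s: float) -> bool:
    """Soft-fail policy: block only on a definite 'revoked' answer."""
    try:
        return not is_revoked(timeout_s)
    except OSError:
        # Covers an instant failure (offline) as well as a timeout,
        # because TimeoutError is a subclass of OSError: allow the launch.
        return True


def offline_check(timeout_s: float) -> bool:
    # Hypothetical stand-in for a Mac with no network: fails at once.
    raise ConnectionRefusedError("no route to responder")


def hung_check(timeout_s: float) -> bool:
    # Hypothetical stand-in for a server that accepts but never answers.
    time.sleep(timeout_s)
    raise TimeoutError("responder did not answer in time")


for name, check in [("offline", offline_check), ("hung server", hung_check)]:
    start = time.monotonic()
    allowed = may_launch(check, timeout_s=0.2)
    print(f"{name}: allowed={allowed}, waited {time.monotonic() - start:.1f}s")
```

Both cases end with the launch being allowed, but only the second one makes the user wait for the full timeout, which is why the length of that timeout matters so much when the server is reachable but unresponsive.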
As always, the question of security comes up as well. We’ve just learned that ocsp.apple.com is a weak link in the normal functioning of macOS. It’s obviously not a single overworked server under someone’s desk—the entire point of using the Akamai CDN is to make it possible to handle massive amounts of traffic—but I assume that malicious actors are investigating how to launch a denial-of-service attack against ocsp.apple.com.
There may also be some privacy implications, since the checks to ocsp.apple.com whenever you launch a non-Apple app could reveal information about you to someone who could access your network. That seems a little overblown to me, since someone who can access your network has a lot more than OCSP traffic to work with. It doesn’t appear that Apple’s OCSP checks use OCSP stapling, which would address those privacy concerns.
Some people have suggested using something like a Pi-hole to block ocsp.apple.com entirely. (You could use Little Snitch for that in versions of macOS prior to Big Sur, but as security researcher Patrick Wardle pointed out, trustd is one of the Apple apps whose traffic Little Snitch can no longer block—see “Apple Hides Traffic of Some of Its Own Apps in Big Sur,” 22 October 2020.) Blocking ocsp.apple.com seems like a bad idea because you would be vulnerable to any malware that Apple discovered and addressed by revoking its developer certificate. Apple runs many hosts that modern Macs must be able to contact at particular times for certain operations.
In the end, it’s hard to avoid feeling a little less confident in the Mac. I honestly believe this was a rare error on the part of Apple’s network operations staff, such that we’re extremely unlikely to ever suffer from it again. I also anticipate that Apple will be taking steps within macOS to prevent similar situations from occurring in the future and to address the concerns that this situation raised.
In fact, since I initially published this article, Apple updated its “Safely open apps on your Mac” support page with this text:
macOS has been designed to keep users and their data safe while respecting their privacy.
Gatekeeper performs online checks to verify if an app contains known malware and whether the developer’s signing certificate is revoked. We have never combined data from these checks with information about Apple users or their devices. We do not use data from these checks to learn what individual users are launching or running on their devices.
Notarization checks if the app contains known malware using an encrypted connection that is resilient to server failures.
These security checks have never included the user’s Apple ID or the identity of their device. To further protect privacy, we have stopped logging IP addresses associated with Developer ID certificate checks, and we will ensure that any collected IP addresses are removed from logs.
In addition, over the next year we will introduce several changes to our security checks:
- A new encrypted protocol for Developer ID certificate revocation checks
- Strong protections against server failure
- A new preference for users to opt out of these security protections
Those changes are all positive, and while it’s too bad that Apple failed to institute them proactively before this situation, I think this is mostly an indication of how hard security is. There’s certainly no conspiracy on Apple’s part—the company is only hurt when its actions detract from its pro-privacy stance.
Regardless, the fact that an Apple mistake could render Macs in general nearly useless shows just how interwoven our modern lives are with corporations like Apple. Not that it’s going to happen, or that there’s any realistic alternative, but if Apple were to disappear, our devices almost certainly wouldn’t continue to operate at their full capability.
Well, I think in addition to the massive PITA that this caused when everything started spinning at 3:25 PM EST…my RAID actually sustained damage to one of its four drives.
I purchased a Thunderbay Mini RAID from OWC in April to handle my suddenly-substantial video storage needs. It is configured as RAID-5, so parity information is spread across all four drives and the contents of any one drive can be rebuilt from the other three, at the cost of 2 TB of space but with the benefit of hot-swappable drives.
The 4 2TB Toshiba drives, preconfigured and sold by OWC with the enclosure, each have about 5,100 hours of use on them. When my iMac started acting like molasses, I took almost the same actions that Adam and Josh did. I also took the RAID offline, along with another backup drive that is still running Time Machine.
When I brought it back up at 4:34 p.m. after everything started working again, the SoftRAID utility began flashing warnings that one of the drives failed a SMART test and is “20 to 60 times more likely to fail in the next 2 to 6 months”. According to the expanded info window, there are now “16 unreliable sectors” on the drive.
So this cost me time in the middle of getting ready for a 65-attendee Zoom meeting for which I was the technician. And it will cost somebody money (maybe me, maybe OWC, maybe Toshiba) to fix the drive. I have the logs that pinpoint normal operation right before the server outage, and the disk issues when it came back up.
Well, it affected Safari and Mail on my MBP16 running Catalina. They would eventually start but it took several minutes and I didn’t help by canceling and rebooting in an attempt to fix the problem. I kind of suspected it was Apple (I checked system status) but I didn’t see any problems with Safari or Mail. Everything was back in about 20 minutes.
Yes, some official acknowledgment from Apple would be nice!
This was just totally unacceptable. What is Apple doing? I’ve seen so many blunders over these past 6 months. At first I thought it was a malware attack of some sort. No problems with Safari or Apple Mail but Firefox bouncing in the dock like a rubber ball. Numerous restarts. Two hours of no productivity, all because of “trustd” (OCSP) apparently. Our Macs have to ‘turn off’ (for lack of a better description) because we are that wired to Apple?
8am Friday 13 November 2020 in Sydney Australia. a.k.a. International Verify Your Backups Day - TidBITS
Aha! So, in other words, Apple DID release Big Sur on Friday the 13th, at least in Australia.
And look what it got them.
I am so happy I had an early night! Live in Germany, so all this HooHa seems to have gone down while I slept; by the time I got up it was fixed & didn’t even know it happened until right now!
Unfortunately this isn’t the first time — it’s been happening sporadically for years. For example:
Hm. iPadOS too, I think…same day, around 9:30pm my iPad spontaneously crashed. When it came back up, I couldn’t log into my WiFi Router, Twitter, Slack, Messages, or FaceTime. I thought my iCloud account had been hacked or something. I found a recent note at Apple Support about “error activating Message or Facetime” which seemed close to the issue I was having. (No explanation for Twitter or Slack - I don’t use Apple iCloud logins for either of those…) It took about 2 hours of fiddling and toggling Messages & FaceTime, and re-logging into Slack and Twitter, and eventually I got everything back to normal. I don’t believe in coincidences, and am pretty certain my issue is related to the Mac issues. I think Apple either had a serious failure as described in the article, or suffered some kind of attack.
Thanks so much for this story!
I wasted a ton of time on this and also suspected drive failures.
I got messages suggesting I wasn’t connected to the internet.
Software update wouldn’t work, suggesting I might be managed by an MDM, which made me think I was hacked.
Yes they absolutely should fail more promptly and gracefully. Poor exception handling in their software. Classic lazy developer code, never planning for the outage condition. I constantly remind my guys to code for this.
I don’t understand how an application failing to launch on your Mac could cause damage to your external raid drives.
I don’t love this source and the commentary feels like some “both sides-ism” covering up for not fully understanding the issues but it does a good job collecting a lot of links to primary and secondary sources covering this outage and the privacy angle of gatekeeper. I would add This one too, though.
I’ve updated the article to include Apple’s response:
Careful of what we ask for. Redirecting OCSP means that if someone does sign malicious code, it gets detected, and Apple does the right thing by revoking their certificate, you will not know about it and will happily execute that signed code. OCSP is “slow” because it adds a network round-trip and is vulnerable to the OCSP server suffering a DoS attack. Note that from OCSP’s perspective, the known bug is that an OCSP request will time out and allow a revoked certificate to look good, not that good code will not run.
I would say it’s OK to do a temporary redirect for the OCSP server as outlined in the article as a temporary hack to route around a failed server. However, put it back the moment the server is back.
I’m sorry, I must be missing something. Did this problem only affect Apple developers, or did it affect regular old Mac users? AFAIK my home Mac (running 10.13.16) does not communicate with any Apple servers for anything unless I go to the website or launch the App Store. I don’t need to be connected for any cloud services and purposely stay signed out of anything I’m not using. I’m pretty sure I can run any local app I want regardless of my internet connection (and that’s how I like it).
Was this article about me, or only about certain kinds of users, or about cloud services?
So, Safari on my iMac stopped working and is still not even launching after several days. I’ve tried unplugging the internet and rebooting my computer, and I have no idea how to proceed. I use my Mac in my business daily and although I am using a different browser to get by, a lot of information is in Safari that makes my work easier and faster. I know that you all say that it’s working again, but not on my computer. I am “dead in the water”. Suggestions welcome.
It applied to all macOS Mojave and above users who needed to launch an application during the six hour period that servers were overloaded.
Then you are quite misinformed. macOS has needed to contact Apple servers for time of day as long as I can remember. Depending on whether you have disabled some settings, it will check for regular and background-critical software updates periodically during the day. Game Center contacts Apple servers periodically, even if you never play any of its games. iTunes uses Apple servers for multiple purposes. I would have to guess your Mac contacts Apple dozens of times an hour or more for a variety of other reasons.
The article explains how it contacts an Apple server the first time you launch an application signed with an Apple Developer ID, which means all Apple apps, all App Store apps, and most other 3rd party apps these days, and will repeat doing so every 12 hours now (was five minutes).
If you are able to launch all other apps, then your issue with Safari is completely unrelated to this issue. Contacting AppleCare would be my recommendation.
Ahh, good point about NTP, I do have my computer set for the Apple time servers (though there are other options). But I really wonder about it contacting other Apple servers all the time. I’m a veteran MacOS user but dislike the rest of the Apple “universe,” so I don’t use any of the standard OS apps (except Preview and sometimes iTunes). No Safari, Mail, Calendar, Facetime, Siri, etc. Guess if I really want to know I can do some monitoring of my network activity. Not that it particularly matters. But I was surprised to see so much anxiety about that server outage!
Most of the contacts with Apple servers are from macOS processes, not their apps.
Yes, that has proven to be something of an over-reaction from many users and some misinformation from a self-described security expert, but also a wake-up call to Apple over privacy concerns they appear not to have considered. I personally went through the entire 6 hour period without noticing it, probably because all the apps I was using were already open before the problems started. I was more focused on problems trying to download the Big Sur installer that appears to have been the root cause of this other issue.
For the less tech savvy TidBITS members, could you take a few minutes to explain what you mean by “apple network” (as opposed to iCloud etc), and how central problems with Apple can affect individuals working with stand-alone macs ?
thank you !
This happened in the middle of our 4-day cycle of sending pages to our printer (we are a magazine publisher). A few days later I took advantage of what happened and suggested to our Editor that perhaps we shouldn’t keep allowing work to be done later and later. We are now always past deadline if you take our old deadlines that everything should be completed the day before it ships. I just said we now have so little slack that something like this could lead to us missing a printer’s deadline. Just making Lemonade from this lemon.
I also used our VPN failing a couple of days after deadline as an example.
It will require a bit of hand waving, but in essence, Apple maintains an extremely complex and distributed network infrastructure (and iCloud is part and parcel of it). This infrastructure supports everything your Mac does that involves Apple, which might include checking for updates, checking the time, verifying certificate revocations, syncing data via iCloud, distributing two-factor authentication data, and much, much more. You can use a Mac offline for a while, but what you can do with it will be somewhat limited, and after some period of time, it will probably want to connect again. That’s not really a major problem, though, since most of what most people want to do with their Macs involves online access.
We might think of this as “Apple’s network” or a particular service as running on an “Apple server,” but that’s far from the truth. Apple operates multiple data centers around the world and has systems that spread the load from all this traffic among those data centers and other content delivery networks like Akamai.
I hope that helps a little—there’s no way to be more specific without having inside knowledge of Apple’s network setup and operations, and the number of ways that a Mac might not function as expected if it were kept entirely offline for a long time is quite substantive.
VERY interesting. I am frankly astounded. Thanks very much @ace for your time and explanation.
Last question and I will leave you in peace. Is this a potential source of hacking? Could someone basically “cripple” all Macs, which seems incredible?
Of course anything is possible, but there are multiple layers of protection against such an outcome. Most all of these Apple networks are secured with three layers of certificates with the purpose of guaranteeing that you are connecting to the intended server operated by Apple or one of their trusted vendors. One or more of those certificates would be immediately revoked as soon as any issue involving them was discovered. And such servers are located in physically secured facilities.
There was just an example of this system crippling Apple users with HP printers when HP mistakenly asked Apple to revoke certificates of all its print drivers. Although Apple was able to fix that issue rather quickly, some users are still trying to recover from it.
And, of course, this discussion points out another example of how things can go wrong, even when security has not been compromised. Not a complete crippling, but many complaints from users during the six hours or so that productivity was impacted.
thank you for a very instructive answer ! greatly appreciated
@alvarnell’s answer is spot on, and my only addition is that Apple’s network infrastructure is among the most complex on the planet, and Apple has undoubtedly hired top network and security people to oversee it at all times. That’s not to say that mistakes like this one can’t happen—it’s difficult to predict failure cascades in situations of unprecedented load, or it could have been related to a hardware or network connectivity failure that also cascaded to other systems.
But you have to assume that Apple has one of the largest targets in the world painted on its back, so Apple systems are almost certainly under constant probing/attack from everyone from individual hackers to organized crime to government intelligence agencies (possibly even including ours). I don’t know anyone at Apple in that area, but I do know someone who used to work for Google in such a role, and my understanding is that it’s a non-stop battle to protect systems against such attacks.
So, as Al said, nothing’s impossible, but it’s not like Apple is a babe in the woods here.
You and me both.
I suspect the drive was in the process of writing when the failure began, and the drive in bay 3 was the unfortunate victim of a timing anomaly.
Who would know for sure? BUT: The SoftRAID logs bear out that it happened at the exact time that Apple’s service went down. Very thankful that it was just one drive.
OWC is replacing that drive. I shipped it off Wednesday, and am crossing my fingers that the three remaining drives in the RAID unit all do perform well over the next 36 hours while I edit this weekend’s program. RAID 5 allows one drive to be pulled, and the remaining three contain enough information to recreate what was on the missing drive. But it can’t do that if another drive fails in the meantime.
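The idea that three drives “contain enough information to recreate what was on the missing drive” comes down to XOR parity. Here is a toy Python sketch of one stripe, assuming the textbook RAID 5 scheme; it is not SoftRAID’s actual on-disk layout, and the four-byte blocks are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a four-drive array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pull the second drive: its block is the XOR of every block that remains.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'BBBB'
```

With two blocks missing from the same stripe, the XOR of what is left no longer isolates either one, which is the “it can’t do that if another drive fails” case.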
My redundancy move will be to purchase a couple of spares…better quality and larger capacity, to support an eventual upgrade to the unit. Adding a larger capacity drive will not actually expand the capacity of the array until all four drives are replaced, so I’ll do that over time.
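The capacity arithmetic behind that plan can be written out as a short Python sketch. It assumes the conventional RAID 5 rule (each member counts only as much as the smallest drive, minus one drive’s worth for parity); the 4 TB replacement size is a made-up example, and a particular RAID product may behave differently.

```python
def raid5_usable_tb(drive_sizes_tb: list[float]) -> float:
    """Usable capacity of a conventional RAID 5 set, in the drives' units."""
    if len(drive_sizes_tb) < 3:
        raise ValueError("RAID 5 needs at least three drives")
    # Every member contributes only as much as the smallest drive,
    # and one drive's worth of space goes to parity.
    return (len(drive_sizes_tb) - 1) * min(drive_sizes_tb)


print(raid5_usable_tb([2, 2, 2, 2]))  # 6  (the current array)
print(raid5_usable_tb([4, 2, 2, 2]))  # 6  (one larger drive changes nothing)
print(raid5_usable_tb([4, 4, 4, 4]))  # 12 (all four replaced)
```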
I’m also looking at what category of files on the array could be considered archival, or reusable resources. Most of them are critical on the weekend they are produced, then their value is gone because in my work we don’t do re-runs.
Sounds like the “just-in-time” framework that spread from manufacturing to…well, everything. When applied to a parts inventory for, say, a vehicle assembly line, it was said to be very efficient. But as soon as something goes wrong (say, one part is delayed in delivery), the whole line is crippled.
I spent some time in newspaper production, in both editorial and graphics. Some deadlines could be bent, a bit, if it was important, but there was a big cost for that.
Good for you in seeing this unfortunate event as an opportunity.
So I can’t help wondering if Apple, trying to fix some underlying problem from November 12, has caused a new problem. Because apparently a lot of people who downloaded operating system installers — High Sierra and above – are having trouble getting App Store apps to launch. I just upgraded a client’s laptop from Yosemite to High Sierra and now every App Store app throws up a codesigning error. For an example, see Apps from Mac App Store crash when starte… - Apple Community
Code signing would seem to be involved here, but I doubt it’s related to the OCSP issues discussed in this article. There’s a thread on it at:
Howard Oakley talks about the history of Apple’s code signing approach here: