At some point on Tuesday, 07-Oct-08, our main Web server lost its mind. The scripts that drive our site went out of control, and the server became somewhat unresponsive. We had been Slashdotted as a result of Rich Mogull's excellent "Peering Inside a Mobile Phone Network" article, but the load simply wasn't that high, and the server had handled previous Slashdottings with aplomb.
We were slightly out of date with the most recent updates and thought perhaps updating from Mac OS X Server 10.5.4 to 10.5.5 might solve the problem. Instead, it rendered our Intel-based Xserve inoperable. The updates failed with an odd message about the root directory (/) being unwritable, and forced us to restart. Unfortunately, the installer had left the machine in an even more dysfunctional state, and it didn't come back up. Our co-location host, digital.forest, rebooted it by hand, and it would boot only so far as the Apple logo before restarting automatically. The same problem happened when trying to boot into single-user mode. Not good. The techs at digital.forest then booted from a DVD and ran Disk Utility, which apparently found and fixed some errors, but nothing that solved the boot problem.
No worries, we thought. Adam had set up two separate backup systems: Time Machine to a second drive in the Xserve, and CrashPlan (see "CrashPlan: Backups Revisited," 2007-02-26) to a separate server in Adam's home.
Starting the Recovery Process -- You know that saying about confidence going before a fall? Well, after booting from the Leopard Server DVD, digital.forest did what seemed like a sensible thing and restored our most recent Time Machine backup, which dated from 11:50 AM of that day. The machine came back up, and we were able to get back in via Apple Remote Desktop. It took a little fussing to get the Web server turned on again, and then, worryingly, our site wasn't showing any articles later than 30-Sep-08. Indeed, when we looked at the databases, they were from that date, despite Time Machine's claim that the most recent backup was from that day.
We were starting to get nervous, but we had another ace in the hole. Adam restored from his local CrashPlan backup, uploaded the files we needed, and I took a look at what he uploaded for me. Again, no love, but for a different reason. CrashPlan had more recent copies of many necessary files, but also had tons of restore failure messages - noted in the file names - scattered throughout the retrieved directories. It was a mess, but after some panicked scanning of the CrashPlan backups, I realized that the main database files we needed were present and intact. I replaced the Time Machine versions of the database with the later CrashPlan versions, checked their consistency with MySQL's command-line tools, and brought the system back up.
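For anyone in a similar spot, this kind of consistency check can be done with MySQL's bundled command-line tools. A hedged sketch follows; the database name and data directory are hypothetical examples, not our actual paths:

```shell
# Check every table in every database on a running MySQL server
# (prompts for the MySQL root password):
mysqlcheck --check --all-databases -u root -p

# For MyISAM tables, myisamchk can examine the index files directly
# while the MySQL server is stopped; the path is a placeholder:
myisamchk --check /var/mysql/example_db/*.MYI
```

mysqlcheck can also attempt fixes with --repair, though it's prudent to set aside a copy of the raw database files before letting anything rewrite them.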
And with a great sigh of relief, we both started to breathe again, because the CrashPlan backups were from the morning of that day, so we'd lost almost nothing in our main TidBITS article database. We weren't able to recover later versions of some other databases, but those were far less important, and losing a week's worth of stats and changes wasn't a problem.
With a bit more elbow grease, all made more difficult by my being on vacation on Mt. Desert Island in Maine (but with a high-speed DSL line in our lovely rental house), Adam and I managed to get back to the status quo ante, having to repost only a few stories manually.
Debriefing and Navel Gazing -- Despite having a previously reliable system, a reliable release of Leopard Server (10.5.4), and multiple backup systems, all supported by a capable support staff at a co-location facility, we were dead in the water for hours. That's a bad thing, and generally depressing.
I think something went wrong starting on 30-Sep-08, and whatever it was, the problem grew gradually worse until the day of the failure. This putative problem affected the system itself, compromised both backup packages' ability to create archives, and kept them from notifying us that backups weren't being made properly.
Adam doesn't buy my theory about there being some sort of creeping corruption starting on 30-Sep-08, and he points out that when he browsed backward in Time Machine, looking at the directory that holds the main databases, Time Machine had only four widely separated backups: 30-Sep-08, 09-Sep-08, 08-Sep-08, and 27-Jul-08. Some other files (in ~/Library/Preferences) that change every day were backed up regularly. The implication is that Time Machine backed up these critical files only four times in about 70 days. At hourly backups, that's roughly 1,680 chances to back them up. The files may not have changed every hour, but even if they changed only once per day, that's still abysmally poor performance on Time Machine's part. Most disappointing is that it in no way informed us of its criminally negligent behavior.
If you're using Time Machine as your primary backup method, especially on a Leopard Server machine, stop reading this article right now, find a frequently modified important file in the Finder, go into Time Machine, and flip back through the changes, making sure that you have regular backups and not the sporadic nonsense that we had.
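If you'd rather script that spot check than flip through the Time Machine interface, a rough sketch along these lines can list which snapshots actually contain a given file, so gaps stand out at a glance. It assumes Time Machine's standard Backups.backupdb on-disk layout; the volume, machine, and file names in the comments are examples only:

```shell
#!/bin/sh
# List every Time Machine snapshot that contains a given file.
# Arguments:
#   $1 - path to the machine's folder inside Backups.backupdb
#        (e.g. /Volumes/Backup/Backups.backupdb/MyServer)
#   $2 - path to the file, relative to a snapshot folder
#        (e.g. "Server HD/Library/MySQL/example.MYD")
list_tm_versions() {
  backupdb=$1
  relpath=$2
  for snap in "$backupdb"/*/; do
    # Skip the "Latest" symlink so the newest snapshot isn't counted twice.
    case $snap in */Latest/) continue ;; esac
    # Each snapshot folder is named by date, e.g. 2008-09-30-115000;
    # print the name if this snapshot holds a copy of the file.
    if [ -e "$snap$relpath" ]; then
      basename "$snap"
    fi
  done
}
```

If the dates it prints for a frequently modified file are days or weeks apart rather than hours, your backups have the same gaps ours did.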
Although it ultimately saved the day, CrashPlan also let us down in a big way. It had been backing up files from our server successfully for several months with no problems. However, as soon as I tried to restore files, it threw thousands of integrity check errors, so many that it was impossible to sort through them in the logs. Again, backup software should raise the alarm at backup time if there are problems, so the user knows about them before trying to restore.
We have plans for adding more redundancy, and we'll carry out those tasks in the near future. Cloning the main drive to the second internal hard disk, in addition to backing it up in two separate ways, might have allowed an immediate resumption of service. (That's how our other server is set up, and that approach saved our bacon on one occasion.)
Some folks would recommend a striped and mirrored multiple-disk array (RAID), which would give us increased speed, less chance of failure, and greater restore capabilities (though that's in addition to backups, not a replacement for them). In the past, a RAID hasn't been necessary because TidBITS feeds out mostly static information. As we move to a more dynamic site design that involves significant communication with and among our readers, that has to change, too.
This is the kind of thing that drives grown men and women to drink. In the middle of all this, while we waited for something to happen, I sat down with my wife, two children, and six in-laws for some delicious soup made by my wife. And a beer. And around 9 PM, Adam ultimately had dinner and very possibly something stronger to drink. Like the captain being the last one off a sinking ship, the publisher eats only after the server is back up and running.