Elements or Lower

Fri, 22 Oct 2004

Server Woe

I don’t want to get excessively melodramatic about this, but yesterday really, badly, sucked.

About three days ago, my server suffered a kernel panic and crashed. One phone call to the managed hosting company, Rackspace, and it was promptly brought back up. At the time, it seemed like an isolated incident, and there wasn’t anything to suggest that there would be further problems down the road. We kept an eye on it anyway.

Then, the weird errors started. Bits of software that had worked unmolested for years began sporadically to refuse to compile. And another crash.

Then another in the early hours of yesterday morning. And another. By this point, the MySQL database logging traffic to woking.gov.uk refused to auto_increment. The search engine on the site stopped working because it couldn’t cache the results any more. I couldn’t download a local copy of a gzipped mysqldump of the database because the download would crash my FTP client. And still more crashes.

Rackspace had already replaced the memory and the processor on the machine just in case. But it was becoming pretty clear that the filesystem, or the hard disk, or both, was breaking apart before our eyes.

As the day progressed, this became more and more horrible to watch. Trying to retrieve files from the server became like fishing bits of imprudently-dunked biscuit from a cup of tea. We had to migrate to a new server, and sharpish.

Now, the server has a certain amount of custom configuration on it, and generally speaking, setting it all up in the first place was something of a headache. There’s also about half of CPAN on there. And I was growing worried that we’d never get the MySQL data out of there intact.

The server does have a tape backup, but it’s a fairly rudimentary setup whereby the drive is backed up nightly to the same tape. No rotation. So, if the filesystem was screwed, the backup might have been as well.

As it turned out, a member of the Rackspace support team in San Antonio named Rich is an absolute miracle worker. The drive was ripped out of the old server, and added to a mount point on the new. Rich then set about recreating everything for me straight from the old drive. Somehow, by midnight, he’d got it working. I’d gone from despair to jubilation in about three hours and a bunch of phone calls to the States. There are times when the relentless optimism of our American cousins is just what one needs.

Alan in the UK also deserves his dues for sticking with the support ticket and being generally very helpful. But Rich is my new hero.

I’m aware, of course, that I’m in danger of making this a giant advert for Rackspace. But I don’t care, really. I’ve had frankly appalling service from other hosting providers before, and consequently I’m now firmly of the opinion that you gets what you pays for. If single-minded attention to your case in an emergency floats your boat, I know who to recommend.

Of course, there are lessons from this. We’re investigating investment in a resilient, multi-server, load-balanced affair for Woking now; and there are better backup solutions than tape that really weren’t around when I started all this. But for now, I’m just counting my blessings that we’re still here at all.