No Fun At Work
At 5:34 yesterday we had a momentary power failure that brought down a large bunch of machines at work. A guy who was there called me to come in and see what I could do to get things up and running again.
Most of the time the process is smooth. Not so this time. Our main database didn’t come back up. It rebooted, but got stuck in the middle. I rebooted it again, and this time it kernel panicked. I put in a boot CD and attempted to rebuild the RAID arrays, but no dice.
After some investigation it seems as though two of the three RAID drives went bad. The machine was built to withstand one drive going bad, but not two. I actually had a fourth hard drive in the machine to bring it up to a four-drive RAID system, which means it could have withstood two failures, but I never got around to implementing it. I had completely forgotten about it. Don’t get me started on why.
This morning I rebuilt the machine and reinstalled the databases. I do nightly backups, so I lost a day’s worth of data. Not the greatest, but livable. By 11am I had both database programs reinstalled and the data functioning again.
This computer is also the source code repository machine that I use to contain my programs. I had to reinstall that as well, and do some fiddling with backups to get it back functioning again. But, it wasn’t a big deal.
However, there was one other small piece of software that ran on this machine, and I couldn’t find any backups of it. That was teh suck, most definitely. Of the four test cells that needed to be running, two of them relied heavily on this small piece of software and I didn’t have a copy. So I spent all afternoon rewriting it. And I think I’m done, save for seeing how well my rewrite survives the abuse that it’s going to get.
The bigger dilemma is that I couldn’t just drop this new software into place. The issue is that this software is the server and there are a lot of clients throughout the facility. Most clients can run fine without the server; the ones that can’t I know about and can take care of. However, each of the clients has a piece of code in it that shuts the client off if it sees the server disconnect. This is a piece of safety code that, while inelegant, saves us from blowing things up (literally!) in the face of a computing disaster.
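For the curious, that kind of disconnect check boils down to something like the following. This is only a minimal Python sketch of the idea, not the actual client code: the names `watch_server` and `shut_down` are made up for illustration, and I'm assuming a plain TCP-style socket where an empty read means the server went away.

```python
import socket
import threading


def watch_server(sock, shut_down):
    """Block on the server connection and call shut_down() once it is lost.

    `sock` is an already-connected socket; `shut_down` is a hypothetical
    callback that puts the client into its safe, powered-off state.
    """
    try:
        while True:
            data = sock.recv(1024)
            if not data:
                # An empty read means the server closed the connection.
                break
            # Normal traffic from the server would be handled here.
    except OSError:
        # A reset or other socket error is treated the same as a disconnect.
        pass
    shut_down()


def start_watchdog(sock, shut_down):
    """Run the watcher in a background thread so the client keeps working."""
    thread = threading.Thread(
        target=watch_server, args=(sock, shut_down), daemon=True
    )
    thread.start()
    return thread
```

The blunt part is that any disconnect, including a deliberate restart of the server, looks exactly like a crash to the client.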
That is why I couldn’t just drop my new server into place: if I needed to restart it for any reason (say I found a bug in my code), it would shut all of the clients down. So I had to put the server in a new place, and now I have to go to each individual client and tell it where to find the new server. That process will take a few days.
At least I got my “facility downtime” that I’ve been asking for for so long now. Too bad it had to be under these circumstances, though.
December 29th, 2005 at 9:52 am
Are you all IT now or do you do any engineering tasks?