Recovering from a small-scale error.

For extra value, see also http://www.ee.ryerson.ca/~elf/hack/recovery.html
(“Unix Recovery Legend” from 1986)

Originally shared by Craig Ulmer

Earlier this week, a message similar to the one below sent me into a panic. I’d just finished bringing up a new cluster and was working on some scripts to help automate the process of bringing new nodes online. I added a few tweaks to the script, closed the file, and without thinking, ran the script to make sure the edits worked. What I forgot was that I was on the cluster’s admin node instead of one of the backend nodes, and that the first thing the script did was use dd to zero out the first 1 MB of /dev/sda. I saw dd’s “1+0 records” lines and realized immediately how I’d just screwed myself over: the cluster was dead.

Erasing the first MB of a hard drive wipes the partition table and boot code (GRUB), and is sometimes necessary when you want to force a disk reformat on a stubborn node. While you don’t lose data, you do lose the basic layout of the disk, which makes it useless. Interestingly, you don’t lose the system immediately: the kernel still knows where everything is, so it’s easy for this to happen and for an admin not to notice until he or she reboots the node some time later. At that point, the BIOS says it doesn’t have a usable drive and there’s no info (afaik) that would help you figure out the partition layout. In other words, it’s toast.
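To make concrete what lives in that first sector, here is a minimal Python sketch that decodes a classic (DOS-style) MBR: a 2-byte 0x55AA signature at offset 510 and four 16-byte partition entries starting at offset 446. The function name parse_mbr and the returned dictionary keys are made up for illustration, and the sketch assumes a plain MBR disk, not GPT.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446  # the four partition entries start here
# One entry: status, CHS start (3 bytes), type, CHS end (3 bytes),
# starting LBA, sector count (all little-endian).
ENTRY = struct.Struct("<B3sB3sII")


def parse_mbr(sector: bytes):
    """Return (has_signature, partitions) for a 512-byte MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    has_signature = sector[510:512] == b"\x55\xaa"
    partitions = []
    for index in range(4):
        offset = TABLE_OFFSET + index * ENTRY.size
        status, _chs_a, ptype, _chs_b, start, count = ENTRY.unpack_from(sector, offset)
        if ptype == 0:  # type 0 marks an unused slot
            continue
        partitions.append({
            "slot": index + 1,
            "bootable": status == 0x80,
            "type": ptype,
            "start": start,    # in sectors
            "sectors": count,
        })
    return has_signature, partitions
```

A sector that dd has zeroed comes back as (False, []), which is exactly why the BIOS later finds nothing to boot. Reading the real sector would look something like open("/dev/sda", "rb").read(512), which needs root.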

Having done things like this before, I immediately started looking for info the kernel still had in memory that I could use to repair things. I found the start/size numbers for each partition in /sys/block/sda, and then used sfdisk to write the offsets back to disk. The change seemed to work, as I could see the partitions again and even mount them. Like a fool, I decided to reboot the machine and see if it worked.
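The sysfs step can be sketched roughly as follows. This is not the exact procedure used here, just one way to do it in Python: each partition directory under /sys/block/<disk> has partition, start, and size files (the latter two in 512-byte sectors), and those numbers can be turned into an sfdisk script. The helper names read_partitions and sfdisk_script are hypothetical, and the sketch assumes a DOS label with primary partitions only and a recent util-linux sfdisk. Partition types are not in sysfs, so they are left for sfdisk to default.

```python
from pathlib import Path


def read_partitions(disk: str, sysfs_root: str = "/sys/block"):
    """Return [(number, name, start, size)] from sysfs, sorted by number.

    start and size are in 512-byte sectors, as the kernel reports them.
    """
    parts = []
    for entry in (Path(sysfs_root) / disk).iterdir():
        # Only partition directories carry a 'partition' file.
        if not entry.is_dir() or not (entry / "partition").is_file():
            continue
        number = int((entry / "partition").read_text())
        start = int((entry / "start").read_text())
        size = int((entry / "size").read_text())
        parts.append((number, entry.name, start, size))
    parts.sort()
    return parts


def sfdisk_script(parts, label: str = "dos") -> str:
    """Render the partitions as sfdisk script input (label is an assumption)."""
    lines = [f"label: {label}", "unit: sectors", ""]
    for _number, name, start, size in parts:
        lines.append(f"/dev/{name} : start={start}, size={size}")
    return "\n".join(lines) + "\n"
```

The idea is to print sfdisk_script(read_partitions("sda")), save it somewhere off the damaged disk, and read it over carefully before piping it into sfdisk /dev/sda. Extended and logical partitions would need extra care that this sketch does not attempt.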

It didn’t, of course, because the master boot record (MBR) still lacked boot code (e.g., GRUB). If this had been a desktop machine, I’d have booted some repair ISO and fixed the problem. Unfortunately, this was a server sitting 1,000 miles away in a server room where the admins had already gone home. The one thing I had going for me was that we had both a working backend net and IPMI control over the node. That meant I could get into the admin node’s BIOS and change it to PXE boot over the private net.

I set up some admin services on a neighboring login node and told the admin node to PXE boot a kernel with an initrd that dropped to a shell. From that shell I verified the boot code was missing, but I didn’t have any tools to do much about it. I cloned the MBR of the login node, pulled it over, and dropped it on the admin node’s MBR. I still didn’t have the right partition table, though, and no sfdisk, so I had to boot the admin node with a full Linux image that would normally be used by the backend nodes. Once that booted, I updated the admin node’s partition table and made some UUID repairs to things I messed up along the way. I rebooted the admin node and voila, it booted fine.
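Cloning a whole MBR from another machine also drags along that machine's partition table, which is why the table had to be fixed again afterwards. A gentler variant, sketched below in Python, copies only the bootstrap code (the first 440 bytes of the sector) from a donor and keeps everything from byte 440 onward (disk signature, partition table, 0x55AA marker) from the target. The function names are hypothetical, and the sketch assumes both machines have the same boot loader installed the same way, which is not guaranteed in general.

```python
SECTOR_SIZE = 512
BOOT_CODE_LEN = 440  # bootstrap code area; disk signature and table follow


def boot_code_missing(sector: bytes) -> bool:
    """True if the bootstrap code area is entirely zero bytes."""
    return not any(sector[:BOOT_CODE_LEN])


def splice_boot_code(donor: bytes, target: bytes) -> bytes:
    """Combine the donor's boot code with the target's signature and table."""
    if len(donor) < SECTOR_SIZE or len(target) < SECTOR_SIZE:
        raise ValueError("both sectors must be at least 512 bytes")
    return donor[:BOOT_CODE_LEN] + target[BOOT_CODE_LEN:SECTOR_SIZE]
```

Both functions work on bytes already read from the disks, so nothing is written until you decide the result looks right and copy the 512-byte output back to sector 0 yourself.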

It’s extremely satisfying to have worked through this kind of repair without ever having to go into the server room. At the same time, it’s a bit scary what one can do with access to the right service hooks. Maybe next time I’ll just try to avoid doing stupid stuff.
https://lh5.googleusercontent.com/-gFPy8taGWeg/UlmFWNNvfEI/AAAAAAAAHGM/6HTlotwih3Q/s0/floormat.png