Rebuilding Software RAID 1 refused to boot

October 30th, 2015 by

Dear LazyWeb,

Yesterday we did a routine disk replacement on a machine with software RAID. It has two mirrored disks, sda and sdb with a RAID 1 partition with software RAID, /dev/md1 mirrored across /dev/sda3 and /dev/sdb3. We took the machine offline and replaced /dev/sda. In netboot recovery mode we set up the partition table on /dev/sda, then set the array off rebuilding as normal:

mdadm --manage /dev/md1 --add /dev/sda3

This expects to take around three hours to complete, so we told the machine too boot up normally and rebuild in the background while being operational. This failed – during bootup in the initrd, the kernel (Debian 3.16) was bringing up the array with /dev/sda3, but not /dev/sdb3, claiming it didn’t have enough disks to start the array and refusing to boot.

Within the initrd if I did:

mdadm --assemble /dev/md1 /dev/sda3 /dev/sdb3

the array refused to start claiming that it didn’t have sufficient disks to bring itself online, but if I did:

mdadm --assemble /dev/md1 /dev/sdb3
mdadm --manage /dev/md1 --add /dev/sda3

within the initrd it would bring up the array and start it rebuilding.

Our netboot recovery environment (same kernel) meanwhile correctly identifies both disks, and leaves the array rebuilding happily.

To solve it we ended up leaving the machine to rebuild in the network recovery mode until the array was fully redundant at which point the machine booted without issue. This wasn’t an issue – it’s a member of a cluster so downtime wasn’t a problem – but in general it’s supposed to work better than that.

It’s the first time we’ve ever seen this happen and we’re short on suggestions as to why – we’ve done hundreds of software RAID1 disk swaps before and never seen this issue.

Answers or suggestions in an email or tweet.