Linux RAID is not my friend right now


For a while now, I’ve been using Linux software RAID 1 (2 drives, mirrored, so if one fails, you still have your data). The one thing I started to notice, though, was that Linux had no method to do ‘scrubbing’, or verifying that the data on both drives really is the same. NetApps have it, 3ware cards have it, any real professional RAID setup has it. Except Linux.

So every now and then I google for ‘linux raid scrub’, and see if someone’s done it. Thursday, I discover that yes, someone has! They added a user-requested data verify to kernel 2.6.18; as root, you run echo check > /sys/block/mdX/md/sync_action (with mdX being your array, e.g. md0), and it starts a scrub.
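For reference, here is a rough sketch of how that scrub could be wrapped in a couple of shell functions. The function names and the MD_SYSFS variable are my own inventions (MD_SYSFS only exists so the functions can be pointed at a scratch directory instead of the real /sys/block), and I'm assuming the kernel exposes md/mismatch_cnt next to md/sync_action; if it doesn't, the count is simply reported as unknown.

```shell
#!/bin/sh
# Sketch only: start an md scrub and report its state.
# MD_SYSFS is a hypothetical override for testing; it defaults to /sys/block.

start_scrub() {
    md_dir="${MD_SYSFS:-/sys/block}/$1/md"
    if [ ! -w "$md_dir/sync_action" ]; then
        echo "cannot write $md_dir/sync_action (not root, or no scrub support?)" >&2
        return 1
    fi
    # Writing "check" asks the md driver to read and compare all copies.
    echo check > "$md_dir/sync_action"
}

scrub_status() {
    md_dir="${MD_SYSFS:-/sys/block}/$1/md"
    action=$(cat "$md_dir/sync_action")
    # mismatch_cnt may be missing on some kernels; fall back to "unknown".
    mismatches=$(cat "$md_dir/mismatch_cnt" 2>/dev/null || echo unknown)
    echo "$1: action=$action mismatch_cnt=$mismatches"
}
```

Usage would be something like start_scrub md0, then scrub_status md0 every so often (or just cat /proc/mdstat) until the action goes back to idle.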

So I download the kernel, compile it (a daylong process), and Friday night I boot the new kernel on one of my systems and start the scrubbing. I’ve been wanting to do this for a while, since I suspected one of the drives was starting to fail.

The scrub takes a while; it’s a pair of 300GB drives. The next morning, I discover that one of the drives has been kicked out of the raid, so I replace it (had one on hand just in case) and add it to the mirror.

What’s supposed to happen next is that the kernel sees a new drive as member #2 of the raid set, and starts copying the data from the first drive to the second. What happened in this case was that the kernel added the drive and said everything was fine right away.

However, everything was NOT fine; when reading from a mirrored RAID set, reads can come from either drive, depending on which one has its heads closest to the data. One drive had the data. The other didn’t. Chaos ensued.

So after trying a few different things, I began to realize that RAID1 in this new kernel was broken. Horribly, horribly broken. I tried to manually start the sync, left it for the 6 hours that it took, and found that it still didn’t fix the problem (apparently it hadn’t actually synced any data; now that I think about it, that probably saved my butt, since it didn’t try to sync the bogus data on the new drive back to the old one).

So I remove the new drive from the raid set, boot back into the old kernel, and add the drive again. This time it does what it’s supposed to; now the drives are happy again.
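In hindsight, a quick sanity check right after adding a drive would have caught the problem sooner: if the array claims to be healthy without ever showing a rebuild in /proc/mdstat, something is off. Below is a minimal sketch of such a check. The function name and the optional second argument (an alternate mdstat path, there only for testing) are hypothetical, and it assumes the usual mdstat layout where a rebuilding array shows a "recovery =" or "resync =" progress line under its header.

```shell
#!/bin/sh
# Sketch only: report whether an md array shows a rebuild in mdstat.
# $1 = array name (e.g. md0); $2 = mdstat path (hypothetical override for testing).

rebuild_state() {
    array="$1"
    mdstat="${2:-/proc/mdstat}"
    awk -v name="$array" '
        # A header line looks like "md0 : active raid1 ..."; track whether it is ours.
        $2 == ":" { in_array = ($1 == name); next }
        # A blank line ends the block for the current array.
        NF == 0   { in_array = 0 }
        # A progress line for our array means a rebuild is pending or running.
        in_array && /(recovery|resync) *=/ { found = 1 }
        END { print (found ? "rebuilding" : "not-rebuilding") }
    ' "$mdstat"
}
```

So after something like mdadm /dev/md0 --add /dev/sdb1 (device names are just examples), rebuild_state md0 should say rebuilding for the next several hours. If it says not-rebuilding right away on a freshly added blank drive, don't trust the mirror.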

I went back to look at the release date for this supposed ‘stable’ kernel; turns out it was released Thursday. Not sure how that document explaining how to do a manual RAID sync (which doesn’t work) made it out there, but it did.

So now I’m back on an old kernel, with a newly rebuilt RAID set, and a few more grey hairs.

I dislike hardware RAID cards since I never know exactly what they’re doing with the drive; I like software RAID since in theory I can figure out what’s going on, look at the source, putz with it, break a mirror and mount the drive on its own, etc. (I got burned once by a cheap raid card that lost its config and hence the data on the drives.) And I can’t afford a NetApp, which seems to do RAID correctly.

I know the people working on Linux software RAID are doing the best they can with the time and resources available to them. But yeesh; this weekend was certainly Not Fun.