As a follow-up to my adventures with Linux RAID scrubbing (or lack thereof), I decided to poke around a bit more this weekend after a filesystem started throwing some errors.
It appears that someone did fix at least part of the issue I ran into (a memcpy() was left out of the kernel’s repair code), but I’m not planning on installing that kernel for a while, at least not without some serious testing, or until the fix shows up in a Red Hat/CentOS update kernel.
However, I did come up with something that may work as a very ghetto software RAID1 verification technique. (The following keywords should help someone google this post: linux software raid verify scrub oh shit.)
Here’s what you do. First, find the size of the mirror from /proc/mdstat:
md5 : active raid1 hdd1[1] hdb1[0]
58613056 blocks [2/2] [UU]
Multiply the number of blocks by 1024:
[root@linux] # bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
58613056*1024
60019769344
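If you'd rather not do the multiplication by hand, the shell can do it. This is only a sketch: md_data_bytes is a made-up helper name, and it assumes the block count sits on the line right after the array's name in /proc/mdstat (as in the output above) and that each block is 1024 bytes.

```shell
# Print the data size in bytes of an md array, given its name (e.g. md5).
# The optional second argument is the mdstat file to read; it defaults
# to /proc/mdstat. md_data_bytes is a hypothetical helper, not a real tool.
md_data_bytes() {
    blocks=$(awk -v dev="$1" '
        $1 == dev { found = 1; next }               # "md5 : active raid1 ..."
        found && $2 == "blocks" { print $1; exit }  # "58613056 blocks ..."
    ' "${2:-/proc/mdstat}")
    # Fail if the array was not listed.
    [ -n "$blocks" ] || return 1
    # Assumes 1024-byte blocks, as in the bc session above.
    echo $((blocks * 1024))
}
```

With the mdstat contents shown above, md_data_bytes md5 should print the same 60019769344 that bc came up with.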
Then, run cmp on the two devices that make up the mirror:
[root@linux] # cmp /dev/hdb1 /dev/hdd1
/dev/hdb1 /dev/hdd1 differ: byte 60019769345, line 246090365
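If the byte at which the two devices differ is higher than the number you came up with using bc, it means both mirrors contain the same data: cmp counts bytes starting from 1, so a first difference at byte 60019769345 means all 60019769344 bytes of the data area matched. (From what I can tell, the area past that is where the RAID metadata/superblock sits, at the end of the disk.)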
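The comparison above can be wrapped up so the shell does the "is it past the data area?" check for you. Treat this as a rough sketch rather than a finished tool: mirror_data_matches is a made-up name, and instead of parsing cmp's one-line message it takes the first line of cmp -l output (which lists differing bytes as "offset octal octal") and looks at the offset.

```shell
# Succeeds (exit 0) if the two mirror members have identical contents for
# the first DATA_BYTES bytes, i.e. the first difference (if any) lies past
# the md data area. mirror_data_matches is a hypothetical helper name.
# Usage: mirror_data_matches MEMBER1 MEMBER2 DATA_BYTES
mirror_data_matches() {
    # Bail out early if either member can't be read at all.
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "cannot read $1 or $2" >&2
        return 2
    fi
    # Only the first differing byte matters here, so stop after one line.
    first_diff=$(cmp -l "$1" "$2" | head -n 1 | awk '{ print $1 }')
    if [ -z "$first_diff" ]; then
        echo "no differences found"
        return 0
    fi
    # cmp numbers bytes from 1, so "greater than" means past the data area.
    if [ "$first_diff" -gt "$3" ]; then
        echo "first difference at byte $first_diff, past the data area"
        return 0
    fi
    echo "data differs at byte $first_diff"
    return 1
}
```

For the mirror above, that would be run as mirror_data_matches /dev/hdb1 /dev/hdd1 60019769344. Any read errors from cmp still go to stderr, which is part of the point of scrubbing.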
If the differ byte number is smaller, you can probably do a more extended test with cmp -l to find out what data differs and whether there are one or more differences. Not sure how to repair at that point; if you feel lucky, you might be able to do some kind of block editing (and guess the value that block should be), but I’m not about to try that part.
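For that more extended test, cmp -l prints one line per differing byte: the byte offset in decimal, followed by the two byte values in octal. A small sketch for counting how many of those differences fall inside the data area (count_data_diffs is a made-up name, and the third argument is the byte count worked out with bc earlier):

```shell
# Count the differing bytes that fall inside the md data area, ignoring
# anything past DATA_BYTES (the superblock region at the end).
# count_data_diffs is a hypothetical helper name.
# Usage: count_data_diffs MEMBER1 MEMBER2 DATA_BYTES
count_data_diffs() {
    cmp -l "$1" "$2" | awk -v limit="$3" '
        ($1 + 0) <= (limit + 0) { n++ }   # offset is within the data area
        END { printf "%.0f\n", n }        # prints 0 if nothing differed
    '
}
```

For example, count_data_diffs /dev/hdb1 /dev/hdd1 60019769344 would report how many data bytes disagree between the two halves of the mirror above. It says nothing about which copy is the right one.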
Part of the point of scrubbing is to read every byte of data from every disk and make sure there aren’t any read errors; if there are, it should throw a kernel error which shows up in logs, or with IDE might allow the drive firmware to reallocate a block that has a soft error in it (which will show up in smartd’s output).
Note that this will only work with RAID1; RAID5 lays out data differently, in stripes of data and parity, so you’d have to do parity calculations as well as figure out where they are. It could probably be done with some programming, but that’s left as an exercise for the reader }:>.
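To give a feel for the parity part only: RAID5 parity is the XOR of the data chunks in a stripe, so XORing the parity with the surviving chunks gives back the missing one. The toy below works on single byte values and is nowhere near a real verifier; xor_parity is a made-up name, and it makes no attempt to deal with chunk sizes or with where md actually puts the parity on each disk.

```shell
# XOR together any number of byte values (0-255) and print the result.
# xor_parity is a hypothetical helper for illustration only.
xor_parity() {
    parity=0
    for byte in "$@"; do
        parity=$((parity ^ byte))
    done
    echo "$parity"
}
```

Feeding it three data bytes gives the parity byte, and feeding it the parity plus any two of the data bytes gives back the third.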
So yeah, it’s really ghetto, but it appears to work. And now I don’t feel like I’m flying 100% blind and not knowing whether my mirrors are really mirrors. If I feel industrious, I’ll probably put this into a shell script and start running it weekly or something.
Hello.
It’s my first time here and I really like this page. I’m here because of some mistake in Google, but I’ve spent a long time here. Really good work.
Comment by Saif — March 19, 2007 @ 3:44 am
Linux: because rebooting is for adding new hardware.
Comment by Saif — March 19, 2007 @ 3:45 am
sweet. nifty test. I just tried it out on a box. I got bit by linux raid a few weeks ago when /dev/sda failed and /dev/sdb did not have the correct boot block on it. Luckily it was an intermittent failing and i was able to get grub installed ok.
there is a certain amount of chicken waving involved with pc hardware and linux raid. i miss sun and vxvm.
Comment by john — March 19, 2007 @ 11:38 am
[...] have been to reboot if that didn’t work, but I didn’t want to. (As Saif said in the previous post, rebooting is for adding new [...]
Pingback by Rant Things » mysterious error messages, part 1 — March 21, 2007 @ 10:50 pm