ghetto raid scrubbing with linux

As a follow up to my adventures with Linux RAID scrubbing (or lack thereof), I decided to poke around a bit more this weekend after a filesystem started throwing some errors.

It appears that someone did fix at least part of the issue I ran into — a memcpy() was left out of the repair kernel code — but I’m not planning on installing that kernel for a while. Not without some serious testing, or perhaps after it’s applied in a RedHat/CentOS update kernel.

However, I did come up with something that may work as a very ghetto software RAID1 verification technique. (The following keywords should help someone google this post: linux software raid verify scrub oh shit.)

Here’s what you do. First, find the size of the mirror from /proc/mdstat:

md5 : active raid1 hdd1[1] hdb1[0]
      58613056 blocks [2/2] [UU]

Multiply the number of blocks by 1024:

[root@linux] # bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
58613056*1024
60019769344

Then, run cmp on the two devices that make up the mirror:

[root@linux] #cmp /dev/hdb1 /dev/hdd1
/dev/hdb1 /dev/hdd1 differ: byte 60019769345, line 246090365

If the byte at which the two devices differ is a higher number than the one you came up with using bc, it means both mirrors contain the same data. (From what I can tell, that’s the area where the raid metadata/superblock sits, at the end of the disk.)

If the differ byte number is smaller, you can probably do a more extended test with cmp -l to find out what data differs and whether there are one or more differences. Not sure how to repair at that point; if you feel lucky, you might be able to do some kind of block editing (and guess the value that block should be), but I’m not about to try that part.

Part of the point of scrubbing is to read every byte of data from every disk and make sure there aren’t any read errors; if there are, it should throw a kernel error which shows up in logs, or with IDE might allow the drive firmware to reallocate a block that has a soft error in it (which will show up in smartd’s output).

Note that this will only work with RAID1; RAID5 lays out data differently, in stripes of data and parity, so you’d have to do parity calculations as well as figure out where they are. It could probably done with some programming, but that’s left as an exercise for the reader }:>.

So yeah, it’s really ghetto, but it appears to work. And now I don’t feel like I’m flying 100% blind and not knowing whether my mirrors are really mirrors. If I feel industrious, I’ll probably put this into a shell script and start running it weekly or something.

4 thoughts on “ghetto raid scrubbing with linux”

  1. Hello.

    it’s my frist time here and really i like this page iam here becouse of some mistake in google but the time i do spend here is so long really good work

  2. sweet. nifty test. I just tried it out on a box. I got bit by linux raid a few weeks ago when /dev/sda failed and /dev/sdb did not have the correct boot block on it. Luckily it was an intermittent failing and i was able to get grub installed ok.

    there is a certain amount of chicken waving involved with pc hardware and linux raid. i miss sun and vxvm.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>