Methods for testing Linux Software RAID?
Posted by dkg on Mon 17 Jul 2006 at 21:31
I have several machines with software RAID (both RAID1 and RAID5 configurations), with fairly modern kernels. I want to test the RAID before anything bad happens to the machines for real. What methods do you use to test software RAID on your servers? How do you verify that the kernel will detect device failures and deal with them properly? I present some ideas below, but want to hear what other people do for verification/peace of mind.

There are a few possibilities i can think of, but from the experimenting i've done, they can give inconsistent results:

  • [use mdadm to soft-fail the device] Tell mdadm that a device has failed and watch how the RAID layer copes with device removal, notification, and rebuild. For example:

    mdadm /dev/md0 --fail /dev/sdg

    This is a generally repeatable procedure, and it behaves the same on most machines and RAID setups that i've tried (a fuller fail/remove/re-add cycle, including the monitoring side, is sketched after this list). However, it doesn't help me feel confident that my systems will properly detect actual hardware failures. If hardware trouble isn't detected promptly and reported back to the RAID layer as a failed component device, even the best post-detection behavior (which is what this step tests) is worthless.
  • [yanking disks] At the other end of the spectrum is physically yanking a disk out of its cage. Having a hotswap chassis makes this kind of test easier to do (though i'm not convinced it's entirely safe for the hardware to be subjected to this kind of abuse). However, i've seen inconsistent results for this, even with identical kernels.

    For example, an mdadm RAID5 setup with a 3ware SATA controller in JBOD mode (kernel module 3w_9xxx) running kernel 2.6.16 doesn't detect the device removal until something tries to access the md device itself. Only then is a report generated, and the array starts rebuilding onto the RAID5's hot spare (one way to provoke that access regularly is sketched after this list).

    This delayed detection isn't great, but it's a ton better than the behavior i see from another installation: an mdadm RAID1 setup with an Intel SATA controller (kernel module ahci) on the same kernel (2.6.16) reacts terribly to device removal. The whole system grinds nearly to a halt, many /dev/sdX error messages show up on the console, and the RAID subsystem doesn't even seem to notice for a long time (at least 5 minutes, i think, though i haven't done proper timing to verify this). Meanwhile, the machine becomes nearly unresponsive on the network, even for services which are not accessing the removed device. Interestingly, plugging the disk back in while the system is still in this bad state lets it pick up where it left off, and the episode doesn't even appear to register as a RAID failure.

  • [using hdparm] A middle ground between these two testing tactics might be to use hdparm or some similar tool to disable the disk itself and see what happens to the kernel, the RAID subsystem, and the rest of the machine. I don't know hdparm well enough yet to know what parameters to use to take a device down like that, but as i figure it out, i'll post it to the comments here, or update this weblog entry (a couple of untested candidates are sketched after this list in the meantime).
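
For concreteness, a full soft-fail test cycle with mdadm looks roughly like the sketch below. The device names are just the ones from the example above, so adjust them for your own array; the monitor daemon is included because it's what actually turns a failure event into a notification, and the exact invocation is only a suggestion.

    # watch the array state before, during, and after the test
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # mark the component as failed, drop it, then re-add it so the rebuild can be observed
    mdadm /dev/md0 --fail /dev/sdg
    mdadm /dev/md0 --remove /dev/sdg
    mdadm /dev/md0 --add /dev/sdg

    # run the monitor daemon so failure events get mailed to root
    mdadm --monitor --scan --daemonise --mail=root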
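
Since the 3ware case only notices the missing disk once something touches the md device, one stopgap (untested here, and only a workaround for slow detection, not a fix) is to make sure the array gets read regularly, for example from cron. On kernels new enough to expose the md sysfs interface, a periodic consistency check goes further and reads every component:

    # a small read through the md device should be enough to provoke detection
    dd if=/dev/md0 of=/dev/null bs=1M count=16

    # if /sys/block/md0/md/ exists on this kernel, a full check exercises all components
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt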
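
As for the hdparm-style middle ground, two untested candidates suggest themselves: hdparm's sleep command, which spins the drive down until it is reset, and the SCSI layer's sysfs hook for dropping a device outright. Whether hdparm can reach a SATA disk behind libata depends on the kernel and hdparm versions, so treat this strictly as a starting point:

    # spin the drive down and put it to sleep; it stays unresponsive until reset or power-cycled
    hdparm -Y /dev/sdg

    # alternatively, ask the SCSI layer to forget the device entirely
    echo 1 > /sys/block/sdg/device/delete

    # to bring it back afterwards, rescan the host adapter
    # (host0 is just a placeholder; pick the right one for your machine)
    echo "- - -" > /sys/class/scsi_host/host0/scan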

Any advice or suggestions about things to consider or how other folks have approached this problem would be much appreciated, as would warnings about what not to do!

 

Comments on this Entry

Re: Methods for testing Linux Software RAID?
Posted by Steve (62.30.xx.xx) on Mon 17 Jul 2006 at 21:54

> At the other end of the spectrum is physically yanking a disk out of its cage. Having a hotswap chassis makes this kind of test easier to do (though i'm not convinced it's entirely safe for the hardware to be subjected to this kind of abuse). However, i've seen inconsistent results for this, even with identical kernels.

Remind me to write up sometime how we spent months dealing with intermittent faults on one server, running SCO, ultimately discovering that a test "yank" had been carried out when the machine was initially set up.

Course it wasn't hot-swappable ....

Multiple parts were replaced: motherboard, RAID controller, drives, etc. Ugh. The machine was eventually written off.

Steve


Re: Methods for testing Linux Software RAID?
Posted by simonw (84.45.xx.xx) on Tue 18 Jul 2006 at 18:38
In the world of Enterprise hardware I was told by HP hardware engineers that yanking SCSI cables to test the system's high-availability settings was "definitely outside the scope of the hardware warranty".

However, there is no more realistic way to test catastrophic hardware failure. If the devices are hot-swappable and don't survive such experiments, they will almost certainly fail equally badly in real use eventually, given enough installations (of which yours is one).

Indeed, the failure I was MOST expecting it to have to cope with was some prat in the computer room knocking one of the (supposedly redundant) cables out by accident.

At least most modern hardware costs a lot less than the HP kit I was yanking cables from, so you won't end up facing a mortgage-sized bill if the hardware dies and they hold you responsible.

Alternatively, you can wait for your systems to fail and see how good the vendor's lawyers are.

For what it is worth, that HP Enterprise kit did exactly what the manual said it would do in all the cases tested.
