So as part of my job I’ve had to take the firehose crash course in Fibre Channel (or Fiber Channel, as some spell it) storage technology. It’s interesting stuff, still somewhat relegated to the higher end of storage devices, but with Apple’s recent (okay, three-year-old) introduction of Xsan and such to its systems for the video editing world, a lot of it has become somewhat more reasonable and non-Fortune-50-level. However, as I’ve found, there are very few people who really know their stuff when it comes to complicated Fibre Channel setups; at some point I’ll try to relay what I’ve learned. But that’s not for this post.
In this post, I would like to relay to you just how I discovered how robust NetApp storage really is. NetApp has been around for at least fifteen years (probably longer); I’ve been interacting with their equipment in various capacities for almost ten. Their systems are basically dedicated file serving appliances, loosely based around customized Intel x86 architecture, with a very nice filesystem and operating system that tries as hard as it can to protect the data you store on it. Oh, and they’re quite fast at serving NFS and CIFS traffic, too — some wag at work decided it might be a good idea to store old data on Buffalo Terastations, which, while fine for home users, really kind of pale in a multi-user situation. Copying data off of those things tops out at about 7 Mbytes/sec, whereas I’ve gotten upwards of 250 Mbytes/sec going to a NetApp box, and it wasn’t even one of their higher-end models.
NetApp isn’t the fastest network storage out there (that award would probably have to go to BlueArc), but their systems are solid, reliable, and recover gracefully from pretty much any failure you can throw at them. NetApp is the company that first introduced me to the concept of “we know what’s wrong with your system before you do”. If a drive is failing (it doesn’t even have to have completely failed, just be throwing enough weird juju that the OS loses confidence in it), the OS will snap up a spare disk, start copying data from the failing drive to the spare, and then fail the bad drive and move the spare into the RAID group, all without you really noticing unless you’re paying attention to the log messages. This is the company for which the way you find out a drive failed overnight is that a replacement drive is waiting for you when you arrive at the office the next morning.
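The sequence above — distrust a sick drive, copy its data to a hot spare while the drive is still mostly readable, then swap the spare into the RAID group — can be sketched roughly like this. This is a toy illustration of the general predictive-sparing idea, not NetApp’s actual implementation; the names, threshold, and data model are all made up:

```python
# Toy model of predictive disk sparing: once a drive throws enough
# errors for the OS to lose confidence in it, copy its contents to a
# hot spare *before* failing it, then swap the spare into the group.
# (Illustrative only -- names and thresholds are invented.)

ERROR_THRESHOLD = 3  # errors tolerated before the drive is distrusted

class Drive:
    def __init__(self, name, blocks=None):
        self.name = name
        self.blocks = blocks or []
        self.errors = 0

    @property
    def failing(self):
        return self.errors >= ERROR_THRESHOLD

def copy_out_and_swap(raid, spares):
    """Replace any failing drive in the RAID group with a hot spare."""
    for i, drive in enumerate(raid):
        if drive.failing and spares:
            spare = spares.pop()
            spare.blocks = list(drive.blocks)  # copy while source is still readable
            raid[i] = spare                    # then swap the spare into the group
            print(f"failed {drive.name}, replaced with {spare.name}")
    return raid

raid = [Drive("disk0", [1, 2, 3]), Drive("disk1", [4, 5, 6])]
spares = [Drive("spare0")]
raid[1].errors = 5          # disk1 starts throwing weird juju
copy_out_and_swap(raid, spares)
```

The point of copying before failing is that a merely *suspect* drive can still serve reads, so the rebuild is a straight copy rather than a parity reconstruction across the whole group.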
(A note to the brave and/or foolhardy: don’t try this at home or at work. It’s certainly not supported by NetApp, and I got lucky. It may void your warranty, though I trust it won’t cause your NetApp to burst into flames.)
So, I was working with just such a system, adding some extra shelves to it as we try to move towards 500G and 750G drive shelves, away from the older 275G and 320G shelves, to increase the density of this particular system. It uses Fibre Channel to connect the head to the drives, which means that although the proper way to make changes to the system is to take it down, reconnect things as needed, and then bring it back up, you can make certain changes (like adding disk shelves) on the fly.
I cabled the shelves together, wired them up to the file server, set the shelf IDs, and powered them on. The system recognized the new drives and started doing its work to add them into the system as spares.
Except for one piece of stupidity on my part. I’d forgotten that shelf IDs are not like SCSI IDs; even though the little selector supports it, you can’t have a shelf ID of zero — shelf IDs start at 1. Which meant that shelf “zero” wasn’t being recognized by the system.
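The gotcha is just an off-by-one convention: SCSI IDs start at 0, shelf IDs start at 1, and the selector won’t stop you from dialing in an invalid zero. A sanity check along these lines would have caught it — this is a hypothetical helper I’m making up for illustration, not any NetApp tool:

```python
# Hypothetical sanity check for shelf IDs on a Fibre Channel loop.
# Unlike SCSI IDs (0-based), shelf IDs are 1-based; the hardware
# selector will happily let you set 0, but the filer won't see
# that shelf. Duplicate IDs on the same loop are also trouble.

def validate_shelf_ids(ids):
    """Return a list of problems with the proposed shelf IDs."""
    problems = []
    if any(i == 0 for i in ids):
        problems.append("shelf ID 0 is not recognized; shelf IDs start at 1")
    seen = set()
    for i in ids:
        if i in seen:
            problems.append(f"duplicate shelf ID {i} on this loop")
        seen.add(i)
    return problems

# e.g. the mistake I made: IDs set as if they were SCSI IDs
print(validate_shelf_ids([0, 1, 2, 3]))
```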
So I figured, what the hell, they’re spare disks; let’s just power the shelves back off and reset the IDs. I’d noticed that the system was upgrading the firmware on one of the shelves, so I waited until that was done and the system didn’t appear to be doing anything else to that chain of disks, powered them off, reset the IDs, and powered them back on.
The system did scream bloody murder (it beeped, sent pages to the admins, and opened a trouble case with NetApp support), but after about five minutes of twitching, it figured out that the 56 disks that had just disappeared had indeed reappeared, and after a few more bouts of convulsing it calmed down and concluded that everything was really all right.
What impressed me most about this was the reasonably graceful manner in which the NetApp figured out what was happening and recovered from what was, for all intents and purposes, a catastrophic disk failure. I suppose two things made this less severe than it could have been: 1) the disks were all spares, not data disks, and 2) the disks were on their own Fibre Channel interfaces, with no other disks behind them. But my recent experiences with Macintosh computers on Fibre Channel had made me wonder what would happen when I tried this — even if a Mac hasn’t mounted a disk, if it can see the disk over the Fibre Channel network and the disk goes away, the Mac will probably lock up at some point in the future unless you reboot it first. I would at least have expected the system not to re-recognize the disks the second time I powered them up.
I have to say that, for as pricey as these things are (usually in the five to low six figures, though they’re not nearly as bad as some higher-end storage), they’re worth what you pay for them in purchase and support. The systems don’t go down, the support organization behind them is stellar, and they just plain work. The only thing I don’t like about them is their new logo (the monolithic ‘n’); it looks like a piece of a henge. You know, like Stonehenge, woodhenge, and strawhenge. Beyond that, I’m quite happy with the equipment.