
Is this the year SSDs will get cheaper than disk?

I’ve been wanting to start tracking this for a while; we’ll see how well it works. Solid state disks have been showing up more and more and getting cheaper and cheaper, but they were still a good deal more expensive than traditional “spinning rust” hard disks. Then the unfortunate Thai floods started in July (and apparently some areas of Thailand are still flooded), affecting both Seagate and Western Digital hard drive factories. Since those two companies now encompass most of the world’s hard disk manufacturers — Western Digital is on track to finish acquiring Hitachi’s drive manufacturing, and Toshiba, Fujitsu, and Samsung are still there but represent a relatively small portion of the market — the floods had a huge impact on the price of spinning drives and have already started affecting availability. In late October, drives started doubling in price and people started making runs on stores; by late November, it was hard to find certain drives at NewEgg or Fry’s.

So here’s my attempt to start comparing drives, both raw street price and by cost/GB, to see when SSDs become a cost-effective option against spinning disks. We’ll see if it actually happens this year.

As of 1/15/2012, we’ve still got a ways to go:

Size   Type             Grade       Manufacturer  Model             Cost    Cents/GB
2TB    Spinning SATA    Consumer    Seagate       ST2000DL003       $145    7.25
2TB    Spinning SATA    Consumer    WD            WD20EARS          $149    7.45
2TB    Spinning SATA    Consumer    Hitachi       7K2000 2TB        $160    8.00
2TB    Spinning SATA    Enterprise  Hitachi       A7K2000 2TB       $450    22.50
2TB    Spinning SATA    Enterprise  Seagate       ST32000644NS      $408    20.40
3TB    Spinning SATA    Consumer    Seagate       ST3000DM001       $232    7.73
3TB    Spinning SATA    Enterprise  Seagate       ST33000651NS      $498    16.60
3TB    Spinning SATA    Enterprise  Hitachi       7K3000 3TB        $652    21.73
600GB  Spinning SAS     Enterprise  Seagate       ST3600057SS       $492    82.00
600GB  High-speed SATA  Consumer    WD            WD6000HLHX        $288    48.00
600GB  Spinning SAS     Enterprise  Hitachi       HUS156060VLS600   $457    76.16
200GB  SLC SSD          Enterprise  Seagate       ST200FX0002       $2,907  1453.50
200GB  MLC SSD          Enterprise  Seagate       ST200FM0012       $1,504  752.00
200GB  MLC SSD          Consumer    OCZ           OCZSSD2-2VTX200G  $308    154.00

(Note: these aren’t price guarantees; prices are going to fluctuate constantly. The price links go to the searches where I found the prices listed. I’ll try to update once every couple of weeks to see where things go.)
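If you want to check the math as prices move, the cents-per-GB column is just the price in dollars times 100, divided by the capacity in GB. A quick bc one-liner, using the ST2000DL003 row above as the example:

# cents per GB = dollars * 100 / capacity in GB
echo "scale=2; 145 * 100 / 2000" | bc
# prints 7.25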

Musings on version control and object stores

I’ve been looking at object storage systems (think Amazon S3, Swift, “cloud storage”) lately to see if they solve some storage problems, and I’m not particularly pleased with where any of them are right now. I realize it’s still early days for them, but all the ones I’ve looked at still have some development ahead of them. Most of them eschew traditional storage systems, opting instead to use raw hard disks, replicate objects to protect against hardware issues, and keep track of objects via an internal database or hashing mechanism.

The thing I can’t get past is that compared to traditional storage (RAIDed, highly available, or clustered), traditional databases (mysql, linter), and traditional filesystems (xfs, ext3), they all seem to want to reinvent the wheel. The object stores aren’t necessarily better for it; to get performance and protection you have to scale up the number of servers and hard drives to the point where it’s reasonable to ask whether it makes sense just to buy a traditional storage system. Three-way replication gives you 33% storage efficiency; a traditional RAID6 group gives you 67% even at a modest four-data-plus-two-parity width, and better with wider groups.
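The arithmetic behind those efficiency numbers, assuming three object replicas on one side and that modest 4+2 RAID6 group on the other:

# usable fraction of raw capacity
echo "scale=2; 1 / 3" | bc        # 3x replication: .33
echo "scale=2; 4 / (4 + 2)" | bc  # 4+2 RAID6: .66, and wider groups do better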

Separately, I was reading about the git source control software and got to thinking: it looks a heck of a lot like a simple object store, where the files are stored as objects, and ‘versions’ or ‘branches’ of a project are just different lists of objects.

If you take Swift as an example and, instead of deploying it on raw hard drives, use some kind of RAIDed and clustered storage (e.g. NetApp, BlueArc, EMC, maybe even Isilon or Lustre) where the data protection (and recovery) mechanism is offloaded from Swift, you should be able to turn the number of object replicas down from three to one. Storage protection is handled by the backend system; and if a Swift node (either storage or proxy) dies, spin up a new server and point it at the storage system.
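For what it’s worth, Swift’s replica count is set when you build the rings, so a single-replica setup is just a ring-builder invocation. A minimal sketch, assuming one storage node whose “disk” is really a LUN from the backend array (the IP, port, and device name are placeholders):

# create an object ring with one replica instead of the usual three
swift-ring-builder object.builder create 18 1 1

# add the storage node device, then build the ring
swift-ring-builder object.builder add z1-192.168.1.10:6000/lun0 100
swift-ring-builder object.builder rebalance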

At that point, Swift is effectively storing files, calculating their checksums and returning a key to the client to use as an object key, keeping metadata (tags) if you like, and serving the files back on request. You can turn off the replication mechanism, since the storage is taking care of that for you. Keeping the scrubber running to make sure objects match their checksum probably isn’t a bad idea if you’re paranoid, but also shouldn’t be necessary (again, the storage takes care of that for you). But whether Swift is running on protected storage or raw hard drives, you can’t (easily) provide an index of the available files, or replicate them geographically (from, say, the New York server to the Los Angeles server).

Bring in git. git stores all files as objects, named by their SHA1 checksum, and keeps lists (“trees”) of these objects and of sub-trees to represent a directory structure. Update a file or few, and a new tree is created to represent the new structure and include the new objects.
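You can see that model directly with git’s plumbing commands; a quick sketch (the file names and contents are made up):

# set up a bare-bones object store
git init objstore && cd objstore

# store a file as an object; git prints the SHA1 that becomes its key
echo "hello object store" > report.txt
key=$(git hash-object -w report.txt)

# retrieve the object later by that key
git cat-file -p "$key"

# store the same content under another name and you get the same key back,
# which is the full-file deduplication mentioned below
cp report.txt copy.txt
git hash-object -w copy.txt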

Git apparently isn’t very good at large repositories of files because of how it stats every file in a tree during certain operations, while Swift uses a hashing mechanism to keep it quick. But if you serve up the git objects via a web server (perhaps with a module to strip out the git header and to look into git object packages) and primarily store or retrieve objects, you can use Basic or Digest authentication to protect the content (for GETs or PUTs), and (in theory) can use ‘git clone’ and ‘git pull’ to keep remote replicas up to date. If you had a mechanism for dealing with conflicts (or just never allowed file modification) you might be able to use this mechanism to allow multimaster replication between sites. And Git handles full-file deduplication by keeping only one copy of an object — if a second copy is added, the SHA1 checksum already exists, so no need to write a second file.

Richard Anderson at Stanford has already taken a look at this and listed out what he’s found when considering git as an object store, with some good detail on the strengths and weaknesses.

Git and Swift both have their strengths and weaknesses for specific applications; but maybe the Swift folks can take some direction from git, or maybe someone can address some of the issues with creating a git-based object store. Perforce is also good at holding very large source trees; they keep a database of the repository and use a proprietary client/server protocol, but maybe they’d be interested in seeing if a p4-based object store makes sense?

The REAL cage nut tool

Anyone who’s spent time installing servers or other electronics in standard 19″ (or 24″) racks has run into this problem. In these racks, the holes are square, and most equipment has brackets that require screws to attach. So you install a cage nut, which is a square nut surrounded by a razor-sharp piece of metal that’s designed both to hold the nut in the equipment rack and to cut the fingers of anyone who tries to insert the nut manually:

Cage nut in rack rail

So most rack and equipment manufacturers take a measured amount of pity on the poor installers and include a bent piece of metal (seen in the illustration above) that they call a “cage nut insertion tool”, which is a) cheap and b) still a pain to use. Oh, and did I mention it’s cheap? And it doesn’t work well at all when you need to remove the nut later on.

A long time ago, I discovered a gentleman in Australia (whose web site I now can’t find) making these beautifully simple tools to insert and remove cage nuts. It’s a small machined piece of aluminum with a lever that’s shaped to carefully clamp and squeeze the cage nut, so that the razor sharp edges of the nut will pass through the square opening. Once the cage nut is seated, release the tool, and the razor sharp edges hold the nut in place.

Cage nut tool holding cage nut

Removal is just as good. Instead of the traditional method — insert a screwdriver on one side of the nut so that the nut springs out and shoots across the room — simply apply the tool, squeeze, and remove.

It’s a simple, brilliant piece of technology, and anyone who’s got more than one equipment rack to deal with should have one in their tool chest. You’ll throw out every bent piece of metal that comes with cage nut kits as soon as you find them.

Available from Cables Plus USA or Rackmount Solutions. Disclaimer: I haven’t dealt with either of these businesses, so do your due diligence first. Disclaimer II: Not getting paid or anything, I just really like this tool.

5400RPM vs 7200RPM hard disks — should you care?

A recent Twitter discussion via Doctor Karl led me to the question of whether it matters that, while large-capacity desktop and server drives spin at anywhere from 7200RPM to 15,000RPM, most large-capacity laptop drives spin at only 5400RPM.

For years, 5400RPM was the standard speed of a hard drive, back when 1GB and 2GB drives were huge. The speed-demon server drives that came out shortly thereafter were 4GB and 9GB 7200RPM drives, and they required active cooling so they didn’t melt from the heat. Everyone understood that 7200RPM was faster, because it delivered your bits quicker.

Fast forward ten years, and you can get 1TB and 2TB hard drives spinning at 7200RPM, and smaller laptop drives at 500GB and 1TB spinning at 5400RPM. (You can also get 600GB drives spinning at 15,000 RPM, but that’s today’s bacon-cooker for servers.)

But is the 5400RPM laptop drive really slow?

I would argue it’s not. Areal density increases as drive capacity goes up, and today’s 3.5-inch and 2.5-inch drives are the same physical size as the drives of years past, which means their areal density has gone way up (see some examples on Wikipedia’s Memory Storage Density page).

This means that when a 1TB disk makes one revolution (1/90th of a second at 5400RPM), it’s picking up much more data from the platter than, say, an 80GB disk spinning at 7200RPM picks up in its single, faster revolution (1/120th of a second).
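A rough back-of-the-envelope comparison makes the point; the per-track figures here are made-up round numbers, not specs for any real drive:

# sequential throughput ~ (KB per track) x (revolutions per second), in MB/s
echo "scale=0; 1500 * (5400 / 60) / 1024" | bc   # dense 1TB drive, ~1.5MB per track: ~131
echo "scale=0; 500 * (7200 / 60) / 1024" | bc    # old 80GB drive, ~0.5MB per track: ~58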

Granted, you’ll still get faster performance out of a 7200RPM drive than a same-sized 5400RPM drive, but it’s not the end of the world to pick the slower-rotating disk. It also won’t get as hot as the faster one.

Now keep in mind, too, that these speeds are talking about sequential or mostly-sequential data access, when you start at one point on the disk, and read continuously to a later point, like playing a record or a CD. Unless you’re playing audio or video, most data access isn’t like this. And that kind of performance — operations per second — hasn’t increased much over 20 years. But that’s another post.

fiber channel (fibre channel)

One of the things I want to start posting about is fiber channel (sometimes spelled fibre channel). It’s something I was first exposed to back in 1998, and have been dabbling with off and on since then (entirely at work, since it’s pricey stuff). In the past year I’ve been using it much more, and have learned quite a bit about it.

The biggest issue I have is that whenever I google about an issue, I either run into someone who’s using it in a database application, or someone who’s using it at a very low end in a small video configuration. Neither of these scenarios fits our situation at work, so I have to experiment to figure out the solution to a problem, and more often than not it’s a shot in the dark. On the other hand, it gives me an opportunity to really learn the technology and, with luck, impart more information about it.

The three-sentence description of fiber channel is basically this. Attaching a disk to a computer happens via some kind of storage interface — usually IDE/ATA (two disks max), SCSI (7 or 15 disks max), or SATA (one disk per port, more with port multipliers). USB and FireWire don’t count, since those just bridge an IDE or SATA disk to USB or FireWire and then attach it. Fiber channel is basically the result of someone saying “Hey Beavis, let’s take the disks out of computers, put them somewhere else, and then tie all the computer-to-disk connections together!”

The result of this is you get some of the great features of networks — with the right hardware, you can attach thousands of disks to a single computer. With the right hardware, you can transfer data screamingly fast. With the right software and hardware, you can add multiple links together and increase the speed of the connection. And with the right software, you can share a common set of disks between multiple computers.

The problem is that it brings with it the bad features of disk controllers. Most operating systems will scan for disks when they start up, and whatever they find, that’s what they expect to keep. They don’t like having new disks presented to them after bootup, and they *really* don’t like losing a disk that they found at bootup. Windows is a lot worse — it actually assumes that it owns any disk it sees, and writes a little tag to the beginning of each disk it finds if it doesn’t recognize the existing disk label.
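On Linux, at least, the “new disks after bootup” half of the problem is workable without a reboot. A rough sketch, assuming a 2.6-era kernel with sysfs (the host number will vary by system):

# see which FC HBAs the kernel knows about
ls /sys/class/fc_host/

# tell a SCSI host to rescan for new devices
# (the three dashes are wildcards: all channels, all targets, all LUNs)
echo "- - -" > /sys/class/scsi_host/host2/scan

# the new LUNs should then show up here
dmesg | tail
cat /proc/partitions

Losing a LUN the OS already knows about is still the painful case; the rescan only helps with adding.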

I’ve spent the past year trying to wring performance and reliability out of several different fiber channel configurations at work, for video, database, shared storage, and other configurations, and have mostly succeeded, with lots of help from colleagues and vendors. I’ve learned a lot, and so has everyone involved; and I’ve not found a lot of references to most of the things we’ve learned, so I want to try and share it.

Next post on this topic — an introduction to our fiber channel switches.

as I’ve discovered, NetApps are reasonably rock solid

So as part of my job I’ve had to start taking the firehose crash course in Fiber Channel (or Fibre Channel, as some still put it) storage technology. It’s interesting, still somewhat relegated to the higher end of storage devices, but with Apple’s recent (okay, three-year-old) introduction of Xsan and such to its systems for the video editing world, a lot of it has become somewhat more reasonable and non-Fortune-50-level. However, as I’ve found, there are very few people who really know their stuff when it comes to complicated fiber channel setups; at some point I’ll try to relay what I’ve learned. But that’s not for this post.

In this post, I would like to relay to you just how I discovered how robust NetApp storage really is. NetApp has been around for at least fifteen years (probably longer); I’ve been interacting with their equipment in various capacities for almost ten. They’re basically dedicated file serving appliances, loosely based around some customized Intel x86 architecture, with a very nice filesystem and operating system that tries as hard as it can to protect the data you store on it. Oh, and their systems are quite fast at serving NFS and CIFS traffic, too — some wag at work decided it might be a good idea to store old data on Buffalo TeraStations, which, while fine for home users, really kind of pale in a multi-user situation. Copying data off of those things tops out at about 7Mbytes/sec, whereas I’ve gotten upwards of 250Mbytes/sec going to a NetApp box, and it wasn’t even one of their higher-end models.

NetApp isn’t the fastest network storage out there (that award would probably have to go to BlueArc), but they’re solid, reliable, and recover gracefully from pretty much any failure you can throw at them. NetApp is the company that first introduced me to the concept of “we know what’s wrong with your system before you do”. If a drive is failing (it doesn’t even have to have completely failed, just be throwing enough weird juju that the OS loses confidence in it), the OS will snap up a spare disk, start copying data from the failing drive to the spare, and then fail the bad drive and move the spare into the RAID group, all without you really noticing unless you’re paying attention to the log messages. This is the company where the way you find out a drive failed overnight is that a replacement drive is waiting for you when you arrive at the office the next morning.

(A note to the brave and/or foolhardy: don’t try this at home or at work; it’s certainly not supported by NetApp, and I got lucky. It may void your warranty, though I trust it won’t cause your NetApp to burst into flames.)

So, I was working with just such a system, adding some extra shelves to it as we’re trying to move towards 500G and 750G drive shelves, away from the older 275G and 320G shelves, to increase the density of this particular system. It uses fiber channel to connect the head to the drives, which means that although the proper way to make changes to the system is to take it down, reconnect things as needed, then bring it back up, you can make certain changes (like adding disk shelves) on the fly.

I connected up the shelves together, wired them up to the file server, set the shelf IDs, and powered them on. The system recognized the new drives, and started doing its work to add them into the system as spare drives.

Except for one piece of stupidity on my part. I’d forgotten that shelf IDs are not like SCSI IDs: even though the little setting supports it, you can’t have a shelf ID of zero; shelf IDs start at 1. Which meant that shelf “zero” wasn’t being recognized by the system.

So I figured, what the hell, they’re spare disks, let’s just power the shelves back off and reset the IDs. I’d noticed that the system was upgrading the firmware on one of the shelves, so I waited until this was done and it didn’t appear to be doing anything else to that chain of disks, powered them off, reset the IDs, and powered them back on.

The system did scream bloody murder (beeped, sent pages to the admins, and opened a trouble case with NetApp support), but after about five minutes of twitching, the system figured out that the 56 disks that had just disappeared had indeed reappeared, and that after a few bouts of convulsing and finally calming down, everything was really all right.

What impressed me most about this was the reasonably graceful manner in which the NetApp figured out what was happening and recovered from what for all intents and purposes was a catastrophic disk failure. I suppose the two things that made this less severe than it could have been were 1) the disks were all spares, not data disks, and 2) the disks were on their own fiber channel interfaces that didn’t have other disks on them as well. But my recent experiences with Macintosh computers on fiber channel (even if a disk’s not mounted by a system, if the Mac can see it over the fiber channel network and the disk goes away, the Mac will probably lock up at some point in the future if you don’t reboot it first) had made me wonder what would happen when I tried this. I would at least have expected the system to not re-recognize the disks the second time I powered them up.

I have to say that for as pricey as these things are (usually in the five to low six figures; though they’re not nearly as bad as some higher-end storage), they’re worth the amount you pay for them in purchase and support. The systems don’t go down, the support organization behind them is stellar, and they just plain work. The only thing I don’t like about them is their new logo (the monolithic ‘n’; it looks like a piece of a henge. You know, like stonehenge, woodhenge, and strawhenge). Beyond that, I’m quite happy with that equipment.

mysterious error messages, part 2

Here’s one that I just ran into; the results from a google search aren’t exactly helpful (no, you don’t need to reinstall the package because of this error).

After installing proftpd 1.3.0a, using a mostly-default /etc/proftpd.conf, on CentOS 4.4, you try to start it up and get the following error message:

- Fatal: ScoreboardFile: : unable to use '/var/run/proftpd.scoreboard': Operation not permitted on line 58 of '/etc/proftpd.conf'

The unhelpful error message doesn’t explain, like the comments in the source code do, that the scoreboard file should not be in a world-writable directory. On CentOS 4.4, /var/run is world-writable with the sticky bit (like /tmp) so that processes that don’t run as root can put their lock files in there.

Solution: create a new directory (I chose /var/lib/proftpd), chown it to the same user that proftpd runs as (the User directive in /etc/proftpd.conf), and make sure it’s mode 775 or similar. Then change the following line in /etc/proftpd.conf:

ScoreboardFile /var/run/proftpd.scoreboard

to

ScoreboardFile /var/lib/proftpd/proftpd.scoreboard
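Put together, the fix looks something like this; it assumes proftpd runs as a “proftpd” user and group, so check the User and Group directives in your proftpd.conf and substitute accordingly:

# create a non-world-writable home for the scoreboard file
mkdir /var/lib/proftpd
chown proftpd:proftpd /var/lib/proftpd
chmod 775 /var/lib/proftpd

# then restart proftpd however you normally start it, e.g.
service proftpd restart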

I should probably submit a patch to make a more helpful error message. But that won’t help the users with default installs who just run into this error.

mysterious error messages, part 1

This may or may not be the first in a series of posts in which a strange, unexplained error shows up and a non-obvious solution is found.

This particular error message came after creating a software RAID device:


# mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdc1 is too small: 0K

I had just partitioned the disks with fdisk and set the partition type; sfdisk -l on the disk gave the correct output. Nobody else appeared to have a solution to this; a couple of posts with the same question went unanswered.

It turns out that, for the first time ever for me (despite fdisk’s perpetual warning), the kernel didn’t properly reread the partition table when fdisk wrote out the new one. This only happened on sdc, not on sdd.

I figured this out with mke2fs’s much more explanatory error message:


# mke2fs /dev/sdc1
mke2fs 1.35 (28-Feb-2004)
mke2fs: Device size reported to be zero. Invalid partition specified, or
partition table wasn't reread after running fdisk, due to
a modified partition being busy and in use. You may need to reboot
to re-read your partition table.

The fix was to run fdisk one more time and just say ‘w’ to write out the partition table again, which (more importantly) makes the ioctl() call again to have the kernel reread the partition table, this time properly. The next step would have been to reboot if that didn’t work, but I didn’t want to. (As Saif said in the previous post, rebooting is for adding new hardware.)
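For future reference, there are also tools that ask the kernel to reread the table directly, which beats walking through fdisk again (assuming they’re installed on your system):

# util-linux's blockdev can issue the reread ioctl against the whole disk
blockdev --rereadpt /dev/sdc

# or, from the parted package
partprobe /dev/sdc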

As I find more of these non-obvious error messages and the solution, I’ll try to post about them. Hope this helps someone out.

ghetto raid scrubbing with linux

As a follow up to my adventures with Linux RAID scrubbing (or lack thereof), I decided to poke around a bit more this weekend after a filesystem started throwing some errors.

It appears that someone did fix at least part of the issue I ran into — a memcpy() was left out of the repair kernel code — but I’m not planning on installing that kernel for a while. Not without some serious testing, or perhaps after it’s applied in a RedHat/CentOS update kernel.

However, I did come up with something that may work as a very ghetto software RAID1 verification technique. (The following keywords should help someone google this post: linux software raid verify scrub oh shit.)

Here’s what you do. First, find the size of the mirror from /proc/mdstat:

md5 : active raid1 hdd1[1] hdb1[0]
      58613056 blocks [2/2] [UU]

Multiply the number of blocks by 1024:

[root@linux] # bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
58613056*1024
60019769344

Then, run cmp on the two devices that make up the mirror:

[root@linux] # cmp /dev/hdb1 /dev/hdd1
/dev/hdb1 /dev/hdd1 differ: byte 60019769345, line 246090365

If the byte at which the two devices differ is a higher number than the one you came up with using bc, it means both mirrors contain the same data. (From what I can tell, that’s the area where the raid metadata/superblock sits, at the end of the disk.)

If the differ byte number is smaller, you can probably do a more extended test with cmp -l to find out what data differs and whether there are one or more differences. Not sure how to repair at that point; if you feel lucky, you might be able to do some kind of block editing (and guess the value that block should be), but I’m not about to try that part.

Part of the point of scrubbing is to read every byte of data from every disk and make sure there aren’t any read errors; if there are, it should throw a kernel error which shows up in logs, or with IDE might allow the drive firmware to reallocate a block that has a soft error in it (which will show up in smartd’s output).

Note that this will only work with RAID1; RAID5 lays out data differently, in stripes of data and parity, so you’d have to do parity calculations as well as figure out where they are. It could probably be done with some programming, but that’s left as an exercise for the reader }:>.

So yeah, it’s really ghetto, but it appears to work. And now I don’t feel like I’m flying 100% blind and not knowing whether my mirrors are really mirrors. If I feel industrious, I’ll probably put this into a shell script and start running it weekly or something.
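In case I (or you) do get industrious, here’s a rough sketch of what that script might look like; it only handles RAID1, takes the md device and its two members as arguments, and is a starting point rather than a polished tool:

#!/bin/sh
# ghetto RAID1 verify: compare the two members of a mirror and report
# whether any difference falls inside the data area (before the md superblock)
# usage: ghetto-verify.sh md5 /dev/hdb1 /dev/hdd1

MD=$1; DEV1=$2; DEV2=$3

# data area in bytes = block count from /proc/mdstat * 1024
BLOCKS=$(awk -v md="$MD" '$1 == md {getline; print $1}' /proc/mdstat)
BYTES=$(echo "$BLOCKS * 1024" | bc)

# on the first mismatch, cmp prints "... differ: byte N, line M"
DIFF=$(cmp "$DEV1" "$DEV2" | awk '{print $5}' | tr -d ',')

if [ -z "$DIFF" ]; then
    echo "$MD: members are identical, superblocks and all"
elif [ "$DIFF" -gt "$BYTES" ]; then
    echo "$MD: data areas match; the first difference is in the superblock area"
else
    echo "$MD: MISMATCH at byte $DIFF (data area is $BYTES bytes)"
fi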