Is this the year SSDs will get cheaper than disk?

I’ve been wanting to start tracking this for a while; we’ll see how well it works. Solid state disks have been showing up more and more and getting cheaper and cheaper, but they were still a good deal more expensive than traditional “spinning rust” hard disks. Then the unfortunate Thai floods started in July (and apparently some areas of Thailand are still flooded), affecting both Seagate and Western Digital hard drive factories. Those two companies now account for most of the world’s hard disk manufacturing (Western Digital is on track to finish acquiring Hitachi’s manufacturing, and Toshiba, Fujitsu, and Samsung are still there but represent a relatively small portion of the market), so the floods had a huge impact on the price of spinning drives and have already started affecting availability. In late October, drives started doubling in price, and people started making runs on stores; in late November, it was hard to find certain drives at NewEgg or Fry’s.

So here’s my attempt to start comparing drives, both raw street price and by cost/GB, to see when SSDs become a cost-effective option against spinning disks. We’ll see if it actually happens this year.

As of 1/15/2012, we’ve still got a ways to go:

| Size | Type | Grade | Manufacturer | Model | Cost | Cents/GB |
|---|---|---|---|---|---|---|
| 2TB | Spinning SATA | Consumer | Seagate | ST2000DL003 | $145 | 7.25 |
| 2TB | Spinning SATA | Consumer | WD | WD20EARS | $149 | 7.45 |
| 2TB | Spinning SATA | Consumer | Hitachi | 7K2000 2TB | $160 | 8.00 |
| 2TB | Spinning SATA | Enterprise | Hitachi | A7K2000 2TB | $450 | 22.50 |
| 2TB | Spinning SATA | Enterprise | Seagate | ST32000644NS | $408 | 20.40 |
| 3TB | Spinning SATA | Consumer | Seagate | ST3000DM001 | $232 | 7.73 |
| 3TB | Spinning SATA | Enterprise | Seagate | ST33000651NS | $498 | 16.60 |
| 3TB | Spinning SATA | Enterprise | Hitachi | 7K3000 3TB | $652 | 21.73 |
| 600GB | Spinning SAS | Enterprise | Seagate | ST3600057SS | $492 | 82.00 |
| 600GB | High-speed SATA | Consumer | WD | WD6000HLHX | $288 | 48.00 |
| 600GB | Spinning SAS | Enterprise | Hitachi | HUS156060VLS600 | $457 | 76.17 |
| 200GB | SLC SSD | Enterprise | Seagate | ST200FX0002 | $2,907 | 1453.50 |
| 200GB | MLC SSD | Enterprise | Seagate | ST200FM0012 | $1,504 | 752.00 |
| 200GB | MLC SSD | Consumer | OCZ | OCZSSD2-2VTX200G | $308 | 154.00 |

(Note, these aren’t price guarantees; they’re going to fluctuate constantly. The price links are to the searches where I found the prices listed. I’ll try to update once every couple of weeks to see where things go.)
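For anyone who wants to redo the arithmetic as prices move, here is a minimal sketch of how the Cents/GB column can be computed. It assumes decimal capacities (2TB = 2000GB, which is what the figures in the table imply); the function name `cents_per_gb` and the choice of rows are just for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def cents_per_gb(price_dollars: str, capacity_gb: int) -> Decimal:
    """Return the price in cents per (decimal) gigabyte, rounded to 2 places."""
    # Accept prices written the way they appear in the table, e.g. "$2,907".
    dollars = Decimal(price_dollars.replace("$", "").replace(",", ""))
    cents = dollars * 100
    return (cents / capacity_gb).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# A few rows from the table: (model, street price, capacity in GB).
drives = [
    ("ST2000DL003", "$145", 2000),
    ("ST3000DM001", "$232", 3000),
    ("OCZSSD2-2VTX200G", "$308", 200),
]

for model, price, capacity_gb in drives:
    print(f"{model}: {cents_per_gb(price, capacity_gb)} cents/GB")
```

On these numbers the cheapest consumer SSD is still roughly twenty times the per-gigabyte price of the cheapest 2TB spinning drive.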

Musings on version control and object stores

I’ve been looking at object storage systems (think Amazon S3, Swift, “cloud storage”) lately to see if they solve some storage problems, and I’m not particularly pleased with where any of them are right now. I realize it’s still early days for them, but all the ones I’ve looked at have some development time in store for them. Most of them eschew traditional storage systems, opting instead to use raw hard disks and replicating objects to protect against hardware issues, and keeping track of objects via an internal database or hashing mechanism.

The thing I can’t get past is that compared to traditional storage (RAIDed, high availability or clustered) and traditional databases (mysql, linter, filesystems like xfs, ext3), they all seem to want to reinvent the wheel. The object stores aren’t necessarily better for it; to get performance and protection you have to scale up the number of servers and hard drives to the point where it’s reasonable to ask whether or not it makes sense just to buy a traditional storage system. With the usual three replicas, an object store gives you 33% storage efficiency; traditional RAID6 gives you 67% on a six-disk group (four data, two parity), and more on wider groups.
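The 33% and 67% figures are just ratios of usable to raw capacity. A rough sketch, assuming three full replicas on the object store side and a six-disk RAID6 group (the group size is my assumption; wider groups do better), with hypothetical helper names:

```python
def replica_efficiency(replicas: int) -> float:
    """Usable fraction of raw capacity when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("need at least one copy")
    return 1 / replicas


def raid6_efficiency(disks: int) -> float:
    """Usable fraction of a RAID6 group: two disks' worth of capacity goes to parity."""
    if disks < 4:
        raise ValueError("RAID6 needs at least four disks")
    return (disks - 2) / disks


print(f"3 replicas:      {replica_efficiency(3):.0%}")
print(f"RAID6, 6 disks:  {raid6_efficiency(6):.0%}")
print(f"RAID6, 12 disks: {raid6_efficiency(12):.0%}")
```

This ignores hot spares, filesystem overhead and the like, so treat it as a back-of-the-envelope comparison only.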

Separately, I was reading about the git source control software and got to thinking: it looks a heck of a lot like a simple object store, where the files are stored as objects, and ‘versions’ or ‘branches’ of a project are just different lists of objects.

If you take Swift as an example, and instead of deploying it on raw hard drives, use some kind of RAIDed and clustered storage (e.g. NetApp, BlueArc, EMC, maybe even Isilon or lustre) where the data protection (and recovery) mechanism is offloaded from Swift, you should be able to turn the number of object replicas down to one, instead of three. Storage protection is then handled by the backend system, and if a Swift node (either storage or proxy) dies, you spin up a new server and point it at the storage system.

At that point, Swift is effectively storing files, calculating their checksums and returning a key to the client to use as an object key, keeping metadata (tags) if you like, and serving the files back on request. You can turn off the replication mechanism, since the storage is taking care of that for you. Keeping the scrubber running to make sure objects match their checksum probably isn’t a bad idea if you’re paranoid, but also shouldn’t be necessary (again, the storage takes care of that for you). But whether Swift is running on protected storage or raw hard drives, you can’t (easily) provide an index of the available files, or replicate them geographically (from, say, the New York server to the Los Angeles server).
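To make that stripped-down role concrete, here is a toy sketch of a store that does only those things: take bytes, checksum them, hand back the checksum as the key, serve them on request, and optionally scrub. This is not Swift's API; `TinyObjectStore` and its methods are hypothetical, and the directory it writes to stands in for the protected backend storage.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


class TinyObjectStore:
    """Hypothetical content-addressed store on an already-protected filesystem."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store the bytes and return their SHA1 hex digest as the object key."""
        key = hashlib.sha1(data).hexdigest()
        path = self.root / key
        if not path.exists():  # same content means same key, so nothing to rewrite
            path.write_bytes(data)
        return key

    def get(self, key: str) -> bytes:
        """Serve the object back by key."""
        return (self.root / key).read_bytes()

    def scrub(self) -> List[str]:
        """Return the keys whose stored bytes no longer match their checksum."""
        return [
            p.name
            for p in self.root.iterdir()
            if hashlib.sha1(p.read_bytes()).hexdigest() != p.name
        ]


with tempfile.TemporaryDirectory() as tmp:
    store = TinyObjectStore(Path(tmp))
    key = store.put(b"hello object store")
    print(key, store.get(key), store.scrub())
```

Note what is missing: there is no listing by name and no replication between sites, which are the same gaps described above.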

Bring in git. git stores all files as objects, named by their SHA1 checksum, and keeps lists (“trees”) of these objects and of sub-trees to represent a directory structure. Update a file or a few, and a new tree is created to represent the new structure and include the new objects.

Git apparently isn’t very good at large repositories of files because of how it stats every file in a tree during certain operations, while Swift uses a hashing mechanism to keep it quick. But if you serve up the git objects via a web server (perhaps with a module to strip out the git header and to look into git packfiles) and primarily store or retrieve objects, you can use Basic or Digest authentication to protect the content (for GETs or PUTs), and (in theory) can use ‘git clone’ and ‘git pull’ to keep remote replicas up to date. If you had a mechanism for dealing with conflicts (or just never allowed file modification) you might be able to use this mechanism to allow multimaster replication between sites. And git handles full-file deduplication by keeping only one copy of an object: if a second copy is added, the SHA1 checksum already exists, so there is no need to write a second file.
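As a rough idea of what the "strip out the git header" step would involve: a loose git object on disk is the zlib-compressed header plus contents, so a server-side module would decompress it and drop everything up to the first NUL byte. The sketch below handles loose objects only; objects inside pack files use a different format and are not handled. Both function names are hypothetical.

```python
import zlib
from typing import Tuple


def encode_loose_object(content: bytes) -> bytes:
    """Build the bytes git stores on disk for a loose blob (used here for the demo)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


def strip_git_header(raw: bytes) -> Tuple[str, bytes]:
    """Decompress a loose object and return (object type, payload without header)."""
    data = zlib.decompress(raw)
    # The header ends at the first NUL; the payload may contain further NULs.
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match payload length")
    return kind.decode("ascii"), body


raw = encode_loose_object(b"contents of some file\n")
print(strip_git_header(raw))
```

A web server module built this way could hand the payload straight back on a GET, keyed by the object's SHA1.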

Richard Anderson at Stanford has already taken a look at this and listed out what he’s found when considering git as an object store, with some good detail on the strengths and weaknesses.

Git and Swift both have their strengths and weaknesses for specific applications; but maybe the Swift folks can take some direction from git, or maybe someone can address some of the issues with creating a git-based object store. Perforce is also good at holding very large source trees; they keep a database of the repository and use a proprietary client/server protocol, but maybe they’d be interested in seeing if a p4-based object store makes sense?

amuse-cerveau for October 17, 2011

Things which amused or interested me today: