When Disks Crash…

There are many things to discuss on the subject of maintaining your aging digital zoo, but here I will share some recent struggles with aging data storage. It’s just a fact of computer operations that hard disks crash. I have associates who work in large datacenters and they forecast disk replacements by the month – if you’ve got 20,000 spinning disks, some number of those are going to need to be replaced every month. Now if you only have a handful of disks under your digital husbandry, you may go for years without having to replace one, but you definitely need to be prepared.
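To put a rough number on that kind of monthly forecast, here is a minimal sketch. The 2% annualized failure rate is purely an assumption for illustration (it is not a figure from my datacenter friends), the function name is made up, and it simplifies by spreading failures evenly over the year.

```python
def expected_monthly_replacements(disk_count: int, annual_failure_rate: float) -> float:
    """Expected number of disk failures per month.

    Assumes failures are spread evenly across the year, which is a
    simplification; real fleets see failure rates change with disk age.
    """
    if disk_count < 0 or not 0.0 <= annual_failure_rate <= 1.0:
        raise ValueError("disk_count must be >= 0 and the rate between 0 and 1")
    return disk_count * annual_failure_rate / 12


if __name__ == "__main__":
    # A large datacenter: 20,000 disks at an assumed 2% annual failure rate.
    print(round(expected_monthly_replacements(20_000, 0.02), 1))
    # A handful of disks at home, same assumed rate.
    print(round(expected_monthly_replacements(5, 0.02), 3))
```

With those assumed numbers the datacenter replaces roughly 33 disks a month, while the home setup expects less than one failure in a hundred months, which matches the "may go for years" experience.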

If you are properly prepared, having a disk crash should not be any worse than a minor fender-bender. Preparation falls into two categories: redundancy and backups. Redundant disks are some form of RAID (Redundant Array of Inexpensive Disks) – the simplest is a mirrored disk (RAID 1), and a more advanced form spreads data and parity information over multiple drives (RAID 5). Not all RAID is redundant (RAID 0, for example, is not), and it is an easy subject to research online (start with this Wikipedia article). I personally have only needed either 1 or 5. When you are using redundant RAID and a disk crashes, all your data is still there, and in most cases you can continue to use it. After a crash, the next step is to find a suitable replacement disk and install it. If all goes well, the RAID controller will rebuild the redundancy on the replaced disk and you’ll be fully protected again.
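For the curious, the trick behind RAID 5 redundancy is XOR parity: the parity block is the XOR of the data blocks, so any one missing block can be recomputed from the survivors. The sketch below is a toy illustration of that idea only. A real controller works on stripes and rotates the parity across the disks, and the function name here is my own invention.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, one byte column at a time."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three data "disks" plus one parity block, as in a four-disk RAID 5 stripe.
    data = [b"disk", b"zoo!", b"safe"]
    parity = xor_blocks(data)

    # Pretend the second disk crashed: rebuild it from the others plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

This is also why a rebuild has to read every surviving disk from end to end, and why losing a second disk before the rebuild finishes loses the array.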

Sometimes redundancy is sufficient, such as for a system disk, but as a good steward, you make backups of all important data. If it’s really important data, you keep at least two backups, normally done by simply rotating your backup target. That means you don’t directly overwrite your last good backup with the next one – you make your next backup on a different medium or target storage, preserving your previous one. Also, you should keep that second backup at a different location, just in case a disaster hits. There’s an endless number of variations on how to make backups – burning DVDs or Blu-Rays, copying to USB drives, using backup software for backup and restoration, copying from one PC to another, etc. It’s more important that people make backups than how they make them.
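Rotation is easy to automate. The following is a minimal sketch that assumes one backup run per day and picks the target from the day number, so two consecutive runs never land on the same target. The function name and the mount points in the example are hypothetical.

```python
from datetime import date


def pick_backup_target(targets: list[str], run_date: date) -> str:
    """Pick a target by cycling on the day number.

    With at least two targets, consecutive daily runs always get
    different targets, so the previous backup is never overwritten.
    """
    if len(targets) < 2:
        raise ValueError("rotation needs at least two targets")
    return targets[run_date.toordinal() % len(targets)]


if __name__ == "__main__":
    # Hypothetical mount points for two rotating USB disks.
    targets = ["/mnt/backup-a", "/mnt/backup-b"]
    print(pick_backup_target(targets, date.today()))
```

If you swap disks by hand and carry one offsite, as I do, the "rotation" is simply whichever disk is plugged in, but the rule is the same: never write over your only good copy.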

Now with that background covered, let me get to the point of the story. A disk crashed on my old Buffalo 1TB Terastation two weeks back. A very nice design – it sent me an email telling me that Disk 2 had an error. Fortunately, I was prepared and this was no worse than a minor fender-bender, but as those of you who have had such accidents know, it can still be very inconvenient and annoying. The Terastation is a NAS (Network Attached Storage) which I had configured in RAID 5, so it held about 750GB with redundancy on the four internal disks. I also had a 500GB USB disk attached, onto which the important contents were copied nightly. I rotate this disk periodically to my storage unit, so there is always a copy offsite, and the one attached to the Terastation is part of my “hurricane” kit. Living in Florida, we sometimes have to evacuate, and grabbing this drive is part of that evacuation plan.
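In case the 750GB figure looks odd for a "1TB" box: RAID 5 gives up one disk's worth of space to parity, so the usable capacity is the disk count minus one, times the disk size. A quick sketch (the function name is my own, and it ignores filesystem overhead):

```python
def raid5_usable_gb(disk_count: int, disk_size_gb: int) -> int:
    """Usable RAID 5 capacity: one disk's worth of space goes to parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    if disk_size_gb <= 0:
        raise ValueError("disk size must be positive")
    return (disk_count - 1) * disk_size_gb


if __name__ == "__main__":
    # Four 250GB disks, as in this Terastation.
    print(raid5_usable_gb(4, 250))  # prints 750
```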

My first discovery after the crash was that the RAID controller in the Terastation shuts down when it detects an error – it won’t do anything until you replace the disk. I guess this is for “safety”, but it’s not what I was hoping for. So, I took the backup USB disk and connected it to another computer. I could have made that drive the new master for further updates, but since I had an extra 1TB free on my Ubuntu server, I just copied everything from the USB to the shared drive on that server. Now everyone can continue using it the same way as when it was on the Terastation. That leaves the USB drive still as a backup, even though it’s starting to get a bit “stale” now. This could be compared to the “exchange of insurance information” stage in a fender-bender accident – it’s what you need to do in order to get back on the road.
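When a copy like this becomes the working master, it is worth confirming that everything actually made it across. This is not what I ran at the time, just a minimal sketch of one way to check: hash every file on both sides and list anything missing or different. The function names are my own.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digests[path.relative_to(root).as_posix()] = digest.hexdigest()
    return digests


def copy_problems(source: Path, copy: Path) -> list[str]:
    """Return relative paths that are missing from the copy or differ in content."""
    src = file_digests(source)
    dst = file_digests(copy)
    return sorted(name for name in src if dst.get(name) != src[name])
```

Calling `copy_problems` with the USB mount point as the source and the server share as the copy returns an empty list when every source file arrived intact. Extra files that exist only on the copy are deliberately ignored.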

Next comes the repair – estimating the damage and getting it fixed. Having a working backup freed me from worrying about restoring the RAID array, although I would really like to see the restore process work – there’s no apparent reason it shouldn’t, so I am going to try that. At this point, you could take this Terastation to any computer repair shop, much like you would bring your dented fender to a body shop. But being a tinkerer and knowing this should be easy, I wanted to replace the drive myself. Opening it up (the manual actually shows very clearly how to do this – nice job, Buffalo), I found that this particular Terastation uses four 250GB Samsung SP2514N disks. I was able to locate a number for sale, although not too cheap, this being from the 2005 era. Remember, a year for computer stuff is like a decade in the real world – looking for a drive from 2005 is like looking for a radiator for a 1950 Chevy. Now I knew that there were newer substitutes which would work, but I liked the idea of staying “original”, so I ordered the part from, of course, the lowest-priced supplier.

A few days later the replacement disk arrived and I installed it. The RAID controller recognized a new disk, asked if I wanted to rebuild the array on it, and away it went, reconstructing the data on the replacement drive. As anyone familiar with RAID knows, this is not a quick process, especially with the lightweight ARM processors in these NAS units – I figured it would take about a day. After maybe 12 hours, trouble hit – the RAID controller emailed me again about an error on Disk 2 – that was the one I had just replaced! When you are trying to troubleshoot a problem, you need to isolate the symptoms – with this error, I now could not tell if the problem was with the drive or the NAS controller. So, I pulled out an old test computer from storage and downloaded a very handy Hard Drive Diagnostic utility from Samsung. I plugged in the drive and ran the test from a boot floppy – errors screamed across the screen. I called the supplier and they were courteous and offered to exchange it or refund me. I also found out the reason it was less expensive was that it had been pulled from excess equipment (duh – I should have known this!). I called a number of other suppliers and found that all of the Samsung SP2514N disks on the market were used equipment, the best one having less than 10 hours powered on. At this point I decided to rethink my solution. I would rather do the necessary spec comparison and get a brand-new similar disk than run a used identical one. Remember, all disks will crash eventually, and a used disk is just closer to that time.

I then ordered a new Western Digital 500GB, with the hopes of eventually upsizing my 1TB (or really 750GB) to a 2TB unit.  This model can support up to 500GB disks, so I can later replace the other three with the same 500GB spec and rebuild the RAID array at twice the current size.  It’s kind of like finding a brand-new 1990 Chevy radiator that sort of fits in the 1950 Chevy, instead of pulling an original one out of the junk yard – the new one is much less likely to leak!

In the meantime, while waiting for the next replacement disk, I decided to run my original crashed Samsung disk through the same HDD testing software. Surprisingly, it passed with flying colors! So I installed it back in the Terastation just out of curiosity, and started the rebuild of the array again. Sadly, this too failed, the same as the other defective Samsung which I had shipped back, but now I’m a little more concerned. Either the drive is defective in a way not detected by the test software (which only performed non-destructive, read-only testing), or the redundant array has been lost and the controller cannot rebuild it, or the array controller is defective.

Now I am waiting for the next replacement and I will test the rebuild option when it comes.  If that doesn’t work, then I will try to delete and re-create the array, and if that doesn’t work, I’m afraid it’s going to be good-bye for the old Terastation.

To be continued…
