Down One Drive

Well, the fresh spring heat killed one of the drives on this server, so now we’re running on the spare. I’m debating whether, if I order an SSD to replace the system drive, I really need to do mirroring anymore. Like anything else, SSDs can fail too, but is failure an improbable enough event that it no longer justifies the cost of a second disk and the write overhead of mirroring? Then again, if I had relied on a single disk before, we’d be offline until Amazon could get me another one.

I have generally not bothered to mirror system drives when they are on SSD, believing SSDs are reliable enough to use alone for non-write-intensive applications. Or at least no less reliable than the other components in the server that you don’t have redundancy for. Fortunately, Linux has TRIM support for RAID1, so at least now I have the option. But should I bother?
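
For the record, here is a minimal sketch of how to check whether an md array actually passes discards down to its member disks, using the queue limits the kernel exposes in sysfs; the device names are just examples, so substitute your own array and members:

```python
# Minimal sketch: a block device advertises TRIM/discard support when
# /sys/block/<dev>/queue/discard_max_bytes is non-zero. The device
# names below are examples; substitute your own array and members.
from pathlib import Path

def supports_discard(dev: str) -> bool:
    """Return True if the kernel reports discard support for dev."""
    node = Path("/sys/block") / dev / "queue" / "discard_max_bytes"
    try:
        return int(node.read_text().strip()) > 0
    except (FileNotFoundError, ValueError):
        return False

for dev in ("md0", "sda", "sdb"):  # example: the array and its members
    state = "supported" if supports_discard(dev) else "not supported"
    print(f"{dev}: discard {state}")
```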

21 thoughts on “Down One Drive”

  1. Yes, you still need to mirror even with SSDs; I have had several fail in various new servers.

    1. Were they new when they failed? This is why I hate that I don’t do HPC anymore. I have no experience with SSDs across a very large number of systems.

      1. Yes, it seems that when SSDs fail, they fail early, which is kind of the opposite of standard drives.

  2. “Data you do not back up is data you do not want to keep”
    -LucusLoc, Genuine System Admin.

    The moment you plan a single point of failure (because that failure point is “so reliable”) is the moment that failure point will fail. RAID that sucker.

    1. Everything is backed up, so we’re good there. I’ve successfully restored from my backups so I know they work.

      Most servers have plenty of single points of failure. We did redundancy with spindle disks because they are electromechanical and by design far less reliable than the rest of the system. The question is, statistically, am I taking any bigger risk having one SSD than I am having one power supply, one CPU, or one motherboard?
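
      Here is a rough back-of-the-envelope way to frame it; the annualized failure rates below are made-up placeholders, not measurements:

      ```python
      # Back-of-the-envelope comparison only: the annualized failure
      # rates are made-up placeholders, not measured data.
      single_points = {
          "one SSD":          0.01,
          "one power supply": 0.02,
          "one motherboard":  0.01,
      }

      # Chance that at least one single point of failure dies in a year,
      # assuming independent failures:
      p_none = 1.0
      for rate in single_points.values():
          p_none *= 1.0 - rate
      print(f"P(some single point fails) ~ {1.0 - p_none:.3f}")

      # A mirrored pair only loses data if both drives die close together;
      # with independent failures the drive term drops to roughly rate squared.
      ssd = single_points["one SSD"]
      print(f"one SSD: {ssd:.2%}  mirrored pair: ~{ssd ** 2:.4%}")
      ```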

      1. Usually, a failure in any component other than the disk costs only time. Disk failure loses data, even if only what was written since the last backup.

        My vote is for RAID – having lost two SSDs in different machines which saw only light use.

      2. Yeah, I come from the enterprise world. One of anything is not enough. We have whole backup servers, and each server has hot-swap everything, from the disks right down to the CPUs. But even if you don’t have fancy high-availability everything, the HDD is where the data lives, and that is the only thing that is important. Everything else can be lost and replaced with no permanent damage, but a lost disk is lost data, unless that data is either in a RAID or has not changed since the last backup. Lost data is gone forever. You may not think it is that important, until you write that post of the century (or someone makes the comment of the century) and the server melts five seconds later. Hell, I get antsy when I’m waiting for my documents to get backed up for the first time (RAID for work laptops is somehow not a priority…).

  3. If it’s a server, at least buy server-grade SSDs. I second the opinion that you still need to mirror or otherwise RAID it. Or look into getting one of those PCIe SSD cards.

  4. From what I’ve read, once an SSD starts going bad, it’ll brick – you can’t recover anything once it does that.

  5. I’m curious. When you say “The fresh spring heat”, are you being serious, or are you being sarcastic?

    I only ask, because here in Utah, last night it was flirting with snowing. (Mostly just a few flakes…but it’s a few flakes in May!) We’re in the process of moving (again! sigh), so I made the comment last night “I didn’t want to move in the winter!”.

      1. That’s the kind of weather I should generally be expecting here about this time of year…

  6. First, you don’t need an SSD unless you are running heavy read-intensive applications with datasets larger than cache – and that usually applies to dynamic datasets that are unique per session. This is not your site. In your case, increase the overall cache if you have a responsiveness issue. Your data is static, so no problems. Done.

    Also, SSDs fail. Don’t buy the hype.

    And…creating a mirror with one SSD and one spinning platter is a nightmare. Don’t do it. You will not see any benefit, and the mismatched drives will probably gack something sooner or later. Mirrors are “mirrors”, meaning the members should be pretty close to identical. Yes, you can (and cheapo people often do) use alternate geometries, but you are just begging for trouble.

    FWIW, I have built out clustered systems with 14,000+ CPUs and many more thousands of disks all doing one job at the same time. I kinda know this shit, and apply the rules for big systems to my small ones. Life is easier that way. I treat my barely-visible websites like the big stuff and relax.

    Upside is it won’t cost you much money to just get two new disks from Newegg.com and be done with it. Better yet, buy three and stash the spare in a drawer. Or do like an employee of mine once did for his personal servers: duct tape it to the server so it never got lost. I still chuckle at that one.

      1. Yes, sorry for the long delay in responding. I’ve been offline with the kids a lot the past few days.

        We use SSDs for cache and indexer data and use spinning platters for bulk. It’s a pretty typical architecture these days for larger-scale “Big Data” type stuff.

        I don’t have the numbers in front of me (and probably cannot quote them due to NDA with the makers, anyway), but SSDs have an initial failure rate higher than spinning disks. If they survive first contact with data, though, they generally hold up for years before issues creep in. The truth is that the underlying controller technology has changed enough that a five-year-old SSD is not comparable to one made last year in the parts that count toward MTBF calculations. So basically we are still modeling those (i.e., “watch, wait, and tell the clients to have spares”).
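
        To put an MTBF spec in more intuitive terms, here is a quick sketch converting it to an annualized failure rate under the usual constant-failure-rate assumption; the 1,200,000-hour figure is an illustrative placeholder, not a quoted spec:

        ```python
        # MTBF -> annualized failure rate (AFR) under a constant-rate
        # (exponential) failure model. The MTBF is a placeholder, not a
        # real datasheet number.
        import math

        HOURS_PER_YEAR = 8760

        def afr_from_mtbf(mtbf_hours: float) -> float:
            """AFR implied by a constant failure rate with the given MTBF."""
            return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

        mtbf = 1_200_000  # placeholder MTBF in hours
        print(f"MTBF {mtbf:,} h -> AFR ~ {afr_from_mtbf(mtbf):.2%}")
        ```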

        Spinning disks (including “hybrids”) generally have a linearly increasing rate of failure over the years, while SSDs have a slight bump at the start and then stay pretty flat for a few years. Then failures start to climb a bit. I think they are more reliable for some operations (read-intensive ones), and my code hits them hard. They tend to hold up. Again, I don’t know what that means for the SSDs we saw deployed last fall. The manufacturers all claim they are more reliable than past devices, and they are probably correct.

        I’d say get a few SSDs for your server, if you haven’t already. Mirror them. Consumer SSDs are pretty cheap, and it’s not like you need more than a few GB. I ordered an off-the-shelf notebook from Amazon and replaced the stock drive with a new SSD before hitting the power button for the first time. They excel in notebooks and home computers.

        My initial point was that whatever you do, you probably want two of them to avoid headache down the road (or three if you want a spare). SSDs are generally more reliable, but they are not bomb-proof.

        1. “We use SSDs for cache and indexer data and use spinning platters for bulk. It’s a pretty typical architecture these days for larger-scale ‘Big Data’ type stuff.”

          That’s what I’m using too. But that’s a bit more than I need for a WordPress server.

          “I don’t have the numbers in front of me (and probably cannot quote them due to NDA with the makers, anyway), but SSDs have an initial failure rate higher than spinning disks.”

          That’s what I’m finding too.

          “My initial point was that whatever you do, you probably want two of them to avoid headache down the road (or three if you want a spare). SSDs are generally more reliable, but they are not bomb-proof.”

          Yeah, I’ve been convinced to continue using a mirrored pair even if I order SSDs.

  7. For context, on the Gen9 HP servers we boot our ESX hosts off a single SD card. They don’t reboot often, we can afford to lose a host temporarily, and we have backup SD cards if a card goes bad.
    But in that case, the OS is basically loaded into memory at boot and that’s it. So you can see how the characteristics of its use afford us the luxury of that implementation.
    If you’re running Windows or Ubuntu with a GUI as your server OS (something characterized by more random R/W, swapping, etc.), I would use mirrored SSDs for the system drive if that’s an option, high-speed physical disks for your data array(s), and plenty of memory in the server. The SSD will provide great performance for the OS, but they do go bad. In fact, more often than physical failure, I’ve seen problems with SSDs in the desktop world (just a factor of quantity and variety) caused by either the drive’s firmware or the OS driver. Nothing that wasn’t recoverable (IME), but massive performance hits. I think this would be apparent pretty quickly after the drive goes into service, though.

  8. My own SOP is to put the system on SD cards or internal USB headers. Most Dell PowerEdge and HP servers will support a paired SD card arrangement. Given how inexpensive SD cards are, this has the advantage of being cheap, reliable, and easy to fix. As in, if an SD card fails, I can just run off the other one and replace the bad one. If both fail, I just copy my latest backup to an SD card and insert it into the server.

    This is how I’ve been doing all my Hyper-V and ESXi deploys of late. The guest OS storage is either SD cards / USB drives in RAID1 or spinning rust in RAID6.

    Sometimes weird stuff happens. I had a couple of early Kingston SSD drives that were absolute disasters. In the past couple of years I’ve lost two additional SSDs, one an Intel of some flavor and the other a no-name Chinese device I bought on eBay.

    In any case, mirroring is better.

  9. I’m a retired EE/reliability engineer; just like with guns, two is one and one is none.
