What Happens when SSDs Fail?
There’s a lot of “Fear, Uncertainty, and Doubt” – FUD – circulating about SSDs and their penchant for failure. NAND flash wears out after a set number of erase/write cycles, a specification known as the flash’s endurance.
While some caution is warranted, a good understanding of how SSDs really behave will help to allay a lot of this concern.
Today’s SSDs are likely to outlast the need for them in almost any application. That may not seem to be the case given the limited information that many hesitant IT managers have about flash. Yes, NAND flash does tolerate only a limited number of erase/write cycles, but much of this is well hidden from the user through management schemes that will be saved for a later post. This means that the 100,000, 10,000, or even as few as 7,000 writes that a block is guaranteed to endure before failing are extended by the controller to numbers in the millions. For the user, blocks are significantly less likely to fail than one might expect.
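To get a feel for how a few thousand cycles per block can turn into millions of writes, here is a deliberately idealised sketch. It assumes the controller redirects every rewrite of one logical block to the least-worn block in a pool of physical blocks. The function name, pool size, and cycle counts are illustrative assumptions, not a description of any particular controller, and the model ignores real-world complications such as write amplification.

```python
def rewrites_until_wearout(pool_blocks, rated_cycles):
    """Count how many times one logical block can be rewritten when each
    rewrite lands on the least-worn physical block in the pool."""
    wear = [0] * pool_blocks  # erase/write count per physical block
    rewrites = 0
    while True:
        # Pick the physical block with the fewest cycles used so far.
        target = min(range(pool_blocks), key=wear.__getitem__)
        if wear[target] >= rated_cycles:
            return rewrites  # even the least-worn block is exhausted
        wear[target] += 1
        rewrites += 1


# Small simulation: 4 physical blocks rated for 5 cycles each.
print(rewrites_until_wearout(pool_blocks=4, rated_cycles=5))  # 20

# Under this model the result is simply pool size times rated cycles,
# so larger (hypothetical) figures can be computed directly.
pool_blocks, rated_cycles = 1_000, 10_000
print(pool_blocks * rated_cycles)  # 10000000
```

Under these assumptions, a block rated for 10,000 cycles backed by a pool of 1,000 blocks absorbs ten million rewrites of the same logical address.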
Even so, today’s controllers take steps that help the IT manager to manage wear and avoid potential outages. First, since the wear mechanism is well understood, the controller can keep track of each block and can report back to the user when the SSD is approaching the end of its useful life. This information is available through ANSI’s standard ATA SMART commands (Self-Monitoring, Analysis and Reporting Technology). The only drawback is that current storage management software doesn’t proactively put this information in front of the operator, so it is up to the operator to interrogate each SSD to determine when it should be replaced.
Eventually this task will be automated and an alarm will warn the operator when an SSD is approaching its limit.
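Until that happens, a small script can do the interrogating. The sketch below assumes the SMART attributes for each drive have already been read into plain dictionaries with "name", "value", and "thresh" fields, and that the normalized value counts down toward the vendor's threshold. The attribute names listed are ones some vendors use for wear, but the list is an assumption rather than a complete catalogue, and the sample readings are made up.

```python
# Attribute names some vendors use to report flash wear.
# Assumption: illustrative only; check each drive's documentation.
WEAR_ATTRIBUTE_NAMES = {
    "Media_Wearout_Indicator",
    "Wear_Leveling_Count",
    "SSD_Life_Left",
}


def wear_alarms(drives, margin=10):
    """Return (device, attribute, value, thresh) for every wear attribute
    whose normalized value is within `margin` points of its threshold."""
    alarms = []
    for device, attributes in drives.items():
        for attr in attributes:
            if attr["name"] not in WEAR_ATTRIBUTE_NAMES:
                continue  # not a wear indicator, skip it
            if attr["value"] <= attr["thresh"] + margin:
                alarms.append(
                    (device, attr["name"], attr["value"], attr["thresh"])
                )
    return alarms


# Hypothetical readings for two drives.
sample = {
    "/dev/sda": [
        {"name": "Media_Wearout_Indicator", "value": 97, "thresh": 0},
        {"name": "Power_On_Hours", "value": 100, "thresh": 0},
    ],
    "/dev/sdb": [
        {"name": "Wear_Leveling_Count", "value": 8, "thresh": 5},
    ],
}

for alarm in wear_alarms(sample):
    print("replace soon:", alarm)
```

With the sample data only /dev/sdb is flagged, since its wear value sits within ten points of its threshold.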
Second, when an SSD is allowed to run right to the end of its useful life, most controllers put the SSD into a “Read-Only” state that allows the operator to remove the SSD, copy its contents onto another device, then re-start operations in the shortest amount of time. This is a far cry from the total loss of an HDD. The SSD could even be replicated as an alternative to performing a slower RAID rebuild.
It is helpful, though, for the IT professional to have a good understanding of the workload in the targeted system, especially how much write activity is occurring. Many SSDs are starting to adopt endurance specifications stating that the SSD can be completely over-written 10 times a day for its warranty period without failing. As long as the write load is below this number, there should be no reason to worry that the SSD will wear out in the system.
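Checking a workload against such a specification is simple arithmetic. The sketch below assumes a hypothetical 400 GB drive, a measured write load of 1,200 GB per day, and a five-year warranty. Only the figure of 10 full over-writes per day comes from the text above; every other number and all of the function names are illustrative.

```python
GB = 10**9  # decimal gigabyte, as used in drive capacities


def drive_writes_per_day(bytes_written_per_day, capacity_bytes):
    """How many times the whole drive is over-written each day."""
    return bytes_written_per_day / capacity_bytes


def within_endurance_spec(bytes_written_per_day, capacity_bytes, rated_dwpd=10):
    """True if the daily write load stays at or below the rated over-writes."""
    return drive_writes_per_day(bytes_written_per_day, capacity_bytes) <= rated_dwpd


def rated_total_bytes(capacity_bytes, rated_dwpd, warranty_years):
    """Total bytes the specification covers over the warranty period."""
    return capacity_bytes * rated_dwpd * 365 * warranty_years


capacity = 400 * GB        # hypothetical drive size
daily_writes = 1_200 * GB  # hypothetical measured workload

print(drive_writes_per_day(daily_writes, capacity))   # 3.0
print(within_endurance_spec(daily_writes, capacity))  # True
print(rated_total_bytes(capacity, 10, 5) / 10**15)    # 7.3 (petabytes)
```

In this example the workload over-writes the drive three times a day, comfortably under the rated ten.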
More detailed information on flash wear is available in the white paper “NAND Flash Storage for the Enterprise – an In-Depth Look at Reliability,” which can be downloaded from the Objective Analysis home page.