The SSD Guy has been asked a number of questions lately about SSDs and RAID. Most of these center around the difference in failure behaviors between SSDs and HDDs – HDDs fail randomly (if ever), while SSDs fail relatively predictably due to wear.
Oddly enough, wear-related failures actually make SSDs a little friendlier than HDDs. The wear mechanism is managed by the SSD’s controller: since the controller manages the SSD’s spare blocks, it knows exactly how much wear the drive has undergone and how much room is left before the SSD will start to have difficulties.
Furthermore, an SSD that has failed due to wear locks itself into a read-only mode: The entire SSD’s data can be read and copied into another drive, even though the SSD no longer accepts writes.
The information on the wear status of the SSD is usually made available to the user through the SMART reporting standard that is now implemented in both SSDs and HDDs. Today there are few programs that automatically interrogate SMART statistics, so it is up to the system administrator to proactively check how the SSDs are doing. That said, if the SysAdmin monitors the SSDs regularly, then each SSD’s wear-out date can practically be marked on a calendar.
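As a concrete illustration, here is a minimal sketch of the kind of check a SysAdmin might script, using the smartctl utility from the smartmontools package. The device path and the attribute names are assumptions for the example; SATA SSDs in particular report wear through vendor-specific attributes, so the names on a real drive may differ.

```python
# Minimal sketch: query an SSD's wear status via smartctl (smartmontools).
# The device path and attribute names below are assumptions for the example;
# SATA vendors use different attribute names, so check your drive's output.
import json
import subprocess

def ssd_wear_percent(device="/dev/nvme0"):
    """Return an estimate of how much of the drive's rated life has been used."""
    # "smartctl -j -a" dumps all SMART data for the device as JSON.
    out = subprocess.run(["smartctl", "-j", "-a", device],
                         capture_output=True, text=True, check=False)
    data = json.loads(out.stdout)

    # NVMe drives report a standard "percentage_used" field in the health log.
    nvme_log = data.get("nvme_smart_health_information_log", {})
    if "percentage_used" in nvme_log:
        return nvme_log["percentage_used"]

    # SATA SSDs expose vendor-specific attributes instead; these two are
    # common examples and typically count down from 100 as the drive wears.
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        if attr["name"] in ("Media_Wearout_Indicator", "Wear_Leveling_Count"):
            return 100 - attr["value"]
    return None

if __name__ == "__main__":
    print(f"Approximate wear used: {ssd_wear_percent()}%")
```

Run periodically (from cron, say), a check like this is all it takes to keep an eye on how much rated life each drive has left.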
So far this all sounds good, but in a RAID configuration this can cause trouble. Here’s why.
RAID is based upon the notion that HDDs fail randomly. When an HDD fails, a technician replaces the failed drive and issues a rebuild command. It is enormously unlikely that another disk will fail during a rebuild. If SSDs replace the HDDs in this system, and if the SSDs all come from the same vendor and from the same manufacturing lot, and if they are all exposed to similar workloads, then they can all be expected to fail at around the same time.
This implies that a RAID that has suffered an SSD failure is very likely to see another failure during a rebuild – a scenario that causes the entire RAID to collapse.
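To put rough numbers on that risk, here is a back-of-the-envelope sketch. The per-drive failure probabilities are purely illustrative assumptions, not measured figures; the point is only how quickly the odds turn against a rebuild once the surviving drives are all near end of life.

```python
# Back-of-the-envelope rebuild risk. The per-drive probabilities are
# illustrative assumptions, not measured figures.

def rebuild_failure_risk(per_drive_failure_prob, surviving_drives):
    """Probability that at least one surviving drive fails during the rebuild."""
    return 1 - (1 - per_drive_failure_prob) ** surviving_drives

# HDDs: failures are roughly independent and rare, so the chance of a
# second failure during the rebuild window is small.
print(rebuild_failure_risk(0.001, 7))   # about 0.7% across 7 survivors

# SSDs bought and worn together: each survivor may be near end of life
# when the first one drops out, and the odds flip dramatically.
print(rebuild_failure_risk(0.30, 7))    # about 92% -- a second failure is likely
```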
There are two very simple means of avoiding this problem: The first is to monitor the SSDs’ health using the SMART protocol. As we mentioned earlier, this will allow the SysAdmin to determine when a problem is about to manifest itself and to take steps to prevent a calamity. An alternative, especially if an existing RAID system is being converted to SSDs, is to introduce SSDs in phases: If one HDD is replaced with an SSD every month, then the SSDs will be likely to fail at monthly intervals rather than simultaneously.
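As a sketch of how that “mark it on a calendar” monitoring might work, the snippet below projects a wear-out date from two logged wear readings. The linear extrapolation and the sample readings are assumptions; a real workload may speed up or slow down over time, so the projection should be refreshed with each new reading.

```python
# Sketch: project an SSD's wear-out date from periodic SMART readings.
# Linear extrapolation and the sample readings are assumptions; refresh
# the projection as new readings come in.
from datetime import date, timedelta

def projected_wearout(date_a, wear_a, date_b, wear_b, limit=100.0):
    """Linearly extrapolate when wear (percent of rated life used) hits the limit."""
    days = (date_b - date_a).days
    rate = (wear_b - wear_a) / days          # percent of rated life per day
    if rate <= 0:
        return None                           # no measurable wear between readings
    return date_b + timedelta(days=(limit - wear_b) / rate)

# Example: 62% of rated life used on one reading, 68% three months later.
print(projected_wearout(date(2013, 1, 1), 62.0, date(2013, 4, 1), 68.0))
```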
Another possibility is to use SSDs from a number of vendors. This adds significant unpredictability to the drives’ failure times, making simultaneous wear-out unlikely, so it too is a workable solution.
In a nutshell, SSDs fail from wear, but unlike HDDs they fail predictably and with plenty of warning. With appropriate new disciplines they can actually improve the reliability of a RAID environment.
The problem is when an SSD fails altogether because of some sort of malfunction. Let’s say some circuit on the board burns out. Then the entire disk is gone, not recoverable. That is why RAID 1 is still good for SSDs. You have the entire storage mirrored on a second disk. It’s not about wear and tear or loss of performance. It’s about data integrity. And you can’t set up automatic regular backups when your SSD is encrypted. So with RAID 1 you get data integrity while preserving data security.
Michael, Thanks for the comment.
The point of this post was to show the difference between HDD and SSD failures in a RAID setup. When an SSD fails from a circuit board failure, then it’s basically the same as when an HDD fails due to a circuit board failure.
The big difference between HDDs and SSDs in RAID configurations stems from SSDs’ wear mechanisms, which don’t exist in HDDs.
I am not sure what you mean when you say: “You can’t set up automatic regular backups when your SSD is encrypted.” There’s really no reason for encryption to get in the way of backups. Maybe I am misunderstanding – could you explain that further?
Thanks for the comment,
Jim