The SSD Guy has been asked a number of questions lately about SSDs and RAID. Most of these center around the difference in failure behaviors between SSDs and HDDs – HDDs fail randomly (if ever), while SSDs fail relatively predictably due to wear.
Oddly enough, failing from wear makes SSDs a little friendlier than HDDs. The wear mechanism is managed by the controller in the SSD. SSDs have spare blocks, and the controller manages those blocks, so the controller knows how much wear the SSD has undergone and how much room is left before the SSD will start to have difficulties.
Furthermore, an SSD that has failed due to wear is typically designed to lock itself into a read-only mode: The entire SSD’s data can be read and copied onto another drive, even though the SSD no longer accepts writes.
The information on the wear status of the SSD is usually made available to the user through the SMART reporting standard that is now supported by both SSDs and HDDs. Today there are few programs that automatically interrogate SMART statistics, so it is up to the system administrator to check proactively how the SSDs are doing. That said, if the SysAdmin monitors the SSD regularly and the workload stays roughly steady, then the SSD’s wear-out date can practically be marked on a calendar.
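To make the calendar idea concrete, here is a minimal sketch that extrapolates a wear-out date from two wear readings. It assumes the drive reports wear as a percentage of rated life consumed (NVMe drives expose a "Percentage Used" field in their SMART health log; SATA attribute names vary by vendor) and that wear grows roughly linearly under a steady workload. The function name and the sample readings are hypothetical.

```python
from datetime import date, timedelta

def projected_wearout(first_day, first_pct, last_day, last_pct):
    """Linearly extrapolate the date on which percent-used reaches 100.

    Returns None when the two readings show no measurable wear,
    because no rate can be derived from them.
    """
    elapsed_days = (last_day - first_day).days
    used = last_pct - first_pct
    if elapsed_days <= 0 or used <= 0:
        return None
    rate_per_day = used / elapsed_days      # percent of life consumed per day
    remaining = 100.0 - last_pct            # percent of life still available
    return last_day + timedelta(days=round(remaining / rate_per_day))

# Hypothetical readings: 10% used on 1 January, 15% used 100 days later.
print(projected_wearout(date(2024, 1, 1), 10, date(2024, 4, 10), 15))
```

A real monitor would take many readings and refit the slope as the workload changes; two points are simply the smallest case that shows the arithmetic.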
So far this all sounds good, but in a RAID configuration this can cause trouble. Here’s why.
RAID is based upon the notion that HDDs fail randomly. When an HDD fails, a technician replaces the failed drive and issues a rebuild command. It is enormously unlikely that another disk will fail during a rebuild. If SSDs replace the HDDs in this system, and if the SSDs all come from the same vendor and from the same manufacturing lot, and if they are all exposed to similar workloads, then they can all be expected to fail at around the same time.
This implies that a RAID that has suffered an SSD failure is very likely to see another failure during a rebuild – a scenario that causes the entire RAID to collapse.
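The risk can be checked with simple arithmetic: if two drives are projected to wear out within one rebuild window of each other, the second may fail before the rebuild finishes. The sketch below flags such pairs. The function name, the day counts, and the two-day rebuild window are all illustrative assumptions, not measurements.

```python
def overlapping_failures(wearout_days, rebuild_days):
    """Return pairs of drive indices whose projected wear-out days
    fall within one rebuild window of each other."""
    # Sort drives by projected wear-out day, remembering their original index.
    ordered = sorted(enumerate(wearout_days), key=lambda item: item[1])
    risky = []
    # Only neighbours in sorted order need comparing to find close pairs.
    for (i, day_i), (j, day_j) in zip(ordered, ordered[1:]):
        if day_j - day_i <= rebuild_days:
            risky.append((i, j))
    return risky

# Hypothetical array: same lot, same workload, so wear-out days cluster.
same_lot = [1000, 1001, 1003, 1002]
# Hypothetical array whose drives wear out a month apart.
spread_out = [1000, 1030, 1060, 1090]

print(overlapping_failures(same_lot, rebuild_days=2))
print(overlapping_failures(spread_out, rebuild_days=2))
```

An empty result means no two drives are projected to wear out inside the same rebuild window.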
There are a few simple ways to avoid this problem. The first is to monitor the SSDs’ health using SMART. As we mentioned earlier, this will allow the SysAdmin to determine when a problem is about to manifest itself and to take steps to prevent a calamity. An alternative, especially if an existing RAID system is being converted to SSDs, is to introduce SSDs in phases: If one HDD is replaced with an SSD every month, then the SSDs are likely to fail at roughly monthly intervals rather than simultaneously.
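The phased approach can be sketched as a schedule. The example below assumes, as the argument above does, that every SSD lasts about the same time under the same workload; the 30-day interval and 1,500-day lifetime are made-up figures, and the function name is hypothetical.

```python
from datetime import date, timedelta

def phased_schedule(start, drives, interval_days, lifetime_days):
    """Return (install_date, projected_wearout_date) for each drive
    when one drive is introduced every interval_days."""
    schedule = []
    for n in range(drives):
        install = start + timedelta(days=n * interval_days)
        schedule.append((install, install + timedelta(days=lifetime_days)))
    return schedule

# Hypothetical four-drive array converted one drive at a time.
for install, wearout in phased_schedule(date(2024, 1, 1), 4, 30, 1500):
    print(f"installed {install}, projected wear-out {wearout}")
```

Because each drive starts its life one interval later, the projected wear-out dates keep the same spacing, which leaves time to rebuild between failures. In practice the drives installed early carry a different share of the load during the migration, so the spacing is approximate.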
Another possibility is to use SSDs from a number of vendors. This makes the timing of failures in the RAID considerably less predictable, but it is a workable solution.
In a nutshell, SSDs fail from wear, but unlike HDDs they fail predictably and with plenty of warning. With appropriate new disciplines they can actually improve system reliability in a RAID environment.