How Controllers Maximize SSD Life – Improved ECC
Error correction coding (ECC) can have a major impact on the longevity of an SSD, although few understand how such a standard item can make much difference to an SSD’s life. The SSD Guy will try to explain it in relatively simple terms here.
All NAND flash requires ECC to correct random bit errors (“soft” errors). This is because the inside of a NAND chip is very noisy and the signal levels of bits passed through a NAND string are very weak. One of the ways that NAND has been able to become the cheapest of all memories is by requiring error correction external to the chip.
This same error correction also helps to correct bit errors due to wear. Wear can cause bits to become stuck in one state or the other (a “hard” error), and it can increase the frequency of soft errors.
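To make the idea concrete, here is a minimal sketch using a toy Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can repair any single bad bit. This is a stand-in chosen for brevity, not what an SSD controller actually runs: real controllers use much stronger codes over far larger chunks of data. The function names `encode` and `decode` are made up for this illustration. The point is only that the decoder does not care whether a bit was flipped by noise (a soft error) or is stuck because of wear (a hard error).

```python
# Toy Hamming(7,4) code: illustration only, not a real NAND ECC scheme.

def encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(codeword):
    """Repair at most one bad bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # check positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # check positions 4, 5, 6, 7
    bad_position = s1 + 2 * s2 + 4 * s3  # 1-based; 0 means no error seen
    if bad_position:
        c[bad_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
stored = encode(data)

# Soft error: noise flips one bit during a read.
noisy_read = list(stored)
noisy_read[5] ^= 1
print(decode(noisy_read) == data)

# Hard error: a worn cell is stuck at 0 even though a 1 was written.
worn_read = list(stored)
worn_read[2] = 0
print(decode(worn_read) == data)
```

Both reads decode back to the original data. A second bad bit in the same codeword would defeat this toy code, which is why the number of correctable bits matters so much in what follows.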
Although it is not widely understood, flash endurance is a measure of the number of erase/write cycles a flash block can withstand before hard errors might occur. Most often these failures are individual bit failures – it is rare for the entire block to fail. With a high enough number of erase/writes the soft error rate increases as well, due to a number of mechanisms that I won’t go into here. If ECC can be used to correct both these hard errors and the increase in soft errors, then it can help lengthen the life of a block beyond its specified endurance.
Here’s an example: Let’s say that an unworn NAND chip has enough soft errors to require 8 bits of ECC – that is, every page read could have as many as 8 bits that have been randomly corrupted (usually from on-chip noise). If the ECC that is used with this chip can correct 12 bit errors per page, that leaves 4 bits of headroom, so a page would have to have 8 soft errors plus another 5 wear-related errors (13 in all) to go beyond the ECC’s capability to correct the data.
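The arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the numbers are the ones from the example above (8 worst-case soft errors, ECC that corrects 12 bits per page).

```python
def wear_error_margin(ecc_correctable_bits, worst_case_soft_errors):
    """Wear-related bit errors a page can absorb and still be corrected."""
    return ecc_correctable_bits - worst_case_soft_errors

def is_correctable(soft_errors, wear_errors, ecc_correctable_bits):
    """A page read is recoverable if total bad bits fit within the ECC."""
    return soft_errors + wear_errors <= ecc_correctable_bits

ECC_BITS = 12
SOFT_ERRORS = 8

print(wear_error_margin(ECC_BITS, SOFT_ERRORS))  # 4 bits of headroom
print(is_correctable(SOFT_ERRORS, 4, ECC_BITS))  # 12 bad bits: still fine
print(is_correctable(SOFT_ERRORS, 5, ECC_BITS))  # 13 bad bits: too many
```

The fifth wear-related error is the one that pushes a worst-case read past what the ECC can repair.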
Now the first of those 5 failures is guaranteed by the flash manufacturers to occur sometime after the endurance specification: No bits will fail until there have been 10,000 (or 5,000, or 3,000…) erase/write cycles. Specifications are not sophisticated enough to predict when the next bit will fail, but there may be several thousand more cycles before that occurs. It will therefore take significantly more cycles than the specified endurance before a page becomes so corrupt that it needs to be decommissioned, and the error-corrected endurance of the block could be many times the specified endurance, depending on the number of excess errors the ECC is designed to correct.
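As a rough back-of-the-envelope sketch of this reasoning, suppose each additional wear-related bit failure arrives a fixed number of cycles after the previous one. That is a loud assumption: specifications say nothing about the spacing, real failures will not be evenly spaced, and the 3,000-cycle figure below is purely illustrative. The function name is also made up for this sketch.

```python
def corrected_endurance(spec_cycles, ecc_bits, soft_error_bits,
                        cycles_per_extra_failure):
    """Estimate the cycles before a page becomes uncorrectable.

    Assumes the first wear failure arrives right at the endurance spec
    (the earliest the guarantee allows) and that every later failure
    arrives cycles_per_extra_failure cycles after the one before it.
    """
    margin = ecc_bits - soft_error_bits  # wear errors the ECC can absorb
    # The page dies at wear failure number (margin + 1), which is
    # `margin` failures after the first one.
    return spec_cycles + margin * cycles_per_extra_failure

# 10,000-cycle spec, 12-bit ECC, 8 soft errors, hypothetical 3,000-cycle gaps
print(corrected_endurance(10_000, 12, 8, 3_000))  # 22000
```

Under these assumed numbers the page survives a little over twice its specified endurance, and every extra bit of correction adds another gap's worth of cycles.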
This all comes at a price. More sophisticated ECC requires more processing power in the controller and may be slower than less sophisticated algorithms. Also, the number of errors that can be corrected can depend upon how large a segment of memory is being corrected. A controller with elaborate ECC capabilities is likely to use more compute resources and more internal RAM than would one with simpler ECC. These enhancements will make the controller more expensive.
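One way to see part of this cost is to count the extra parity bits that must be stored alongside the data. The sketch below assumes a binary BCH code, a family commonly used with NAND flash but not one this post discusses, and relies on the standard bound that correcting t errors in a codeword of at most 2^m - 1 bits takes no more than m × t parity bits. The function name and the segment sizes are illustrative choices.

```python
def bch_parity_bits(data_bits, t):
    """Upper bound on parity bits for a binary BCH code correcting t errors.

    Finds the smallest m such that data plus parity fits in a codeword
    of 2**m - 1 bits, then returns m * t.
    """
    m = 1
    while (2 ** m - 1) < data_bits + m * t:
        m += 1
    return m * t

for segment_bytes in (512, 1024):
    for t in (8, 12):
        parity = bch_parity_bits(segment_bytes * 8, t)
        print(segment_bytes, "bytes,", t, "bit correction:",
              parity, "parity bits")
```

More correctable bits mean more parity to store and more work for the decoder, and the size of the segment being protected changes how many parity bits each correctable error costs.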
ECC algorithms are their own special other-worldly state of mathematics, and although The SSD Guy is pretty comfortable with math, I don’t even try to understand the difference between the more basic Reed-Solomon coding and LDPC. When someone talks to me about the Shannon Limit (the theoretical ceiling on how much data can be carried reliably through a noisy channel, and so on how well any error-correcting code can perform) my eyes glaze over. Suffice it to say that I am awestruck by the intelligence of the folks who have mastered ECC, and their ability to extract more life out of a flash block than any mere mortal would think possible.
Just remember that more bits of error correction lead to a longer usable life for a flash block before it needs to be decommissioned.
This post is part of a series published by The SSD Guy in September-November 2012 to describe the leading methods SSD architects use to get the longest life out of an SSD despite the limited number of erase/write cycles that NAND flash specifications guarantee. The following list provides the names of all of these articles, and hot links to them:
- Wear Leveling
- External Data Buffering
- Improved ECC
- Other Error Management
- Reduced Write Amplification
- Over Provisioning
- Feedback on Block Wear
- Internal NAND Management
Click on any of the above links to learn about how each of these techniques works.
Alternatively, you can visit the Storage Networking Industry Association (SNIA) website to download the entire series as a 20-page booklet in pdf format.