How Controllers Maximize SSD Life – Improved ECC

Error correction (ECC) can have a very big impact on the longevity of an SSD, although few understand how such a standard item can make much difference to an SSD’s life.  The SSD Guy will try to explain it in relatively simple terms here.

All NAND flash requires ECC to correct random bit errors (“soft” errors).  This is because the inside of a NAND chip is very noisy and the signal levels of bits passed through a NAND string are very weak.  One of the ways that NAND has been able to become the cheapest of all memories is by requiring error correction external to the chip.

This same error correction also helps to correct bit errors due to wear.  Wear can cause bits to become stuck in one state or the other (a “hard” error), and it can increase the frequency of soft errors.

Although it is not widely understood, flash endurance is the measure of the number of erase/write cycles a flash block can withstand before hard errors might occur.  Most often these failures are individual bit failures – it is rare for the entire block to fail.  With a high enough number of erase/write cycles the soft error rate increases as well, due to a number of mechanisms that I won’t go into here.  If ECC can be used to correct both these hard errors and an increase in soft errors then it can help lengthen the life of a block beyond its specified endurance.
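
To make this concrete, here is a minimal sketch (in Python, with made-up numbers) of the kind of decision a controller could base on that idea: keep a worn block in service as long as its worst page still has correctable headroom.  This is only an illustration of the principle, not any particular controller’s firmware, and the function name, the margin, and the figures are all assumptions.

# Hypothetical retirement check: keep using a worn block while the worst
# page's observed bit errors leave some headroom under the ECC's limit.
def should_retire_block(worst_page_bit_errors: int,
                        correctable_bits: int,
                        safety_margin: int = 2) -> bool:
    """True once a page gets too close to the ECC's correction capability."""
    return correctable_bits - worst_page_bit_errors < safety_margin

# A block whose worst page shows 9 bad bits against a 12-bit ECC still has
# 3 bits of headroom, so it stays in service (with an assumed 2-bit margin).
print(should_retire_block(worst_page_bit_errors=9, correctable_bits=12))   # False
print(should_retire_block(worst_page_bit_errors=11, correctable_bits=12))  # True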

Here’s an example: Let’s say that an unworn NAND chip has enough soft errors to require 8 bits of ECC – that is, every page read could have as many as 8 bits that have been randomly corrupted (usually from on-chip noise).  If the ECC that is used with this chip can correct 12 bit errors then a page would have to have 8 soft errors plus another 5 wear-related errors to go beyond the ECC’s capability to correct the data.
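
Putting numbers on that example, the short sketch below works through the same arithmetic (the figures are the ones used above, not from any datasheet):

# How many wear-related bit errors fit on a page before it exceeds the ECC?
MAX_SOFT_ERRORS = 8    # worst-case random ("soft") errors per page read
ECC_CORRECTABLE = 12   # bit errors the controller's ECC can correct per page

wear_budget = ECC_CORRECTABLE - MAX_SOFT_ERRORS
print(wear_budget)      # 4 wear-related errors are still correctable;
print(wear_budget + 1)  # the 5th pushes the page beyond the ECC's reach.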

Now the first of those 5 wear-related failures is guaranteed by the flash manufacturers to occur sometime after the endurance specification: no bits will fail until there have been 10,000 (or 5,000, or 3,000…) erase/write cycles.  Specifications are not sophisticated enough to predict when the next bit will fail, but there may be several thousand more cycles before that occurs.  This implies that it will take significantly more than the specified endurance before a page becomes so corrupt that it needs to be decommissioned.  This means that the error-corrected endurance of the block could be many times the specified endurance, depending on the number of excess errors the ECC is designed to correct.
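
As a back-of-envelope illustration of that multiplying effect, the sketch below assumes, purely hypothetically, that each additional wear-related bit failure on a page arrives only after a couple of thousand more erase/write cycles; neither the rated figure nor the per-failure spacing comes from any real datasheet.

# Rough, hypothetical estimate of error-corrected endurance: the rated cycle
# count plus the extra cycles bought by each spare correctable bit.
def effective_endurance(rated_cycles: int,
                        spare_correctable_bits: int,
                        cycles_per_extra_failure: int) -> int:
    return rated_cycles + spare_correctable_bits * cycles_per_extra_failure

# 3,000-cycle rated flash, 4 spare correctable bits, and an assumed ~2,000
# cycles between successive wear-related bit failures:
print(effective_endurance(3000, 4, 2000))  # 11000, several times the rating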

This all comes at a price.  More sophisticated ECC requires more processing power in the controller and may be slower than less sophisticated algorithms.  Also, the number of errors that can be corrected can depend upon how large a segment of memory is being corrected.  A controller with elaborate ECC capabilities is likely to use more compute resources and more internal RAM than would one with simpler ECC.  These enhancements will make the controller more expensive.

ECC algorithms are their own special other-worldly branch of mathematics, and although The SSD Guy is pretty comfortable with math, I don’t even try to understand the difference between the more basic Reed-Solomon coding and LDPC.  When someone talks to me about the Shannon Limit (the theoretical maximum amount of data that can be reliably recovered from a noisy channel) my eyes glaze over.  Suffice it to say that I am awestruck by the intelligence of the folks who have mastered ECC, and by their ability to extract more life out of a flash block than any mere mortal would think possible.

Just remember that more bits of error correction lead to a longer usable life for a flash block before it needs to be decommissioned.

This post is part of a series published by The SSD Guy in September-November 2012 to describe the leading methods SSD architects use to get the longest life out of an SSD despite the limited number of erase/write cycles that NAND flash specifications guarantee.  The following list provides the names of all of these articles, and hot links to them:

Click on any of the above links to learn about how each of these techniques works.

Alternatively, you can visit the Storage Networking Industry Association (SNIA) website to download the entire series as a 20-page booklet in pdf format.

 

11 thoughts on “How Controllers Maximize SSD Life – Improved ECC”

  1. One expression/interpretation of the Shannon Limit is the number of bits (entropy) that a system can reliably resolve from a block of media, so if a device’s bit error rate is initially 2% and you design it to store 1088-byte blocks, the Shannon Limit given those assumptions lets you store 1066.24 bytes with average reliability (50:50 odds it is accurate), at this instant in time. Of course we would like a little better than that, lol.

    For NAND, the Shannon Limit is related to the number of electrons stored in the charge on the floating gate. Now consider that charge is leaking from the flash cells slowly over time, which reduces the Shannon Limit for that media over time. That’s why SLC/MLC/TLC/QLC each have different trade-offs in reliability vs capacity for the same physical cell characteristics. If you stuff 1000 electrons in there and you lose 1 per day, and the trigger voltage is reached at 200 electrons ±20, then for SLC you can reliably leave the data in that cell for 780 days the first time you write it. For MLC you might trigger at 330, 580, and 830. Each time you write it you can stuff fewer and fewer electrons in there because it gets damaged, so pretty soon you may not be able to get even 830 electrons in there any more. (The sketch after these comments works through this arithmetic.)

    You can correct for that by keeping some stats on what the nominal and peak values are for the cells in a block; you can estimate them based on the block’s erase count and the success rates when decoding the ECC symbols at different trigger points for each voltage level, and then use that to tweak the trigger points for future reads before feeding the data to the ECC. Anyhow, as you have fewer electrons in there, and the amount varies more and more from cell to cell within the block, the Shannon Limit is dropping. You no longer have 50:50 odds that your data is accurate. MLC/TLC/QLC exacerbate that situation by effectively dividing down the number of electrons per bit without improving the distribution of differences per cell, so reliability drops off sharply (and durability with it, because it forces the controller to keep refreshing the blocks by re-writing them much sooner than with SLC).

    So at the start you have the ability to discern 1066.24 bytes with 50:50 certainty. Over time the Shannon Limit is dropping due to leakage, with a matching ever-increasing failure pattern across the cell block. The failure mode is the inability of the sensing hardware to discern between 2 storage charge levels for the cell. This is true for SLC or MLC+, but the failure rates for each charge level are statistically different for each cell block due to physical fabrication differences and differences in the data written to each cell.

    The big difference between Reed-Solomon and LDPC is that LDPC takes advantage of an understanding of the failure mode of MLC+ flash. For example, in an MLC NAND cell leakage can result in 10 being read as 01, but leakage will never result in 01 being read as 10. Meanwhile, Reed-Solomon codes were invented for line transmission, where 01 has equal odds of being misread as 10 and vice-versa, due to noise affecting the timing of a sine wave crossing a threshold – generally it will move a bit left to right rather than decrementing a pair of bits as charge loss does in an MLC cell.

    Without understanding how much error recovery is possible, it’s not really possible to accurately estimate the Shannon Limit, and without that it’s difficult to budget redundancy shrewdly for the error-correction subsystems, whether in the share of bandwidth given to ECC symbol redundancy or in the ratio of checksums and parity to block size, to hit a target cost/reliability point. Early NAND products did a pretty poor job of that.

  2. Sorry, I should add that since LDPC does a much better job of correcting for the failure mode of MLC+ NAND, it can increase the amount of time before the drive’s controller needs to refresh the block by erasing and re-writing it, and that can increase the longevity of the drive.

    However, these days drive manufacturers are aware of the tradeoffs and will simply relax the manufacturing standards and reduce over-provisioning, so the gain shows up as a lower device cost rather than greater longevity.
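
A minimal sketch (in Python) of the retention arithmetic described in the first comment above, using the commenter’s illustrative numbers; the leakage rate, electron counts, and trigger points are the comment’s own examples, not measured values:

# Days a cell's data stays readable if it is written with written_electrons,
# leaks one electron per day, and must stay above the highest trigger point
# plus its +/-20-electron uncertainty band (all figures from the comment).
def retention_days(written_electrons: int,
                   highest_trigger: int,
                   uncertainty: int = 20,
                   leak_per_day: int = 1) -> int:
    return (written_electrons - (highest_trigger + uncertainty)) // leak_per_day

print(retention_days(1000, 200))  # SLC: 780 days, as in the comment
print(retention_days(1000, 830))  # same arithmetic at the MLC top level: 150 days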
