How Controllers Maximize SSD Life – Feedback on Block Wear

One way that SSD controllers maximize the life of an SSD is to use feedback on the life of flash blocks to determine how wear has impacted them.  Although this approach used to be very uncommon, it is now being incorporated into a number of controllers.

Here’s what this is all about: Everybody knows that endurance specifications tell how much life there is in a block, right?  For SLC it is typically 100,000 erase/write cycles; for MLC it can be as high as 10,000 cycles on older processes but drops to 5,000 or even 3,000 on newer ones; TLC endurance can be in the hundreds of cycles.  Now the question is: “What happens after that?”

In most cases individual bits start to stick after that.  How long after that, though, is not fully specified.  In a SNIA white paper that The SSD Guy helped to write, Fusion-io shared some chip characterization the company had performed, which showed SLC flash sometimes lasting over 4 million cycles before bits started to stick.  (See Figure 2, which shows negligible bit errors below 2 million cycles for SLC NAND.)
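
Gathering that kind of feedback is straightforward in principle: every read already passes through the ECC engine, and the number of bits it corrects is a direct measure of how much margin a block has left.  Here is a minimal sketch in C, assuming a hypothetical ECC engine that reports its corrected-bit count; the structure and function names are illustrative, not any vendor’s actual interface:

```c
#include <stdint.h>

/* Illustrative per-block metadata that a flash translation layer
 * (FTL) might keep.  Real controllers track similar fields, but the
 * layout here is hypothetical. */
struct block_health {
    uint32_t erase_count;        /* program/erase cycles so far */
    uint16_t max_corrected_bits; /* worst ECC correction seen on a read */
};

/* Called on every read: the ECC engine reports how many bits it had
 * to correct in the worst codeword of the page.  A rising count is
 * direct feedback that wear is eroding the block's margin. */
void record_read_feedback(struct block_health *b, uint16_t corrected_bits)
{
    if (corrected_bits > b->max_corrected_bits)
        b->max_corrected_bits = corrected_bits;
}
```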

If NAND can be programmed and erased so very many times, then why don’t chip makers specify the greater number?  Wouldn’t this give them a competitive advantage?  In fact, for the vast majority of their market – consumer electronics – a better wear specification would not help sell flash chips, and it would be very costly to specify accurately.  Endurance tests are very time consuming, and unfortunately they destroy the chip being tested.  Add to this the fact that endurance can vary from chip to chip within a single wafer, and the result is that lot characterization cannot be performed efficiently.  Even if it could, it would create an inventory problem, since the chip maker would have to delay shipping a production batch until the wear testing had been completed.

The answer is to set the bar low enough that every part ever made will pass.  In the case of SLC flash this bar is set at 100,000 cycles.  For MLC it is set much lower, and for TLC it can be an order of magnitude lower than for MLC.

So, how can this bit of information be used to extend SSD life?

Let’s say that your controller can figure out how many erase/write cycles each flash block in the SSD can withstand.  It can then manage the wear leveling algorithm to put more wear onto the hardy blocks and less onto the weaker blocks.  More importantly, though, the controller will not simply give up and say: “100,000 erase/write cycles on this block – time to shut it off!”  Instead it will take the block right to the point where the number of bit errors approaches the maximum that the error correction algorithm can fix.
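
Here is a minimal sketch in C of what such feedback-driven wear leveling might look like: the controller erases into the block with the most remaining error-correction headroom, and a block drops out of rotation only when its observed errors approach the ECC limit.  Everything here – the structure, the 40-bit correction limit, the function names – is a hypothetical illustration, not any shipping controller’s algorithm:

```c
#include <stdint.h>

#define ECC_CORRECTABLE_LIMIT 40  /* hypothetical: bits the ECC can fix per codeword */
#define RETIRE_MARGIN          3  /* hypothetical safety margin before retirement */

struct block_health {
    uint32_t erase_count;        /* program/erase cycles so far */
    uint16_t max_corrected_bits; /* worst ECC correction seen on a read */
};

/* Choose the next block to erase and reuse.  Blocks with the most
 * remaining ECC headroom absorb the most wear; a block drops out of
 * rotation only when its observed errors leave almost no correction
 * margin, not when it crosses a fixed cycle count. */
int pick_erase_victim(const struct block_health blocks[], int nblocks)
{
    int best = -1;
    int best_headroom = -1;

    for (int i = 0; i < nblocks; i++) {
        int headroom = ECC_CORRECTABLE_LIMIT - blocks[i].max_corrected_bits;
        if (headroom <= RETIRE_MARGIN)
            continue;            /* effectively retired */
        if (headroom > best_headroom ||
            (headroom == best_headroom &&
             blocks[i].erase_count < blocks[best].erase_count)) {
            best_headroom = headroom;
            best = i;
        }
    }
    return best;                 /* -1 means no usable blocks remain */
}
```

Note the design choice: the erase count is only a tiebreaker, because the measured error margin, not the cycle count, is what actually tells the controller how much life a block has left.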

Naturally, this extends the life of each NAND block to its absolute limit, most likely multiplying the life of the SSD to many times that of one without this feedback.
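
To put rough numbers on “many times”: suppose feedback lets a population of SLC blocks run to 400,000 cycles on average instead of being retired at the 100,000-cycle specification.  The figures below are hypothetical, chosen only to show the arithmetic:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical figures, for illustration only. */
    const double spec_cycles     = 100000.0; /* SLC datasheet endurance */
    const double observed_cycles = 400000.0; /* assumed feedback-observed limit */

    printf("Life multiplier: %.1fx\n", observed_cycles / spec_cycles);
    return 0;                    /* prints: Life multiplier: 4.0x */
}
```

The Fusion-io characterization cited above suggests the multiplier can be far larger for some blocks, which is exactly why per-block feedback beats a single worst-case number.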

It’s not a complicated idea, but one that didn’t occur to early SSD designers.  It’s the kind of simple yet highly effective approach that I would expect to come into universal use over time.

This post is part of a series published by The SSD Guy in September-November 2012 to describe the leading methods SSD architects use to get the longest life out of an SSD despite the limited number of erase/write cycles that NAND flash specifications guarantee.  The following list provides the names of all of these articles, and hot links to them:

Click on any of the above links to learn about how each of these techniques works.

Alternatively, you can visit the Storage Networking Industry Association (SNIA) website to download the entire series as a 20-page booklet in PDF format.

6 thoughts on “How Controllers Maximize SSD Life – Feedback on Block Wear”

  1. This seems to be a bad idea because it ignores a major factor: time endurance. After the erase cycles are consumed, the flash should be able to store data for a specified number of years (10 usually) at a certain temperature with a certain BER. As the cycles increase beyond that point, the time endurance will drop, long before catastrophic failure occurs.

    If you push the limit like this, you may store data on a SSD and come back a year later to find it corrupted.

    1. Sam, that is quite correct, but I don’t really know of any applications in which data is expected to remain good for all that long.

      I know that some folks may want to keep old unused PCs around, but PCs are a low-write application, so the data should stay intact for a long while.

      The applications that are likely to suffer from shortened retention because of high write loads are in the enterprise, and I can’t think of a case in which an enterprise SSD would be powered down for a number of years.
