For a long time, The SSD Guy has been talking about Write Amplification without explaining what it is. This post is intended to fix that.
Write amplification is an internal issue for NAND flash SSDs that arises from the way that NAND chips work. It doesn't exist in standard HDDs, nor did it exist in the DRAM SSDs that preceded NAND ones. In a nutshell, it's the number of times that the NAND within an SSD is written for each single write the host computer performs.
Since NAND flash wears out after too many writes, the higher the write amplification, the shorter the SSD will last. Write amplification also increases the internal write traffic within the SSD, which slows the SSD down.
Numerically it’s referred to as the Write Amplification Factor, sometimes abbreviated as “WAF”. An ideal SSD has a WAF of 1.0. Early SSDs had WAFs of 2.5 or even higher. Oddly enough, compressed SSDs sometimes had WAFs below 1 – a single write to the SSD would cause fewer than one write to the NAND within the SSD. That may sound impossible, but I’ll explain how it’s done at the very end of this post.
So a write amplification of 1 means that every write issued by the host computer causes a single write to the NAND flash. A write amplification greater than 1 means that more than one NAND write results from every write from the host computer. That’s not intuitive. Let’s see why that would happen.
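Expressed in code, the factor is simply a ratio (a minimal sketch; the function and variable names are my own, not an industry convention):

```python
def write_amplification_factor(nand_writes: int, host_writes: int) -> float:
    """WAF = total writes the NAND absorbs / writes the host actually issued."""
    return nand_writes / host_writes

# An ideal SSD: every host write causes exactly one NAND write.
print(write_amplification_factor(1000, 1000))   # 1.0
# An early SSD: internal housekeeping multiplies the traffic.
print(write_amplification_factor(2500, 1000))   # 2.5
```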
I’ll use a highly graphical approach to make it easier to absorb.
NAND Flash’s Inner Quirks
Write amplification, as well as a few other SSD headaches (like garbage collection and non-deterministic timing), stems from the way that NAND flash chips work. You can’t understand write amplification unless you understand NAND’s internal structure and its erase-before-write requirement.
So here’s a brief description of these characteristics.
A NAND flash chip is a collection of Blocks. Each Block is made up of a number of 4kB Pages. This is roughly sketched out in the diagram below.
This isn't a real chip. It isn't even close! This chip has four blocks, each with sixteen pages, for a grand total of 64 pages. A real 128Gb NAND flash chip, a density prevalent at this writing, has 8,192 blocks of 128 pages each. That's too many for a simple sketch, so we'll use the slimmed-down example here.
Now that you understand the layout, let’s look at those quirks.
First, neither NAND nor NOR flash can modify bits within an already-written byte. Whereas a DRAM byte that contains 1010 1111 can be over-written with 1111 1010, a flash byte must first be erased to an all-ones state (1111 1111) before it can be written to anything else. A write can only change a bit from a 1 to a 0, never the other way around.
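This 1-to-0 rule can be modeled as a bitwise AND (a toy sketch of the constraint, not how any real chip is programmed internally):

```python
ERASED = 0b1111_1111   # a freshly erased flash byte is all ones

def program(cell: int, new_value: int) -> int:
    """Programming can only flip bits from 1 to 0, never 0 to 1,
    so the result is the bitwise AND of the old and new values."""
    return cell & new_value

cell = program(ERASED, 0b1010_1111)   # works: we started from all-ones
print(f"{cell:08b}")                  # 10101111
cell = program(cell, 0b1111_1010)     # attempt to over-write in place
print(f"{cell:08b}")                  # 10101010 -- NOT the value we wanted
```

The second write is silently corrupted, which is exactly why flash must be erased before a page can be reused.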
Second, you can't erase a single byte, or even a single 4kB page. Flash keeps its cost down by only allowing entire blocks to be erased. A block is a LOT of data, so you don't want to erase a block if it contains a lot of useful data that you would have to put back into it after the erase. (In that real-life 128Gb chip, a block contains 128 pages, each with 4K bytes, adding up to half a megabyte.)
Third, since flash wears out if you write to it too many times, you want to manage it in a way that minimizes writes. Plus, writes are really slow, so the fewer writes you do, the faster the SSD works.
Given all of this, flash controllers in every application (SSDs, cell phones, USB flash drives, flash cards, etc.) map pages to addresses: when the host tells a flash-based device to write to address XYZ, that write can actually take place at any erased page on the chip, according to the controller's preferences. When that address is to be over-written, rather than erasing the entire block, the controller simply invalidates the page that used to hold address XYZ and writes a new version of address XYZ to a different page. The controller keeps track of which page is mapped to which address, so that a read from address XYZ will return data from whichever page holds its current value.
An added advantage of this approach is that address XYZ’s write traffic gets moved around on the chip, helping to prevent one page from getting worn out before the others, even if address XYZ is the one that receives the vast majority of a program’s write activity.
The controller also keeps track of the pages that have been invalidated. One simple rule is that a subsequent write to a certain address should invalidate the page that held the prior data for that same address. In other words, a new write to address XYZ will not only cause data to be written to a new page, but it will invalidate the page that contains the old data for address XYZ. Other invalidations are harder for the controller to detect: when the host's file system deletes a file, for example, the controller has no way of knowing that the pages holding that file's data are no longer needed. This led to the development of the Trim command, which is explained in another post in the SSD Guy blog.
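The mapping and invalidation bookkeeping described above can be sketched as a toy flash translation layer (the class, names, and sequential-write policy are my own illustration, not any real controller's design):

```python
class ToyFTL:
    """A toy flash translation layer: maps logical addresses to
    physical pages and tracks pages holding stale data."""

    def __init__(self, num_pages: int):
        self.flash = [None] * num_pages   # simulated NAND pages
        self.map = {}                     # logical address -> physical page
        self.invalid = set()              # pages holding stale data
        self.next_free = 0                # sequential write pointer

    def write(self, address, data):
        if address in self.map:                   # the old copy becomes stale...
            self.invalid.add(self.map[address])
        self.flash[self.next_free] = data         # ...and data lands on a fresh page
        self.map[address] = self.next_free
        self.next_free += 1

    def read(self, address):
        return self.flash[self.map[address]]

ftl = ToyFTL(num_pages=64)
ftl.write("XYZ", "version 1")
ftl.write("XYZ", "version 2")   # a NEW page gets the data; page 0 is invalidated
print(ftl.read("XYZ"))          # version 2
print(ftl.invalid)              # {0}
```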
We can see how write amplification happens by graphically applying the explanations above to a simple example.
An SSD Example
Let’s show how write amplification occurs by stepping through the write activity of a very simplistic SSD.
In this example we will apply a simple rule: Whenever possible the controller writes to the flash sequentially, one block at a time, running through the pages from left to right, top to bottom, like this:
Fresh Out-of-the-Box State
Our SSD is shipped to you, the customer, with all of its blocks erased. For this series of graphics, an erased page is white, a page with valid data is green, and a page whose data has been invalidated is red.
The host starts to write data, and the controller maps it to sequential pages in the first block, left to right, top to bottom, like this.
SSD Fills with Valid Data and Invalid Pages
Over time, more and more pages are filled, but some of the writes are new data to an address that is already represented in flash (like address XYZ above), so the old pages get marked as invalid (red) so that they can eventually be reclaimed.
Still later, the last block begins to fill up. Pretty soon the chip will run out of erased pages. Since the controller can't over-write the old invalid data in the red pages, a block will need to be erased to make room for the new data. How can we do that?
Preserving Valid Data
As long as there are more free pages in the last block than there are valid pages in one of the other blocks, then the valid data can be copied from an older block into the remaining empty pages to free up the older block.
The controller has the responsibility of making sure that there are always enough erased pages available to receive the valid data from the block with the fewest valid pages.
Now the block on the upper left can be erased, preparing it to accept new data.
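The copy-then-erase step can be sketched as a function that counts the extra writes it costs (a toy model; the function and parameter names are mine, and the counts match the example above):

```python
def garbage_collect(valid_pages_in_victim: int, erased_pages_available: int) -> int:
    """Relocate a block's valid pages to erased pages elsewhere, then
    erase the block. Returns the extra internal writes this costs."""
    if valid_pages_in_victim > erased_pages_available:
        raise RuntimeError("nowhere to put the valid data")
    # Each surviving valid page must be re-written once to a new location;
    # the erase itself costs no writes (though it does wear the block).
    return valid_pages_in_victim

# Freeing the upper-left block in the example costs five extra writes:
print(garbage_collect(valid_pages_in_victim=5, erased_pages_available=11))  # 5
```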
This is all very nice and tidy, but look at what we just did. In order to free up the block we performed five extra writes! If this happens once every time we write to all 64 pages, then those 64 host writes result in 64 + 5 = 69 NAND writes, for a write amplification factor of almost 1.1 (69 ÷ 64 ≈ 1.08): each host write causes roughly 1.1 internal writes.
It's actually more likely that, once the SSD becomes relatively full, five new writes would be required every time a block fills up, quadrupling the extra writes: 64 + 20 = 84 writes, for a write amplification factor of 1.3125. That's nearly four internal writes for every three writes issued by the host. Write amplification thus slows the SSD while increasing its wear.
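Spelling out that arithmetic with the toy chip's numbers:

```python
host_writes = 64           # one host write for every page on the toy chip
copies_per_collection = 5  # valid pages relocated before each erase

# One garbage collection per full pass over the chip:
print((host_writes + copies_per_collection) / host_writes)       # 1.078125
# One garbage collection for each of the four blocks:
print((host_writes + 4 * copies_per_collection) / host_writes)   # 1.3125
```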
In real life things are more complicated, and sometimes pages from multiple blocks are copied to other unused pages. If the SSD becomes too full, then such management is performed very frequently, and the WAF goes sky high.
One way to address this problem is to make sure that the amount of flash that the host sees in the SSD is much smaller than the amount of flash that actually exists. That way the SSD never gets too full, and management is simpler. This is called overprovisioning.
Overprovisioning increases the amount of NAND flash within an SSD. Although this drives the SSD's cost up, it reduces wear and accelerates performance. Higher-end SSDs tend to use more overprovisioning than less costly ones: they cost more, but they perform better and last longer.
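A hypothetical overprovisioning calculation (the capacities below are made-up figures for illustration, not any particular product's):

```python
raw_gb = 512          # NAND actually inside the SSD (hypothetical figure)
advertised_gb = 480   # capacity the host is allowed to see (hypothetical)

# Overprovisioning is conventionally quoted as spare capacity
# relative to the advertised (user-visible) capacity:
spare_fraction = (raw_gb - advertised_gb) / advertised_gb
print(f"{spare_fraction:.1%} overprovisioning")   # 6.7% overprovisioning
```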
There’s a post on the SSD Guy blog that explains overprovisioning in more depth: How Controllers Maximize SSD Life – Over Provisioning
What About WAFs of Less than 1?
Some SSDs boast write amplification factors lower than 1.0. That means that the SSD’s flash is written to fewer times than the host writes to it.
While write coalescing might help in this direction by capturing a few writes before they make it into flash, particularly if the software frequently updates the same page in rapid succession, the most efficient way to achieve a sub-1.0 WAF is to use data compression. If every host write to the SSD is compressed before it is written to the NAND flash, then the NAND flash sees fewer writes than the host issued, resulting in fewer than one NAND write per host write. It doesn’t take much compression to offset the write amplification created by NAND’s internal quirks.
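The effect of compression on WAF can be estimated with a one-line model (my own sketch; real controllers are far more involved):

```python
def effective_waf(internal_waf: float, compression_ratio: float) -> float:
    """compression_ratio = compressed size / original size (lower is better).
    Compression shrinks the data before the controller's housekeeping
    amplifies it, so the two factors simply multiply."""
    return internal_waf * compression_ratio

# The 1.3125 housekeeping overhead from the earlier example,
# applied to data that compresses to 60% of its original size:
print(effective_waf(1.3125, 0.60))   # 0.7875
```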
This is an area that was pioneered by an SSD controller company named SandForce that was subsequently acquired by LSI Logic, whose SSD controller business was, in turn, acquired by Seagate.
Compression reduces the size of the data that is being written to the SSD’s internal NAND flash. Some time ago someone from SandForce told me that HTML code can be compressed, on average, to only 17% of its original size.
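As a rough illustration using Python's standard zlib library (the HTML here is a made-up repetitive scrap, and this sketch makes no attempt to reproduce SandForce's 17% figure):

```python
import zlib

# A hypothetical, repetitive scrap of HTML -- markup compresses well:
html = b"<html><body>" + b"<p>Write amplification explained.</p>" * 50 + b"</body></html>"
compressed = zlib.compress(html)

print(len(compressed) < len(html))   # True: fewer bytes reach the NAND
print(f"compressed to {len(compressed) / len(html):.0%} of original size")
```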
A write amplification lower than 1.0 means that SSDs with compression might be able to get away without any overprovisioning, and that they should perform faster and wear less than SSDs whose data is uncompressed.
Compression is not only used by SandForce. Some of IBM’s captively-produced SSDs compress their data, as do SSDs of other highly-reputable firms.
You’re probably now asking yourself: “Why aren’t all SSDs compressed?” The simple reason is that compression turns data into irregular sizes. Most SSD controllers efficiently map regularly-sized data writes into regularly-sized pages. Once those regularly-sized writes are compressed into chunks of variable sizes, it’s very challenging to arrange them to fit efficiently into the flash’s regularly-sized pages. This is a pretty big challenge that most SSD controller makers choose not to take on.
That’s All There Is To It!
By now you should have a good feel for write amplification and the write amplification factor (WAF): you know how it happens and what is done to address it. But unless you're designing an SSD controller or the firmware for one, you're unlikely to need this understanding.
Still, it’s good to be familiar with a term like this that’s in relatively common use by SSD manufacturers and designers.
And it gives you the ability to ask informed questions whenever someone mentions write amplification as a reason for some SSD to out-perform another.
We at Objective Analysis take the time to understand the products that we track, so that we can provide solid answers about the success and failure of various products and companies. You can take advantage of this understanding by contacting us through our web page at www.Objective-Analysis.com.