NVMe-oC: Wolley’s New Take on CXL-Based SSDs

With all the recent interest in CXL, and its ability to connect a processor to any memory, no matter the speed, it’s only natural that someone would try using it for SSDs.  This notion is the basis for the Memory-Semantic SSD, or MS-SSD.

But MS-SSDs suffer from the same problem as SSDs, hard drives, and other mass storage: the basic concept requires the SSD to try to anticipate the processor’s upcoming requirements.  If it guesses correctly, the SSD can perform the processor’s next operation rapidly, but if it guesses wrong, that operation will be slow.

In the case of the MS-SSD, the device must anticipate the next several addresses that the processor will read from, and load the corresponding data from the NAND into the MS-SSD’s DRAM.

What if the MS-SSD instead let the processor tell it what it will need over the next several memory cycles, allowing the MS-SSD to prefetch exactly what the processor was about to ask for?  This is the same thinking that has been applied to many non-CXL SSD architectures, like the Open-Channel SSD.

CXL controller design house Wolley decided to apply this thinking, and presented the company’s new architecture at November’s Supercomputing conference, SC23.

They call it “NVMe over CXL” or NVMe-oC and say that it’s:

an implementation of CXL to optimize the host-device data movement where most hosts only use a fraction of data retrieved from the storage devices.

The basic idea is that many SSD reads are for much smaller chunks of data than the standard 4KB delivered by an SSD access.  Why move all of that data over CXL.io or NVMe over PCIe if the processor only needs one 64-byte cache line of it?  NVMe-oC is expected to reduce both the I/O traffic and the effort the host spends moving the data itself.
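To put a number on that: a 64-byte cache line is only 64 of the 4,096 bytes in a 4KB block, about 1.6%, so nearly all of a conventional block read is wasted traffic when only one line is wanted.  The sketch below is a rough illustration of the two paths, not Wolley’s code; the NVMe file descriptor and the pointer to CXL-mapped device memory are placeholders.

```c
/*
 * Rough sketch, not Wolley's code: it contrasts how much data crosses the
 * link when the host only wants one 64-byte cache line.  The NVMe file
 * descriptor and the pointer to CXL-mapped device memory are placeholders.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* a conventional NVMe read moves a full 4KB block */
#define LINE_SIZE    64   /* the host may only need one cache line of it     */

/* Conventional NVMe-over-PCIe path: the whole 4KB block crosses the link. */
static int read_line_via_block_io(int nvme_fd, off_t offset, void *line_out)
{
    uint8_t block[BLOCK_SIZE];

    if (pread(nvme_fd, block, BLOCK_SIZE, offset) != BLOCK_SIZE)
        return -1;                        /* 4,096 bytes transferred...    */
    memcpy(line_out, block, LINE_SIZE);   /* ...but only 64 bytes are used */
    return 0;
}

/* NVMe-oC-style path: the data already sits in the device's CXL.mem-attached */
/* memory, so an ordinary load moves just the cache line the host asked for.  */
static void read_line_via_cxl_mem(const uint8_t *hdm_base, size_t hdm_offset,
                                  void *line_out)
{
    memcpy(line_out, hdm_base + hdm_offset, LINE_SIZE);
}
```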

The approach uses CXL.io to access the SSD and CXL.mem to access the device’s memory.  Special commands tell the SSD to move data from the NAND into that memory, or from that memory back into the NAND, without any interaction from the host, to reduce host-device data movement.  A block diagram appears below:

[Block diagram: a CPU at the top connects through a CXL link to a dashed box containing two side-by-side devices, NVMe NAND and Memory.  Bidirectional arrows run from both devices through the CXL link to the CPU, and a small red arrow shows data moving directly from the NAND into the Memory, bypassing CXL.]
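To make those special commands a little more concrete, here is a hypothetical C sketch of what a read might look like from the host’s point of view.  The command structure, driver call, and mapped HDM buffer below are invented for illustration and do not come from Wolley or any NVMe-oC specification; the point is only that the read’s destination is an offset inside the device’s own CXL.mem-exposed memory (the HDM), so the 4KB block never crosses the CXL link and only the cache lines the host actually touches do.

```c
/*
 * Hypothetical sketch of an NVMe-oC read, for illustration only: the command
 * layout, the submit call, and the HDM buffer are stand-ins, not the real
 * NVMe-oC interface.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct nvme_oc_read_cmd {
    uint64_t lba;         /* source logical block on the NAND                */
    uint32_t num_blocks;  /* transfer length, in 4KB blocks                  */
    uint64_t hdm_offset;  /* destination inside device memory, not host DRAM */
};

/* Stand-in for submitting the command to the device over CXL.io. */
static int nvme_oc_submit(const struct nvme_oc_read_cmd *cmd)
{
    (void)cmd;
    return 0;             /* pretend the NAND-to-HDM move has completed */
}

/* Stand-in for the device's HDM as it appears, CXL.mem-mapped, to the host. */
static uint8_t hdm[4096];

/* Fetch one 64-byte cache line that lives somewhere inside a 4KB block. */
static int read_cache_line(uint64_t lba, size_t line_index, void *out)
{
    struct nvme_oc_read_cmd cmd = {
        .lba = lba,
        .num_blocks = 1,
        .hdm_offset = 0,  /* the host picks where in the HDM the block lands */
    };

    if (nvme_oc_submit(&cmd) != 0)  /* NAND-to-HDM copy stays inside the device */
        return -1;

    /* Only the 64 bytes the host needs cross the CXL link, as ordinary loads. */
    memcpy(out, hdm + cmd.hdm_offset + line_index * 64, 64);
    return 0;
}
```

How the HDM slot is chosen and how completions are signaled are exactly the sort of details the special NVMe-oC driver would be expected to handle.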

In a way, this approach establishes a halfway point between two CXL-attached devices that have been under discussion since the early days of CXL: standard CXL-attached memory, and CXL-attached computational memory modules.  While computational memory modules offload certain computation tasks into the CXL module, the NVMe-oC device offloads a much more basic task, simple data movement, away from the host and into the module.  This simplifies adoption by reducing the amount of application software re-work necessary to take advantage of the technology.  In fact, Wolley claims that the NVMe-oC device can accelerate I/O virtualization on Virtio without requiring any changes to the application software; all that is necessary is a special NVMe-oC driver.

Wolley has built a prototype using an FPGA that contains a CXL controller, an NVMe controller for the NAND, DDR controllers for the memory, and an NVMe-oC bridge to manage the new data movement functions.

The SSD Guy is always fascinated by modest changes that solve significant problems, and this new architecture does a good job of that.  As data centers shift from conventional computing to more AI, data movement becomes a bigger and bigger issue.  By delivering 64-byte memory transfers instead of full 4KB blocks, and by offloading wasteful data movement tasks from the host processor, this SSD architecture should dramatically reduce the amount of unnecessary data moved to the host.

2 thoughts on “NVMe-oC: Wolley’s New Take on CXL-Based SSDs”

  1. Thank you very much for the nice write-up. It turns out there is a much more direct comparison between NVMe-oC and MS-SSD. I’m writing a white paper on it now and will be sure to share it with you.

    You were correct that the core value of the MS-SSD is its caching (or pre-fetching) algorithms. It turns out one way to look at NVMe-oC is that we are allowing the host to “manage” the HDM directly. For example, when the host asks the device to move a 4KB block from flash and place it at a particular location inside the HDM (remember, the host has full control of it using the conventional NVMe protocol), there is basically a “cache eviction” algorithm in play. If we want a fair comparison between NVMe-oC and MS-SSD, it will be one where both have the same amount of Flash and HDM. In that case, it basically boils down to who is running the caching algorithm: the host (NVMe-oC) or the device (MS-SSD).

    If history teaches us anything, it is that host-side algorithms have been better. You should still remember the Open-Channel SSD, where some people proposed moving the Flash translation layer to the host, with the idea that the host simply knows a lot more than the device can or ever will. From this angle, NVMe-oC is not only more “evolutionary” (aligned with the NVMe protocol, except that destination locations are assigned away from host DDR memory), but it may also outperform the MS-SSD with a host-blessed caching algorithm. (To be fair, when the device performs the caching, it is internal, while when the host performs the caching, it uses NVMe commands across the PCIe bus to make it happen. Mechanically, device-side caching may be more efficient; but intelligence-wise it is a totally different story.)

    There is actually another big disadvantage of the MS-SSD when cache misses occur. For NVMe-oC, if we take the view that the host is managing the HDM as a cache, then upon a cache miss the host will just issue another NVMe command to fetch the needed 4KB from Flash. When this happens, the application thread will be swapped out, and will eventually be swapped back in when the device interrupts the host once the NVMe command is finished. This is just I/O operation 101. Now see what happens when the MS-SSD hits a cache miss. The host just sends one CXL.mem command to the device, and the application thread may sit idle waiting for the data for 60+ microseconds! We were the SCM people, so we were counting on SCM latency being worse than, but still comparable to, DRAM, where such busy waiting may be acceptable. But I doubt it is a good approach for NAND…

    Anyway, I think this is just the beginning of an interesting journey. Your insight and help are very much appreciated here. We will share any additional information when it is available.

  2. Just today I learned that Samsung has renamed its memory-semantic SSD (MS-SSD) to CMM-H, for CXL memory module-hybrid.

    It’s the same device, but it will be marketed under this new name.
