A lot of folks believe that when Intel’s Optane is gone there will be nothing left but the story of its rise and fall. That is far from the truth. Optane has created a legacy of developments that will find use in computing for quite some time.
In this three-part series The SSD Guy blog reviews Optane’s lasting legacy to reveal six changes that it has brought to computing architecture in its short lifetime.
Each of the three parts covers two new developments:
- New programming paradigm & instructions
- New approach to 2-speed memory & latency handling
- New approach to memory expansion & security
This second post covers Optane’s new approach to 2-speed memory and the way it handles Optane’s longer latency.
New Approach to 2-Speed Memory
Optane Media, the 3D XPoint memory that is the basis of Optane DIMMs, takes about 3 times as long as DRAM to read, and perhaps twice as long as that to write. Although that’s enormously fast compared to NAND flash, it presents problems because of the way that the main memory bus has evolved.
Ever since DRAMs became synchronous around the turn of the century, DRAM buses have been designed to run at a uniform speed, with the only difference between one memory module and another determined by how many latency cycles were required for either a read or a write. There was no mechanism to support a slower write than a read, or to support memories of two significantly different latencies.
Rather than slow the entire bus down to a speed that would accommodate the slowest operation on the slowest memory chip, Intel developed the “DDR-T” spin-off of the DDR4 bus. The “T” stands for “Transactional”: commands are issued, and acknowledgements are given later, as in an I/O interface.
Intel hasn’t provided much detail on the DDR-T protocol, since the company makes both the processors and the DIMMs that use it, so there’s no real need for them to disclose anything. Pretty much all that was revealed is in the diagram below:
The only real difference between the signals that go to the DDR4 DIMM on the right and the “Intel Optane DC Persistent Memory” module on the left is the purple line on the far left of the diagram labeled “Modified Control Signals.” These signals are carried on a few normally-unused pins on the standard DDR4 bus.
The important point is that Intel made the industry aware that this would be an issue in the future, and that a solution was required. For slower Far Memory (memory not directly attached to the processor) Intel’s solution was CXL, while for Near Memory the solution was a DDR-T interface that would need to be reworked for every new spin of the DDR interface: DDR4, DDR5, DDR6…
IBM, in their OpenCAPI standard, chose to develop a different approach called OMI that is described in this white paper. Now that CXL has merged with OpenCAPI, OMI could very well become the industry-standard way of attaching near memory of any speed to a processor.
Improved Latency Handling
When Optane was introduced, computers handled latencies in two ways. Short-latency memories ran on the memory bus, while longer-latency I/O was managed through interrupts and context switches. A context switch takes a long time: the processor pushes the program counter and a number of internal registers onto the stack at the beginning of the routine, and restores all of them at the end. As a rule of thumb, count on a context switch consuming around 100μs, or about 1,000 times Optane’s 100ns latency.
The impact of this is illustrated in the diagram below (from SNIA). Latency is measured on a log scale on the vertical axis. The background colors represent the latencies where a context switch makes sense and where it doesn’t. In the darker upper portion it makes sense, since it doesn’t add much to the access time of the device. In the lower green section a context switch would dominate the device’s access time, so it would be unacceptable, and polling, where the processor continuously loops waiting for the data to be ready, would be a preferable option. The light band in the middle is a zone in which it’s difficult to tell which option is better.
The columns show the latencies of different storage technologies. Prior to persistent memory it was clear that all persistent storage was slow enough that it could be well managed as I/O, using interrupts which would trigger context switches. With persistent memory (Optane) that changed, and polling became a better choice.
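The break-even logic behind that diagram can be sketched in a few lines of C. The constant below is the post’s rule-of-thumb figure (~100μs per context switch), and `poll_for_completion` is a hypothetical busy-wait on a device-set flag, not any particular driver’s API:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CONTEXT_SWITCH_NS 100000.0  /* rule-of-thumb: ~100 us per switch */

/* Interrupts (and the context switches they trigger) only make sense
 * when the device is slow enough that the switch cost gets lost in
 * the device's own latency. */
static bool prefer_polling(double device_latency_ns) {
    return CONTEXT_SWITCH_NS >= device_latency_ns;
}

/* The polling alternative: spin on a completion flag that the device
 * or its DMA engine sets, never yielding the CPU to another thread. */
static void poll_for_completion(const atomic_bool *done) {
    while (!atomic_load_explicit(done, memory_order_acquire))
        ;  /* each check costs nanoseconds, not a ~100 us context switch */
}
```

With these figures, `prefer_polling(100.0)` is true for Optane’s ~100ns reads, while a spinning disk’s ~10ms seek (10,000,000ns) falls firmly on the interrupt side.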
Handling Optane through an interrupt structure would drop its speed by three orders of magnitude. That was clearly unacceptable. Yet, it ran more slowly than standard DRAM main memory, and interrupts were the only way to handle slower devices.
Computer designers had to open their eyes to polling, a technique that had been abandoned for decades. This way of thinking has now been re-established, and it will provide opportunities to examine fast I/O in ways that haven’t been considered before. With CXL.mem and CXL.cache, memory accesses are handled without context switches, allowing them to run much faster than I/O transactions.
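To make that contrast concrete: once a CXL.mem region is mapped into the address space, a far-memory read is just an ordinary load. The sketch below assumes a hypothetical mapped region (`cxl_region` is not a real API) — the point is that no syscall, interrupt, or context switch appears anywhere in the path:

```c
#include <stdint.h>
#include <stddef.h>

/* A read from CXL.mem-attached memory is a plain cache-coherent load;
 * the hardware absorbs the extra latency, and software never changes
 * context the way an I/O read through a driver would. */
static uint64_t read_far_memory(const volatile uint64_t *cxl_region, size_t i) {
    return cxl_region[i];
}
```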
Coming Up: Memory Expansion & Security Concerns
In the next part of this series we will discuss how Optane has caused computer architects to completely re-think memory expansion, along with the special concerns that have been addressed to prevent data theft via persistent memory.
Keep in mind that Objective Analysis is an SSD and semiconductor market research firm. We go the extra mile in understanding the technologies we cover, often in greater depth than our clients. This means that we know how and why new markets are likely to develop around these technologies. You can benefit from this knowledge too. Contact us to explore ways that we can work with your firm to help it create a winning strategy.
2 thoughts on “Optane’s Legacy, Part II: Two-Speed Memory and Latency Handling”
DDR-T did not use polling. All indications are that it worked a bit like DDR, that the CPU had a state machine predicting how the Optane DIMM would respond and on what schedule, but that it was incompatible with the DDR4 state machine. So the CPU scheduled the data transfers on the half-duplex bus. There may have been some additional signaling to let the CPU know that the DIMM had completed a read, since that timing might vary with ECC activity on the DIMM controller.
In OMI and CXL.mem the bus is truly duplex, so there is no need for the CPU to schedule replies. The replies simply queue at the DIMM controller if necessary and are accompanied by transaction IDs so they match up with waiting requests when they arrive at the host.
Polling would be pretty wasteful at this kind of throughput. Even for future SSD traffic the S-IOV mechanism foresees the host writing commands into hardware queues at the device while the device DMAs data from and to the host, and command completions are notified through the coherency mechanism (like shared locks), which does not use interrupts. Those command completions are a form of polling, but the host polls its own memory space to see if anything has been written into the completion queue. This is how S-IOV will scale to millions of IOs per core.
Thanks, Tanj. You give a much more detailed and thorough explanation.
My goal was to abbreviate the discussion to its simplest element.
Given that it’s not a context switch, I decided to simply say that it is polling, without going into the deeper explanation that you have shared.