Using AI to Manage Internal SSD Parameters

For a long time The SSD Guy has meant to write something about the budding use of AI in SSDs.  It's an interesting approach whose time has come.

If you're not conversant with AI, and maybe find the whole subject to be daunting, don't worry.  AI comes in many forms, and some are very simple.  When major Internet firms like Google and Facebook use AI to target advertising it's enormously complex, but the AI used in endpoints for the Internet of Things (IoT) is often almost trivial.  The SSD controllers that use AI today tend to use it at a very basic level.  Even so, they get amazing results.

AI solves two problems:

  1. It can manage a large number of variables relatively easily
  2. It’s adaptive

SSDs have a large number of internally-tuned variables that designers use to optimize one specification or another.  For example: Is this SSD aimed at a high read load, or a high write load?  What will be its endurance?  How will latency be impacted by IOPS?  When is the best time to do Garbage Collection?  What are the normal traffic patterns?  The list goes on and on, and every parameter has an impact on one or more others.  (Lane Mason calls this: "Squeezing the balloon.")  AI can help simplify the art of tuning all of these parameters for optimum performance.

The beauty of an SSD being adaptive is that it can tune itself to the unique workload of any system.  Every system's workload differs from every other's, and that workload changes over time, either over the longer term as software and usage patterns change, or even from hour to hour as demand follows human schedules.  After all, not many people go shopping at 3:00 AM.
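To make "adaptive" a little more concrete, here is a minimal sketch, not taken from any vendor, of how firmware might learn a daily load pattern and pick quiet hours for background work such as garbage collection.  The class `HourlyLoadTracker`, its smoothing factor, and the synthetic workload are all hypothetical; a real controller would track far more than IOPS and would use its own tuning method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HourlyLoadTracker:
    """Hypothetical tracker that learns typical IOPS for each hour of the day
    using an exponentially weighted moving average (EWMA)."""
    alpha: float = 0.2  # assumed smoothing factor: higher values react faster to change
    ewma: List[Optional[float]] = field(default_factory=lambda: [None] * 24)

    def observe(self, hour: int, iops: float) -> None:
        """Blend a new load sample into the running estimate for that hour."""
        prev = self.ewma[hour]
        if prev is None:
            self.ewma[hour] = iops  # the first sample seeds the estimate
        else:
            self.ewma[hour] = self.alpha * iops + (1 - self.alpha) * prev

    def quietest_hours(self, n: int = 3) -> List[int]:
        """Return the n hours with the lowest learned load, i.e. candidate
        windows for background work such as garbage collection."""
        known = [(load, hour) for hour, load in enumerate(self.ewma) if load is not None]
        return [hour for _, hour in sorted(known)[:n]]


# Synthetic week (assumption): busy from 09:00 to 21:00, quiet overnight.
tracker = HourlyLoadTracker()
for day in range(7):
    for hour in range(24):
        iops = 50_000 if 9 <= hour < 21 else 2_000 + 100 * hour
        tracker.observe(hour, iops)

print(tracker.quietest_hours())  # [0, 1, 2]
```

Because each hour's estimate is a moving average, a change in the workload (say, a new batch job that starts at 02:00) gradually shifts which hours the tracker reports as quiet, without anyone retuning it by hand.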

A small amount of AI can do a lot to help with these issues.  Let's look at a couple of examples.  The following companies openly discuss their use of AI in their SSDs.  There are doubtless others that keep their use of AI secret.

InnoGrit

In its 2019 Flash Memory Summit keynote speech (a video of which is available online), InnoGrit introduced an SSD controller named Tacoma that uses AI to tune internal parameters.  (The slides can be downloaded from the Flash Memory Summit website.)

InnoGrit uses its AI to decide which data should be kept in Storage Class Memory (SCM) and which should be moved to TLC flash.  (Note that InnoGrit defines SCM to include emerging memory technologies like Intel's Optane, as well as high-speed NAND flash like Kioxia's XL-FLASH.  The definition of SCM tends to depend on who's using the term, a sad fact that is explored in a post by my counterpart The Memory Guy.)

The keynote included benchmarks for a system with 80% TLC NAND for storage and 20% XL-FLASH as the SCM.  The controller's internal neural network does a very good job of determining which data is hot and should reside in the SCM.  In a PC user trace it was 99.94% accurate, and in an enterprise system it had an accuracy of 94.72%.  This allowed a user trace to run roughly twice as fast, completing in 90 seconds, compared to 175 seconds in a more conventional SSD without SCM or the neural network.  The same benchmark in an SSD with SCM, but without AI, took 124 seconds.
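The speedups implied by those three run times are easy to check.  The snippet below simply recomputes them from the keynote's figures; the `speedup` helper is illustrative.

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster a run is: baseline time divided by new time."""
    return baseline_s / new_s


# PC user-trace run times from InnoGrit's 2019 FMS keynote, in seconds.
conventional = 175   # no SCM, no neural network
scm_only = 124       # SCM, but no AI
scm_with_ai = 90     # SCM managed by the neural network

print(f"SCM alone:        {speedup(conventional, scm_only):.2f}x")     # 1.41x
print(f"AI on top of SCM: {speedup(scm_only, scm_with_ai):.2f}x")      # 1.38x
print(f"SCM + AI overall: {speedup(conventional, scm_with_ai):.2f}x")  # 1.94x
```

The SCM alone buys about 1.41x, and the neural network's data placement adds roughly another 1.38x on top of that; the two multiply to the 1.94x ("twice as fast") overall figure.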

The system takes a little while to learn the data pattern, and that’s apparent in this slide, which was a part of the keynote.  It seems to be a quick learner.

Curve showing the improvement in performance as a function of time

As the graphic shows, the AI engine identified hot vs. cold data with 85% accuracy within 250 I/O operations, rising to 95% by the time about 3,000 I/O operations had been performed.

DapuStor

DapuStor's approach is to analyze real-world SSD workloads on a server and apply these usage patterns to the learning inside the DapuStor SSD.  The real-world workloads have been captured and processed by Calypso Systems, which has been doing this work for a number of years in conjunction with the Storage Networking Industry Association (SNIA).

DapuStor’s system uses only four inputs that it feeds into a 3-layer recurrent neural network with 128 nodes for each internal layer.  The four inputs are:

  1. I/O Address
  2. I/O Length
  3. Read/Write Ratio
  4. Read/Write Interval
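Here is one way those four inputs might be derived from a stream of host commands.  It is a minimal sketch under stated assumptions: DapuStor doesn't define the inputs in detail, so this code treats the read/write ratio as the fraction of reads among the last few commands and the read/write interval as the time since the previous command.  `IoCommand`, `FeatureExtractor`, and the window size are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IoCommand:
    timestamp_us: int  # arrival time in microseconds
    lba: int           # logical block address (the I/O address)
    length: int        # transfer length in blocks
    is_read: bool


class FeatureExtractor:
    """Hypothetical mapping of a command stream onto four inputs:
    (address, length, read ratio over a recent window, inter-command interval)."""

    def __init__(self, window: int = 32):
        self.recent_reads = deque(maxlen=window)  # 1 for a read, 0 for a write
        self.last_ts: Optional[int] = None

    def features(self, cmd: IoCommand) -> Tuple[int, int, float, int]:
        self.recent_reads.append(1 if cmd.is_read else 0)
        read_ratio = sum(self.recent_reads) / len(self.recent_reads)
        # The interval is 0 for the very first command, since there is no predecessor.
        interval = 0 if self.last_ts is None else cmd.timestamp_us - self.last_ts
        self.last_ts = cmd.timestamp_us
        return (cmd.lba, cmd.length, read_ratio, interval)


fx = FeatureExtractor(window=4)
trace = [IoCommand(0, 1000, 8, True),
         IoCommand(50, 1008, 8, True),
         IoCommand(120, 5000, 64, False)]
for cmd in trace:
    print(fx.features(cmd))
# (1000, 8, 1.0, 0)
# (1008, 8, 1.0, 50)
# (5000, 64, 0.6666666666666666, 70)
```

In practice these raw values would also be normalized before being fed to a neural network, since an LBA and a read ratio differ by many orders of magnitude.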

Sketch of three-layer neural network

The company reasons that, since I/O patterns form a time series, a recurrent approach, which feeds past results back into the neural network, is a good fit.  The network works over a sliding window of recent I/Os to predict upcoming I/O patterns.
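The general shape of such a network can be sketched in a few dozen lines.  This is a generic stacked Elman-style recurrent network that assumes the sizes quoted above (four inputs, three recurrent layers of 128 units each).  DapuStor's actual cell type, output definition, and training method aren't described, so the class names, the linear read-out that predicts the next feature vector, the window length, and the random (untrained) weights are all assumptions.  Pure Python is used only to keep the example dependency-free.

```python
import math
import random
from collections import deque


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class ElmanLayer:
    """One simple recurrent layer: h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b)."""

    def __init__(self, n_in, n_hidden, rng):
        scale = 1.0 / math.sqrt(n_hidden)
        self.wx = [[rng.gauss(0, scale) for _ in range(n_in)] for _ in range(n_hidden)]
        self.wh = [[rng.gauss(0, scale) for _ in range(n_hidden)] for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden

    def run(self, xs):
        """Process a sequence, feeding each hidden state back in at the next step."""
        h = [0.0] * len(self.b)
        outputs = []
        for x in xs:
            pre = [a + c + d for a, c, d in zip(matvec(self.wx, x), matvec(self.wh, h), self.b)]
            h = [math.tanh(p) for p in pre]
            outputs.append(h)
        return outputs


class StackedRnnPredictor:
    """Hypothetical 3-layer RNN mapping a window of feature vectors to a
    prediction of the next vector. Weights are random here; a real model
    would be trained on captured workloads."""

    def __init__(self, n_features=4, n_hidden=128, n_layers=3, seed=0):
        rng = random.Random(seed)
        sizes = [n_features] + [n_hidden] * n_layers
        self.layers = [ElmanLayer(sizes[i], sizes[i + 1], rng) for i in range(n_layers)]
        self.readout = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_features)]

    def predict_next(self, window):
        seq = window
        for layer in self.layers:
            seq = layer.run(seq)  # each layer consumes the previous layer's outputs
        return matvec(self.readout, seq[-1])  # read out from the final time step


# Usage: slide a fixed-length window over incoming (normalized) feature vectors.
model = StackedRnnPredictor()
window = deque(maxlen=8)  # window length of 8 is an assumption
for vec in [[0.10, 0.5, 1.0, 0.02], [0.11, 0.5, 1.0, 0.01], [0.90, 0.1, 0.0, 0.30]]:
    window.append(vec)  # the oldest vector drops out once the window is full
    prediction = model.predict_next(list(window))
print(len(prediction))  # 4: one predicted value per input feature
```

The recurrence (`wh` multiplying the previous hidden state) is what lets the network carry context from earlier I/Os forward, which is the property the company cites for choosing a recurrent design over a plain feed-forward one.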

For the sake of speed, the neural network is currently implemented in hardware in an FPGA, with an ASIC planned for the future.  With this modest amount of AI, DapuStor claims a prediction accuracy of better than 95%, which it says drives performance 20% above that of a similar SSD without AI.

You can download a copy of a presentation that DapuStor and Calypso gave at SNIA's Persistent Memory Summit, or watch a video of it online.  (DapuStor's 12-minute portion starts a little past the middle of the presentation.)

Enmotus

While DapuStor and InnoGrit have built SSDs based on XL-FLASH paired with TLC flash, Enmotus is using standard QLC and SLC flash that is managed by AI.  The company's FuzeDrive, which is pitched to the gaming community, is said to stay fast even as it is filled.

One big difference between Enmotus and the other two is that the AI management engine resides in the drivers on the host.  The company made a presentation about the FuzeDrive at the 2020 Flash Memory Summit, and its slides are available for download.

The SLC flash in the Enmotus drive is really QLC that is being used as SLC, so it has SLC speed and endurance, but should cost less than the XL-FLASH used in the two SSDs mentioned above.  Most QLC SSDs reassign a small region in their NAND arrays to be used as an SLC cache, but the Enmotus SSD presents the SLC in a host-accessible region, allowing the host, rather than the SSD controller, to control which data is placed in SLC, and which in QLC.  This is illustrated by the diagram below.

Chart showing application programs at the top, an SLC + QLC SSD at the bottom, and an AI layer managing data placement
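The host-side idea can be sketched as a small remapping layer.  This is a hypothetical illustration, not Enmotus's code: a simple access-frequency counter with periodic decay stands in for the AI model, and the class `HostTieringMap` and its parameters are invented for the example.  What it does show is the key difference from an in-drive SLC cache: the host decides which logical blocks live in the SLC region.

```python
from collections import Counter


class HostTieringMap:
    """Hypothetical host-side placement layer for an SSD that exposes a fast
    SLC region and a larger QLC region. A plain access-frequency counter
    stands in for the AI model."""

    def __init__(self, slc_capacity_blocks: int):
        self.slc_capacity = slc_capacity_blocks
        self.in_slc = set()    # logical blocks currently mapped to SLC
        self.heat = Counter()  # access count per logical block

    def access(self, block: int) -> str:
        """Record an access and return the tier that served it (before any promotion)."""
        self.heat[block] += 1
        tier = "SLC" if block in self.in_slc else "QLC"
        self._maybe_promote(block)
        return tier

    def _maybe_promote(self, block: int) -> None:
        if block in self.in_slc:
            return
        if len(self.in_slc) < self.slc_capacity:
            self.in_slc.add(block)  # free SLC space: promote right away
            return
        # SLC is full: swap with the coldest SLC block if this block is hotter.
        coldest = min(self.in_slc, key=lambda b: self.heat[b])
        if self.heat[block] > self.heat[coldest]:
            self.in_slc.remove(coldest)  # demote (its data would be copied to QLC)
            self.in_slc.add(block)

    def decay(self) -> None:
        """Halve all counts periodically so placement follows workload changes."""
        for b in self.heat:
            self.heat[b] //= 2


tiers = HostTieringMap(slc_capacity_blocks=2)
for b in [1, 1, 2, 3, 3, 3]:
    tiers.access(b)
print(sorted(tiers.in_slc))  # [1, 3]: block 3 has displaced the colder block 2
```

Because the decision lives in the host driver, the placement policy can be updated like any other software, independent of the SSD's firmware.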

Enmotus says that AI-based SSDs maintain consistent performance by remapping hot data to SLC in real time, while standard SSD caches slow down as the drive is filled with data.  (I asked Enmotus about this, since the speed of an SSD is highly dependent on its level of overprovisioning.  The explanation is that the large SLC area reduces QLC writes so dramatically that very little delay actually comes from writes to the QLC.)

The company benchmarked the FuzeDrive against a number of TLC SSDs and says that the PCMark 10 Quick Storage Test runs as much as 82% faster than on these others.  SLC also wears better than TLC or QLC, and this architecture reduces the number of writes that actually reach the QLC, so the FuzeDrive provides greater endurance than you would expect from a QLC SSD.  The company takes advantage of this higher endurance by offering three product tiers, Gold, Silver, and Bronze, each with a different endurance rating.

Understanding the Technology

My company, Objective Analysis, provides market research and consulting services for markets that we deeply understand.  We’re a lot more technical than many competing firms.  You can use this to your company’s strategic advantage.  To explore ways to use our strengths to benefit your efforts, please contact us via the Objective Analysis website.
