Test Setup - Enterprise NVMe Round-Up 2: SK Hynix, Samsung, DapuStor and DERA

Posted by Reinaldo Massengill on Wednesday, March 13, 2024

Test Setup

For this year's enterprise SSD reviews, we've overhauled our test suite. The overall structure of our tests is the same, but a lot has changed under the hood. We're using newer versions of our benchmarking tools and the latest long-term support kernel branch. The tests have been reconfigured to drastically reduce CPU overhead, which has minimal impact on SATA drives but lets us properly push the limits of many enterprise NVMe drives for the first time.

The general philosophy underlying the test configuration was to keep everything at its default or most reasonable everyday settings, and change as little as possible while still allowing us to measure the full performance of the SSDs. Esoteric kernel and driver options that could marginally improve performance were ignored. The biggest change from last year's configuration, and the biggest departure from normal everyday usage, is in the IO APIs used by the fio benchmarking tool to interact with the operating system.

In the past, we configured fio to use ordinary synchronous IO APIs: read() and write() style system calls. The way these work is that the application makes a system call to perform a read or write operation, and control transfers to the kernel to handle the IO. The application thread is suspended until that IO is complete. This means we can only have one outstanding IO request per thread, and hitting a drive with a queue depth of 32 requires 32 threads. That's no problem on a 36-core test system, but when it takes a queue depth of 200 or more to saturate a high-end NVMe SSD, we run out of CPU power. Running more threads than cores can get us a bit more throughput than just QD36, but that causes latency to suffer not just from the overhead of each system call, but from threads fighting over the limited number of CPU cores. In practice, this testbed is limited to about 560k IOPS when performing IO this way, and that leaves no CPU time for doing anything useful with the data that's moving around. Mitigations for Spectre, Meltdown and other vulnerabilities tend to keep increasing system call and context switch overhead, so this situation isn't getting any better.
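To make the one-request-per-thread constraint concrete, here is a minimal Python sketch of synchronous IO at a chosen queue depth. It is not how fio is implemented; the names read_blocks_sync, make_sample_file and BLOCK_SIZE are made up for illustration, and unlike our real tests it reads a small scratch file through the page cache rather than bypassing it.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # illustrative 4 KiB request size (an assumption)


def make_sample_file(num_blocks):
    """Write a scratch file whose i-th block is filled with byte value i."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(num_blocks):
            f.write(bytes([i % 256]) * BLOCK_SIZE)
        return f.name


def read_blocks_sync(path, offsets, queue_depth):
    """Read one block at each offset using blocking pread() calls.

    Each worker thread is suspended inside pread() until its read
    completes, so the number of worker threads is the maximum number
    of requests that can be in flight at once.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # map() returns results in the same order as `offsets`.
            return list(
                pool.map(lambda off: os.pread(fd, BLOCK_SIZE, off), offsets)
            )
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = make_sample_file(8)
    offsets = [i * BLOCK_SIZE for i in range(8)]
    # A queue depth of 4 needs 4 threads; a queue depth of 200 would need 200.
    blocks = read_blocks_sync(path, offsets, queue_depth=4)
    print(len(blocks), "blocks read")
    os.remove(path)
```

The only way to raise the queue depth in this model is to raise max_workers, which is exactly the scaling problem described above.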

The alternative is to use asynchronous storage APIs that allow an application thread to submit an IO request to the operating system but then continue executing while the IO is performed. For benchmarking purposes, that continued execution means the application can keep submitting more IO requests before the first one is complete, and a single thread can load down an SSD with a reasonably high queue depth.

Asynchronous IO presents challenges, especially on Linux. On any platform, asynchronous IO is a bit more complicated for the application programmer to deal with, because submitting a request and getting the result become separate steps, and operations may complete out of order. On Linux specifically, the original async IO APIs were fraught with limitations. The most significant is that Linux native AIO is only actually asynchronous when IO is set to bypass the operating system's caches, which is the opposite of what most real-world software should want. (Our benchmarking tools have to bypass the caches to ensure we're measuring the SSD and not the testbed's 192GB of RAM.) Other AIO limitations include support for only one filesystem, and myriad scenarios in which IO silently falls back to being synchronous, unexpectedly halting the application thread that submitted the request. The end result of all those issues is that true asynchronous IO on Linux is quite rare and only usable by some applications with dedicated programmers and competent sysadmins. Benchmarking with Linux AIO makes it possible to stress even the fastest SSD, but such a benchmark can never be representative of how mainstream software does IO.

The best way to set storage benchmark records is to get the operating system kernel out of the way entirely using a userspace IO framework like SPDK. This eliminates virtually all system call overhead and makes truly asynchronous IO possible and fast. It also eliminates the filesystem and the operating system's caching infrastructure and makes those the application's responsibility. Sharing an SSD between applications becomes almost impossible, and at the very least requires rewriting both applications to use SPDK and overtly cooperate in how they use the drive. SPDK works well for use cases where a heavily customized application stack and system configuration is possible, but it is no more capable of becoming a mainstream solution than Linux AIO.

A New Hope

What's changed recently is that Linux kernel developer (and fio author) Jens Axboe introduced a new asynchronous IO API that's easy to use and very fast. Axboe has documented the rationale behind the new API and how to use it.

In summary: the core principle is that communication between the kernel and userspace software takes place with a pair of ring buffers, so the API is called io_uring. One ring buffer is the IO submission queue: the application writes requests into this buffer, and the kernel reads them to act on. The other is the completion queue, where the kernel writes notifications of completed IOs for the application to pick up. This dual queue structure is basically the same as how the operating system communicates with NVMe devices. For io_uring, both queues are mapped into the memory address spaces of both the application and the kernel, so there's no copying of data required.

The application doesn't need to make any system calls to check for completed IO; it just needs to inspect the contents of the completion ring. Submitting IO requests involves putting the request in the submission queue, then making a system call to notify the kernel that the queue isn't empty. There's an option to tell the kernel to keep checking the submission queue as long as it doesn't stay idle for long. When that mode is used, a large number of IOs can be handled with an average of approximately zero system calls per IO. Even without it, io_uring allows for IO to be done with one system call per IO compared to two per IO with the old Linux AIO API.
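The two-ring arrangement is easier to picture with a toy model. The Python sketch below is purely illustrative and does not touch the real io_uring API: Ring and ToyKernel are hypothetical names, requests complete instantly and in order, and the "system call" is just a counter. It only shows how a submission ring, a completion ring and an optional polling mode change the number of kernel notifications per IO.

```python
class Ring:
    """Fixed-size ring buffer with a consumer head and a producer tail."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # index of the next entry to consume
        self.tail = 0  # index of the next slot to fill

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            raise BufferError("ring is full")
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item


class ToyKernel:
    """Toy stand-in for the kernel side of a submission/completion pair."""

    def __init__(self, size, polling=False):
        self.sq = Ring(size)    # submission queue: application -> kernel
        self.cq = Ring(size)    # completion queue: kernel -> application
        self.polling = polling  # models the kernel watching the SQ itself
        self.syscalls = 0       # how many times the application notified us

    def _drain(self):
        # "Perform" every queued request and post a completion for it.
        while (req := self.sq.pop()) is not None:
            self.cq.push(("done", req))

    def submit(self, requests):
        for req in requests:
            self.sq.push(req)
        if self.polling:
            self._drain()       # no notification needed in polling mode
        else:
            self.syscalls += 1  # one notification per submit() call
            self._drain()

    def reap(self):
        # Checking for completions only inspects the ring: no system call.
        completions = []
        while (c := self.cq.pop()) is not None:
            completions.append(c)
        return completions


if __name__ == "__main__":
    for polling in (False, True):
        kernel = ToyKernel(size=4, polling=polling)
        finished = []
        for request in range(8):
            kernel.submit([request])
            finished += kernel.reap()
        print(f"polling={polling}: {len(finished)} IOs, "
              f"{kernel.syscalls} notifications")
```

In this model, submitting eight requests one at a time costs eight notifications without polling and none with it, while reaping completions never costs one in either mode.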

Using synchronous IO, our enterprise SSD testbed cannot reach 600k IOPS. With io_uring, we can do more than 400k IOPS on a single CPU core without any extra performance tuning effort. Hitting 1M IOPS on a real SSD takes at most 4 CPU cores, so even the Micron X100 and upcoming Intel Alder Stream 3D XPoint SSDs should pose no challenge to our new benchmarks.

The first stable kernel to include the io_uring API was version 5.1 released in May 2019. The first long term support (LTS) branch with io_uring is 5.4, released in November 2019 and used in this review. The io_uring API is still very new and not used by much real-world software. But unlike the situation with the old Linux AIO APIs or SPDK, this seems likely to change. It can do more than previous asynchronous IO solutions, including being used for both high-performance storage and network IO. New features are arriving with every new kernel release; lots of developers are trying it out, and I've seen feature requests fulfilled in a matter of days. Many high-level languages and frameworks that currently simulate asynchronous IO using thread pools will be able to implement new io_uring backends.

For storage benchmarking on Linux, io_uring currently strikes the best balance between the competing desires to simulate workloads in a realistic manner, and to accurately gauge what kind of performance a solid state drive is capable of providing. All of the fio-based tests in our enterprise SSD test suite now use io_uring and never run more than 16 threads even when testing queue depths up to 512. With the CPU bottlenecks eliminated, we have also disabled HyperThreading.
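As a rough illustration of how a high queue depth can be reached with a capped thread count, the Python sketch below builds a fio command line that spreads a target queue depth across at most 16 jobs. The way the depth is split, the helper names, the 4k random read workload, the 60 second runtime and the /dev/nvme0n1 device path are all assumptions made for the example, not our actual job files; the fio options themselves (ioengine, iodepth, numjobs, direct and so on) are standard ones.

```python
import math

MAX_THREADS = 16  # thread cap stated above


def split_queue_depth(total_qd, max_threads=MAX_THREADS):
    """Return (numjobs, iodepth) whose product is at least total_qd.

    Assumed policy: use one thread per outstanding IO until the thread
    cap is reached, then deepen each thread's queue instead.
    """
    numjobs = min(total_qd, max_threads)
    iodepth = math.ceil(total_qd / numjobs)
    return numjobs, iodepth


def fio_command(device, total_qd):
    """Build an argument list for a 4k random read test using io_uring."""
    numjobs, iodepth = split_queue_depth(total_qd)
    return [
        "fio",
        "--name=randread",
        f"--filename={device}",
        "--ioengine=io_uring",  # asynchronous submission via io_uring
        "--direct=1",           # bypass the page cache
        "--rw=randread",
        "--bs=4k",
        f"--numjobs={numjobs}",
        f"--iodepth={iodepth}",
        "--time_based",
        "--runtime=60",
        "--group_reporting",
    ]


if __name__ == "__main__":
    # Hypothetical device path; only prints the command, does not run it.
    print(" ".join(fio_command("/dev/nvme0n1", 512)))
```

With this split, a queue depth of 512 becomes 16 jobs with an iodepth of 32 each, where the synchronous approach would have needed 512 threads.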
