Why is my NVMe slower than my SSD?



In this article, we will look at some of the nuances of the I/O subsystem and how they affect performance.



A couple of weeks ago I was asked why an NVMe drive on one server was slower than a SATA SSD on another. I looked at the servers' specifications and realized it was a trick question: the NVMe drive was from the consumer segment, while the SSD was from the server segment.



Obviously, comparing products from different segments in different environments is not a fair comparison, but that is not an exhaustive technical answer either. Let's study the basics, run some experiments, and answer the question properly.



What is fsync and where is it used



To speed up work with drives, data is buffered: it is kept in volatile memory until a convenient moment comes to flush the contents of the buffer to the drive. What counts as a "convenient moment" is determined by the operating system and the characteristics of the drive. In the event of a power failure, all data in the buffer is lost.



There are a number of tasks in which you need to be sure that changes to a file have been written to the drive rather than sitting in an intermediate buffer. This assurance can be obtained with the POSIX-compliant fsync system call, which initiates a forced write from the buffer to the drive.



Let's demonstrate the effect of buffers with an artificial example in the form of a short C program.



#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void) {
    /* open answer.txt; if it does not exist, create it */
    int fd = open("answer.txt", O_WRONLY | O_CREAT, 0644);
    /* write the first part of the answer */
    write(fd, "Answer to the Ultimate Question of Life, The Universe, and Everything: ", 71);
    /* pretend to be computing the answer for 10 seconds */
    sleep(10);
    /* write the answer itself */
    write(fd, "42\n", 3);

    close(fd);
    return 0;
}


The comments explain the sequence of actions in the program. The text "Answer to the Ultimate Question of Life, The Universe, and Everything" is buffered by the operating system, and if you restart the server by pressing the Reset button during the "computation", the file will be empty. In our example, losing this text is not a problem, so fsync is not needed. Databases do not share this optimism.
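
If we did care about the first part of the text surviving a reset, the straightforward fix would be to call fsync right after each write. A minimal sketch of such a variant (error handling omitted for brevity):

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void) {
    int fd = open("answer.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "Answer to the Ultimate Question of Life, The Universe, and Everything: ", 71);
    /* force the buffered data out to the drive; once fsync returns,
       a reset no longer loses the first part of the text */
    fsync(fd);
    /* pretend to be computing the answer for 10 seconds */
    sleep(10);
    write(fd, "42\n", 3);
    /* make the answer itself durable as well */
    fsync(fd);
    close(fd);
    return 0;
}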



Databases are complex programs that work with many files at once, so they want to be sure that the data they write actually reaches the drive, because the consistency of the data inside the database depends on it. Databases are designed to record every committed transaction and to be ready for a power failure at any moment. This obliges them to call fsync constantly and in large quantities.



How frequent fsync calls affect performance



During ordinary I/O, the operating system tries to optimize its communication with disks, since external drives are the slowest link in the memory hierarchy. It therefore tries to accumulate as much data as possible and write it to the drive in a single operation.



Let's demonstrate the impact of fsync with a concrete example. The test subjects are the following SSDs:



  • Intel® DC SSD S4500 480 GB, connected via SATA 3.2, 6 Gb/s;
  • Samsung 970 EVO Plus 500 GB, connected via PCIe 3.0 x4, ~31 Gb/s.


The tests are run on an Intel® Xeon® W-2255 under Ubuntu 20.04. Sysbench 1.0.18 is used to benchmark the disks. A single partition is created on each drive and formatted as ext4. Preparing for the test consists of creating 100 GB of files:



sysbench --test=fileio --file-total-size=100G prepare


Running tests:



# without fsync
sysbench --num-threads=16 --test=fileio --file-test-mode=rndrw --file-fsync-freq=0 run

# with fsync after every write
sysbench --num-threads=16 --test=fileio --file-test-mode=rndrw --file-fsync-freq=1 run


The test results are presented in the table.

Test                          Intel® S4500    Samsung 970 EVO Plus
Read without fsync, MiB/s     5734.89         9028.86
Write without fsync, MiB/s    3823.26         6019.24
Read with fsync, MiB/s        37.76           3.27
Write with fsync, MiB/s       25.17           2.18

It is easy to see that the consumer-segment NVMe drive leads confidently when the operating system itself decides how to work with the disks, and loses badly when fsync is used. This raises two questions:



  1. Why does the read speed in the test without fsync exceed the physical bandwidth of the interface?
  2. Why does the server-segment SSD handle a large number of fsync requests so much better?


The answer to the first question is simple: sysbench generates files filled with zeros, so the test ran over 100 gigabytes of zeros. Since the data is extremely uniform and predictable, various OS optimizations kick in and significantly speed things up.
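
One of these optimizations is the page cache: once file data is in RAM, repeated reads are served from memory and can easily exceed the physical bandwidth of the interface. A minimal sketch that illustrates the effect by timing two consecutive reads of the same file (the file name is a placeholder; the first read may hit the drive, the second almost certainly comes from the cache):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* read the whole file and return the elapsed time in seconds */
static double timed_read(const char *path) {
    static char buf[1 << 20];              /* 1 MiB read buffer */
    struct timespec start, end;
    int fd = open(path, O_RDONLY);
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (read(fd, buf, sizeof(buf)) > 0)
        ;
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    /* "test_file.0" is a placeholder: any large file on the drive will do */
    printf("first read:  %.3f s\n", timed_read("test_file.0"));
    printf("second read: %.3f s\n", timed_read("test_file.0"));
    return 0;
}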



If these results make you doubt sysbench, you can cross-check them with fio.



# without fsync
fio --name=test1 --blocksize=16k --rw=randrw --iodepth=16 --runtime=60 --rwmixread=60 --fsync=0 --filename=/dev/sdb

# with fsync after every write
fio --name=test1 --blocksize=16k --rw=randrw --iodepth=16 --runtime=60 --rwmixread=60 --fsync=1 --filename=/dev/sdb

Test                          Intel® S4500    Samsung 970 EVO Plus
Read without fsync, MiB/s     45.5            178
Write without fsync, MiB/s    30.4            119
Read with fsync, MiB/s        32.6            20.9
Write with fsync, MiB/s       21.7            13.9

The trend is clear: NVMe performance drops sharply when fsync is used. Now we can move on to the second question.



Optimization or bluff



Earlier we said that data is stored in a buffer, but did not specify which one, because it did not matter. Without diving deep into the internals of operating systems, we can distinguish two general types of buffers:



  • software;
  • hardware.


The software buffer refers to buffers inside the operating system; the hardware buffer is the volatile memory of the disk controller. The fsync system call sends the drive a command to write the data from its buffer to the main storage, but it has no way to verify that the command was actually carried out.
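
To get a feel for what a single fsync costs on a particular drive, it can be timed directly. A minimal sketch of such a probe (the file name, block size, and iteration count are arbitrary choices for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const int iterations = 1000;
    char block[4096];
    memset(block, 'x', sizeof(block));

    /* put the file on the drive you want to test */
    int fd = open("fsync_probe.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        write(fd, block, sizeof(block));
        fsync(fd);            /* ask the drive to make the block durable */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total = (end.tv_sec - start.tv_sec) +
                   (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("avg write+fsync latency: %.3f ms\n", total * 1000 / iterations);

    close(fd);
    unlink("fsync_probe.bin");
    return 0;
}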



Since the server SSD performs better under this load, two assumptions can be made:



  • the disk is designed for such a load;
  • the disk bluffs and ignores the command.


A drive's dishonest behavior can be exposed with a power-failure test. You can run one using the diskchecker.pl script, which was written back in 2005.



The script requires two physical machines, a "server" and a "client". The client writes small chunks of data to the disk under test, calls fsync, and sends the server information about what has been written.
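
The core of the client side is easy to picture. Here is a heavily simplified sketch of the idea in C (this is not the actual Perl script; the host, port, and file name are placeholders):

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* connect to the "server", which keeps a log of confirmed records */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5400);                         /* placeholder port */
    inet_pton(AF_INET, "192.168.0.10", &addr.sin_addr);  /* placeholder host */
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    int fd = open("testfile.bin", O_WRONLY | O_CREAT, 0644);
    for (int i = 0; ; i++) {
        char record[64];
        int len = snprintf(record, sizeof(record), "record %d\n", i);
        write(fd, record, len);
        fsync(fd);                   /* the drive claims the record is durable... */
        send(sock, record, len, 0);  /* ...so report it to the server */
    }
    /* the loop runs until the client loses power; after reboot, "verify"
       compares the file contents with what the server has recorded */
}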



# on the server side
./diskchecker.pl -l [port]

# on the client side
./diskchecker.pl -s <server[:port]> create <file> <size_in_MB>


After the script starts, you need to cut power to the "client" and not restore it for several minutes. It is important to actually disconnect the machine under test from electricity, not just perform a hard shutdown. After some time, the client can be plugged back in and booted into the OS. Once the OS has booted, run diskchecker.pl again, this time with the verify argument.



./diskchecker.pl -s <server[:port]> verify <file>


At the end of the check you will see the number of errors. If it is 0, the disk has passed the test. To rule out a lucky coincidence, the experiment can be repeated several times.



Our S4500 showed no errors after power loss, so we can say it is ready for workloads with a lot of fsync calls.



Conclusion



When choosing disks or complete ready-made configurations, keep in mind the specifics of the tasks you need to solve. At first glance it seems obvious that NVMe, that is, an SSD with a PCIe interface, is faster than a "classic" SATA SSD. However, as we have seen today, under specific conditions and for specific tasks this may not be the case.



How do you test server components when renting from an IaaS provider?

We are waiting for you in the comments.





