Why Linux uses a swap file, part 2

The first part of this small "investigation" into how the virtual memory subsystem works, and how mmap, shared libraries and caches are connected, provoked such a heated discussion that I could not resist continuing the research in practice.



So today we will do just that, in the form of a tiny C program that we will write, compile and test in action, with and without swap.



The program does a very simple thing: it requests a large chunk of memory, touches it and actively works with it. To avoid having to load any real libraries, we will simply create a large file and map it into memory the way the system does when it loads shared libraries.



We then emulate calling code from this "library" by reading from the mmapped file.



The program runs several iterations; on each iteration it accesses both the “code” and one of the sections of a large data segment.



To avoid writing unnecessary code, we define two constants that set the size of the "code segment" and the total size of the RAM:



  • MEM_GBYTES - the size of the RAM for the test
  • LIB_GBYTES - "code" size


The amount of "data" we have is less than the amount of physical memory:



  • DATA_GBYTES = MEM_GBYTES - 2


The total amount of "code" and "data" is slightly larger than the amount of physical memory:



  • DATA_GBYTES + LIB_GBYTES = MEM_GBYTES + 1


For a test on a laptop, I took MEM_GBYTES = 16, and got the following characteristics:



  • MEM_GBYTES = 16
  • LIB_GBYTES = 3
  • DATA_GBYTES = 14, so the "data" takes 14GB, which by itself fits in memory
  • Swap size = 16GB


Program text



#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define GB              1073741824l

#define MEM_SIZE        16      /* MEM_GBYTES in the text above */
#define LIB_GBYTES      3
#define DATA_GBYTES     (MEM_SIZE - 2)

/* Touch one byte in every 4 KiB page of the data window and, for each of
 * them, one byte at a random offset of the "code" mapping. */
long random_read(char * code_ptr, char * data_ptr, size_t size) {
   long rbt = 0;
   for (unsigned long i=0 ; i<size ; i+=4096) {
       /* random() is below 2^31, so scale it by 8 to reach the whole mapping */
       rbt += code_ptr[(8l * random()) % size] + data_ptr[i];
   }
   return rbt;
}

int main(void) {
   size_t libsize = LIB_GBYTES * GB;
   size_t datasize = DATA_GBYTES * GB;
   int fd;
   char * dataptr;
   char * libptr;

   srandom(256);
   if ((fd = open("library.bin", O_RDONLY)) < 0) {
       printf("Required library.bin of size %zu\n", libsize);
       return 1;
   }

   /* The "code": a read-only file mapping, like a shared library */
   if ((libptr = mmap(NULL, libsize,
                     PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED) {
       printf("Failed to build libptr: %s\n", strerror(errno));
       return 1;
   }

   /* The "data": anonymous memory, which can only be evicted to swap */
   if ((dataptr = mmap(NULL, datasize,
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
                       -1, 0)) == MAP_FAILED) {
       printf("Failed to build dataptr: %s\n", strerror(errno));
       return 1;
   }

   printf("Preparing test ...\n");
   memset(dataptr, 0, datasize);   /* fault in every data page */
   printf("Doing test ...\n");

   /* Slide a window of libsize bytes over the data, 1 GB per iteration */
   unsigned long chunk_size = GB;
   unsigned long chunk_count = (DATA_GBYTES - 3) * GB / chunk_size;
   for (unsigned long chunk=0 ; chunk < chunk_count; chunk++) {
       printf("Iteration %lu of %lu\n", 1 + chunk, chunk_count);
       random_read(libptr, dataptr + (chunk * chunk_size), libsize);
   }
   return 0;
}

      
      





Test without using swap



Disable swap by setting vm.swappiness = 0 and run the test:

$ time ./swapdemo 
Preparing test ...
Killed

real 0m6,279s
user 0m0,459s
sys 0m5,791s
      
      







What happened? Setting swappiness = 0 disabled swapping: anonymous pages are no longer pushed out to swap, so the data always stays in memory. The problem is that the remaining 2GB was not enough for Chrome and VSCode running in the background, and the OOM killer killed the test program. Along the way, the lack of memory also buried the Chrome tab in which I was writing this article. I did not like that, even though autosave worked. I don't like it when my data gets "buried".



Swap enabled



Set vm.swappiness = 60 (the default).

Run the test:



$ time ./swapdemo 
Preparing test ...
Doing test ...
Iteration 1 of 11
Iteration 2 of 11
Iteration 3 of 11
Iteration 4 of 11
Iteration 5 of 11
Iteration 6 of 11
Iteration 7 of 11
Iteration 8 of 11
Iteration 9 of 11
Iteration 10 of 11
Iteration 11 of 11

real 1m55,291s
user 0m2,692s
sys 0m20,626s

      
      





A fragment of the top output:



Tasks: 298 total,   2 running, 296 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,6 us,  3,1 sy,  0,0 ni, 85,7 id, 10,1 wa,  0,5 hi,  0,0 si,  0,0 st
MiB Mem :  15670,0 total,    156,0 free,    577,5 used,  14936,5 buff/cache
MiB Swap:  16384,0 total,  12292,5 free,   4091,5 used.   3079,1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10393 viking    20   0   17,0g  14,2g  14,2g D  17,3  93,0   0:18.78 swapdemo
    136 root      20   0       0      0      0 S   9,6   0,0   4:35.68 kswapd0

      
      





Bad, bad Linux!!! It uses almost 4 gigabytes of swap even though it has 14 gigabytes of cache and 3 gigabytes available! Linux has the wrong settings! Bad outlingo, bad old admins: they don't understand anything, they told us to enable swap, and now the system swaps and works badly for me. Swap must be disabled, as advised by the much younger and more promising Internet experts, because they know exactly what to do!



Well ... so be it. Shall we turn swap off as far as possible, as the experts advise?



Test with almost no swap



We set vm.swappiness = 1.



With this value, anonymous pages are swapped out only when there is no other way out.



I trust Chris Down because I think he is a great engineer and knows what he is talking about when he explains that a swap file makes the system perform better. So, expecting that “something” would “go wrong” and that the system might work terribly inefficiently, I took precautions and ran the test program under a timer, so that I would at least see it terminate abnormally.



Let's look at the top output first:



Tasks: 302 total,   1 running, 301 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,2 us,  4,7 sy,  0,0 ni, 84,6 id, 10,0 wa,  0,4 hi,  0,0 si,  0,0 st
MiB Mem :  15670,0 total,    162,8 free,   1077,0 used,  14430,2 buff/cache
MiB Swap:  20480,0 total,  18164,6 free,   2315,4 used.    690,5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   6127 viking    20   0   17,0g  13,5g  13,5g D  20,2  87,9   0:10.24 swapdemo
    136 root      20   0       0      0      0 S  17,2   0,0   2:15.50 kswapd0

      
      





Hooray?! Only about 2.3 gigabytes of swap is used, almost half as much as in the test with swap enabled (swappiness = 60). Swap is used less. There is also less available memory. We could probably safely hand the victory to the young experts. But here is the strange thing: our program could not complete even 1 (ONE!) iteration in 2 (TWO!) minutes:



$ { sleep 120 ; killall swapdemo ; } &
[1] 6121
$ time ./swapdemo
Preparing test ...
Doing test ...
Iteration 1 of 11
[1]+  Done                    { sleep 120; killall swapdemo; }
Terminated

real	1m58,791s
user	0m0,871s
sys	0m23,998s
      
      





To repeat: the program could not complete a single iteration in 2 minutes, although in the previous test it completed 11 iterations in 2 minutes. In other words, with swap almost disabled the program runs more than 10 (!) times slower.



But there is one upside: not a single Chrome tab was harmed. And that is good.



Test with swap completely disabled



But maybe just "squeezing" swap through swappiness is not enough, and it should be disabled completely? Naturally, this theory should be tested too. Did we come here to run tests or not?



This is the ideal case:



  • we have no swap and all our data will be guaranteed in memory
  • the swap will not be used even accidentally, because it is not there


And now our test will finish with lightning speed, and the old-timers will go where they belong, to change cartridges, and make way for the young.



Unfortunately, the result of running the test program is the same: not even one iteration was completed.



Top output:



Tasks: 217 total,   1 running, 216 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,0 us,  2,2 sy,  0,0 ni, 85,2 id, 12,6 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :  15670,0 total,    175,2 free,    331,6 used,  15163,2 buff/cache
MiB Swap:      0,0 total,      0,0 free,      0,0 used.    711,2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    136 root      20   0       0      0      0 S  12,5   0,0   3:22.56 kswapd0
   7430 viking    20   0   17,0g  14,5g  14,5g D   6,2  94,8   0:14.94 swapdemo

      
      





Why is this happening



The explanation is very simple: the “code segment” that we attach via mmap (libptr) lives in the page cache. So when we forbid (or almost forbid) swapping one way or another, whether by physically disabling swap or through vm.swappiness = 0 | 1, it always ends in the same scenario: the mmapped file is evicted from the cache and then loaded from disk again. Libraries are loaded through exactly this mmap mechanism; to verify this, just run ls -l /proc/<pid>/map_files:



$ ls -l /proc/8253/map_files/ | head -n 10
total 0
lr-------- 1 viking viking 64   7 12:58 556799983000-55679998e000 -> /usr/libexec/gnome-session-binary
lr-------- 1 viking viking 64   7 12:58 55679998e000-5567999af000 -> /usr/libexec/gnome-session-binary
lr-------- 1 viking viking 64   7 12:58 5567999af000-5567999bf000 -> /usr/libexec/gnome-session-binary
lr-------- 1 viking viking 64   7 12:58 5567999c0000-5567999c4000 -> /usr/libexec/gnome-session-binary
lr-------- 1 viking viking 64   7 12:58 5567999c4000-5567999c5000 -> /usr/libexec/gnome-session-binary
lr-------- 1 viking viking 64   7 12:58 7fb22a033000-7fb22a062000 -> /usr/share/glib-2.0/schemas/gschemas.compiled
lr-------- 1 viking viking 64   7 12:58 7fb22b064000-7fb238594000 -> /usr/lib/locale/locale-archive
lr-------- 1 viking viking 64   7 12:58 7fb238594000-7fb2385a7000 -> /usr/lib64/gvfs/libgvfscommon.so
lr-------- 1 viking viking 64   7 12:58 7fb2385a7000-7fb2385c3000 -> /usr/lib64/gvfs/libgvfscommon.so

      
      





And, as we discussed in the first part of the article, when memory is actually short and swapping of anonymous pages is disabled, the system picks the only option left to it by the owner who disabled swap. That option is reclaiming (freeing) clean pages occupied by the data of mmap-loaded libraries.



Conclusion



The widespread "I carry everything with me" approach to software distribution (flatpak, snap, docker images) significantly increases the amount of code that is mapped via mmap.



As a result, "extreme optimizations" that involve tuning or disabling swap can produce completely unexpected effects. A swap file is a mechanism for optimizing the virtual memory subsystem under memory pressure, and available memory is not at all "unused memory": it is the sum of the cache and free memory.



By disabling the swap file, you do not "remove the wrong option", you "leave no options".



You should be very careful when interpreting process memory consumption data - VSS and RSS. They represent "current state" and not "optimal state".



If you do not want the system to use swap, add memory to it, but do not disable swap. Turning swap off when memory use is near the limit will make the situation much worse than it would have been if the system had swapped a little.



PS: In discussions people regularly ask "but what if you enable memory compression via zram ...". I got curious and ran the corresponding tests: if you enable both zram and swap, as Fedora does by default, the run time drops to about 1 minute.



But the reason is that zero-filled pages compress very well, so the data does not actually go to swap but is stored compressed in RAM. If you fill the data segment with random, poorly compressible data, the picture becomes less spectacular and the test run time rises again to 2 minutes, which is comparable to (and even slightly worse than) that of an "honest" swap file.


