About caches in ARM microcontrollers

Hello!



In the previous article, we used the processor cache to accelerate graphics on a microcontroller in Embox, working in "write-through" mode. We mentioned some of the advantages and disadvantages of "write-through", but that was only a cursory overview. In this article, as promised, I want to take a closer look at the cache types available in ARM microcontrollers and compare them. All of this will be considered from a programmer's point of view; we do not plan to go into the details of the memory controller here.



I'll start where I left off in the previous article, namely the difference between the "write-back" and "write-through" modes, since these two are used most often. In short:



  • "Write-back". Write data goes only to the cache. The actual write to memory is deferred until the cache becomes full and space is required for new data.
  • "Write-through". Writing occurs “simultaneously” to the cache and memory.


Write-through



The advantage of write-through is considered to be its ease of use, which potentially reduces errors. Indeed, in this mode the memory is always in a consistent state and does not require additional update procedures.



It would seem that this should noticeably hurt performance, but ST itself says in this document that it does not:

Write-through: triggers a write to the memory as soon as the contents on the cache line are written to. This is safer for the data coherency, but it requires more bus accesses. In practice, the write to the memory is done in the background and has a little effect unless the same cache set is being accessed repeatedly and very quickly. It is always a tradeoff.
That is, we initially assumed that since every write goes to memory, write performance would be about the same as with no cache at all, and the main gain would come from repeated reads. ST refutes this, however: the data reaches memory "in the background", so write performance is almost the same as in "write-back" mode. This may depend, in particular, on the internal buffers of the memory controller (FMC).



Disadvantages of the "write-through" mode:



  • Rapid repeated accesses to the same memory can degrade performance, whereas in "write-back" mode such accesses are, on the contrary, a plus.
  • As with "write-back", you still need to do a cache invalidate after DMA operations complete.
  • The "Data corruption in a sequence of Write-Through stores and loads" erratum in some revisions of Cortex-M7. It was pointed out to us by one of the LVGL developers.


Write-back



As mentioned above, in this mode (unlike "write-through") writes generally do not reach memory right away; they only go into the cache. This strategy, in turn, has two sub-options: 1) write allocate, 2) no write allocate. We will talk about these options below.



Write Allocate



As a rule, "read allocate" is always used in caches - that is, on a cache miss for reading, data is fetched from memory and placed in the cache. Likewise, a write miss can cause data to be loaded into the cache ("write allocate") or not loaded ("no write allocate").



In practice, the combinations "write-back write allocate" and "write-through no write allocate" are typically used. In the tests below we will try to check in a bit more detail in which situations "write allocate" should be used and in which "no write allocate".



MPU



Before moving on to the practical part, we need to figure out how to set the attributes of a memory region. To select the cache mode (or disable caching) for a specific region of memory, the ARMv7-M architecture uses the MPU (Memory Protection Unit).



The MPU lets you configure memory regions. In the ARMv7-M architecture there can be up to 16 of them. For each region you can independently set the start address, size, access rights (read / write / execute, etc.), and attributes: TEX, cacheable, bufferable, shareable, as well as other parameters. With this mechanism you can, in particular, achieve any caching type for a specific region. For example, we can get rid of the need to call cache_clean / cache_invalidate simply by allocating a region of memory for all DMA operations and marking that memory as non-cacheable, as in the sketch below.
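
For illustration, here is a rough sketch of such a configuration using the STM32Cube HAL naming (the same style as the snippet further below). The base address, size and region number are arbitrary values chosen for the example, not something Embox prescribes:

    #include "stm32f7xx_hal.h"   /* STM32Cube HAL, assumed here */

    static void mpu_config_dma_region(void)
    {
        MPU_Region_InitTypeDef MPU_InitStruct = {0};

        HAL_MPU_Disable();

        /* Example: a 64 KB non-cacheable region for DMA buffers */
        MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
        MPU_InitStruct.Number           = MPU_REGION_NUMBER0;
        MPU_InitStruct.BaseAddress      = 0x20020000;               /* must be naturally aligned */
        MPU_InitStruct.Size             = MPU_REGION_SIZE_64KB;
        MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
        MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
        MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;           /* Normal memory...  */
        MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE; /* ...non-cacheable  */
        MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
        MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
        MPU_InitStruct.SubRegionDisable = 0x00;
        HAL_MPU_ConfigRegion(&MPU_InitStruct);

        HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
    }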



An important point to note when working with MPU:

The base address, size and attributes of a region are all configurable, with the general rule that all regions are naturally aligned. This can be stated as:

RegionBaseAddress[(N-1):0] = 0, where N = log2(SizeOfRegion_in_bytes)
In other words, the starting address of a memory region must be aligned to its own size. If, for example, you have a 16 KB region, it must be aligned to 16 KB. If the region is 64 KB, align it to 64 KB, and so on. If this is not done, the MPU may silently "crop" the region to the size matching its starting address (verified in practice). The small check below illustrates the rule.
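
A minimal sketch of this rule (assuming, as the MPU requires, that the region size is a power of two):

    #include <stdint.h>

    /* Returns non-zero if 'base' is naturally aligned to 'size' (size must be a power of two) */
    static int mpu_region_is_aligned(uint32_t base, uint32_t size)
    {
        return ((size & (size - 1)) == 0) && ((base & (size - 1)) == 0);
    }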



By the way, there are several bugs in STM32Cube. For instance:



  MPU_InitStruct.BaseAddress = 0x20010000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_256KB;


You can see that the start address is only 64 KB aligned, while the region size is meant to be 256 KB. In this case, you have to create three regions: the first 64 KB, the second 128 KB, and the third 64 KB (see the sketch below).
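
A rough sketch of that split, again in STM32Cube HAL terms (region numbers are arbitrary, and the remaining attributes are assumed to be filled in as in the single-region sketch above):

    /* 256 KB starting at 0x20010000, covered by three naturally aligned regions */
    MPU_InitStruct.Number      = MPU_REGION_NUMBER1;
    MPU_InitStruct.BaseAddress = 0x20010000;             /* aligned to 64 KB  */
    MPU_InitStruct.Size        = MPU_REGION_SIZE_64KB;
    HAL_MPU_ConfigRegion(&MPU_InitStruct);

    MPU_InitStruct.Number      = MPU_REGION_NUMBER2;
    MPU_InitStruct.BaseAddress = 0x20020000;             /* aligned to 128 KB */
    MPU_InitStruct.Size        = MPU_REGION_SIZE_128KB;
    HAL_MPU_ConfigRegion(&MPU_InitStruct);

    MPU_InitStruct.Number      = MPU_REGION_NUMBER3;
    MPU_InitStruct.BaseAddress = 0x20040000;             /* aligned to 64 KB  */
    MPU_InitStruct.Size        = MPU_REGION_SIZE_64KB;
    HAL_MPU_ConfigRegion(&MPU_InitStruct);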



You only need to describe regions whose properties differ from the defaults. The point is that the ARM architecture already defines the attributes of all memory areas when the processor cache is enabled; there is a standard set of properties (this is why, for example, the STM32F7 SRAM is in "write-back write-allocate" mode by default). So if you need a non-standard mode for some memory, you set its properties via the MPU. You can also carve out a sub-area inside an existing region by defining another, higher-priority region with the required properties over it.
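
For reference, here is roughly how the cache modes discussed in this article map onto the MPU attribute fields used above. This is a sketch based on the ARMv7-M TEX/C/B encodings for Normal memory; double-check it against the ARMv7-M Architecture Reference Manual before relying on it:

    /* Write-back, write allocate (TEX=001, C=1, B=1) */
    MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
    MPU_InitStruct.IsCacheable  = MPU_ACCESS_CACHEABLE;
    MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;

    /* Write-back, no write allocate (TEX=000, C=1, B=1) */
    MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
    MPU_InitStruct.IsCacheable  = MPU_ACCESS_CACHEABLE;
    MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;

    /* Write-through, no write allocate (TEX=000, C=1, B=0) */
    MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
    MPU_InitStruct.IsCacheable  = MPU_ACCESS_CACHEABLE;
    MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;

    /* Normal non-cacheable (TEX=001, C=0, B=0) */
    MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
    MPU_InitStruct.IsCacheable  = MPU_ACCESS_NOT_CACHEABLE;
    MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;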



TCM



As follows from the documentation (section 2.3, Embedded SRAM), the first 64 KB of SRAM in the STM32F7 is non-cacheable. In the ARMv7-M architecture, SRAM lives at address 0x20000000. TCM is also SRAM, but it sits on a separate bus from the rest of the memories (SRAM1 and SRAM2) and is located "closer" to the processor. Because of this, the memory is very fast, in fact as fast as the cache, so caching it is unnecessary and this region cannot be made cacheable. In effect, TCM acts as another cache of sorts.



Instruction cache



Note that everything discussed above applies to the data cache (D-Cache). Besides the data cache, however, ARMv7-M also provides an instruction cache, I-Cache, which keeps executed (and prefetched) instructions in the cache and can significantly speed up a program. Especially when the code is in memory slower than the internal FLASH, for example QSPI.



To reduce unpredictability in the cache tests below, we will intentionally disable I-Cache and deal exclusively with data.



At the same time, I want to note that enabling I-Cache is quite simple and, unlike D-Cache, does not require any additional MPU configuration; a sketch of both is shown below.
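
For reference, with the CMSIS core functions for Cortex-M7 enabling either cache is a one-liner (in Embox this is done by the corresponding subsystem, so the sketch below is only an illustration):

    SCB_EnableICache();   /* instruction cache: no MPU configuration needed          */
    SCB_EnableDCache();   /* data cache: attributes come from the default memory map */
                          /* or from the MPU regions configured above                */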



Synthetic tests



Having discussed the theory, let's move on to tests to better understand the difference between the modes and their areas of applicability. As I said above, we disable I-Cache and work only with D-Cache. I also intentionally compile with -O0 so that the loops in the tests are not optimized away. We test against external SDRAM. Using the MPU, I marked out a 64 KB region, and we will set the attributes we need on this region.



Since cache tests are very capricious and are affected by everything in the system, let's make the code linear and uninterrupted. To do this, we disable interrupts. We will also measure time not with timers but with the DWT (Data Watchpoint and Trace) unit, which has a 32-bit counter of processor cycles; people on the Internet build microsecond delays in drivers on top of it. At a system frequency of 216 MHz the counter overflows fairly quickly, but it still lets you measure up to about 20 seconds. Let's just keep that in mind, run the tests within that interval, and zero the cycle counter before each run, as in the sketch below.
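
A minimal sketch of such a measurement with the CMSIS register names (the actual test harness in Embox may differ in details):

    /* CMSIS core definitions are pulled in via the device header, e.g. stm32f7xx.h */

    static void cycles_start(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT/trace block */
        DWT->CYCCNT = 0;                                 /* zero the cycle counter     */
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting             */
    }

    static uint32_t cycles_get(void)
    {
        return DWT->CYCCNT;  /* wraps after 2^32 cycles, i.e. ~20 s at 216 MHz */
    }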



The complete test code can be viewed here. All tests were run on the 32F769IDISCOVERY board.



Non-cacheable memory VS. write-back



So let's start with some very simple tests.



We just write sequentially to memory.



    dst = (uint8_t *) DATA_ADDR;

    for (i = 0; i < ITERS * 8; i++) {
        for (j = 0; j < DATA_LEN; j++) {
            *dst = VALUE;
            dst++;
        }
        dst -= DATA_LEN;
    }


We also write sequentially to memory, but now we unroll the loop a little: each byte is written four times per iteration (the "4 writes per one iteration" test in the output below).



    for (i = 0; i < ITERS * BLOCKS * 8; i++) {
        for (j = 0; j < BLOCK_LEN; j++) {
            *dst = VALUE;
            *dst = VALUE;
            *dst = VALUE;
            *dst = VALUE;
            dst++;
        }
        dst -= BLOCK_LEN;
    }


We also write sequentially to memory, but now we will also add reading.



    for (i = 0; i < ITERS * BLOCKS * 8; i++) {
        dst = (uint8_t *) DATA_ADDR;

        for (j = 0; j < BLOCK_LEN; j++) {
            val = VALUE;
            *dst = val;
            val = *dst;
            dst++;
        }
    }


If you run these three tests, they give practically the same results regardless of which mode you choose:



mode: nc, iters=100, data len=65536, addr=0x60100000
Test1 (Sequential write):
  0s 728ms
Test2 (Sequential write with 4 writes per one iteration):
  7s 43ms
Test3 (Sequential read/write):
  1s 216ms


And this is reasonable: SDRAM is not that slow, especially considering the internal buffers of the FMC through which it is connected. Nevertheless, I expected at least a slight variation in the numbers, but on these tests there was none. Well, let's think further.



Let's try to "spoil" the life of SDRAM by mixing reads and writes. To do this, let's expand the loops and add such a common thing in practice as the increment of an array element:



    for (i = 0; i < ITERS * BLOCKS; i++) {
        for (j = 0; j < BLOCK_LEN; j++) {
            // 16 lines
            arr[i]++;
            arr[i]++;
            ***
            arr[i]++;
        }
    }


Result:



Non-cacheable:   4s 743ms
Write-back:      4s 187ms


Already better: with the cache it turned out about half a second faster. Let's complicate the test further by adding accesses at "scattered" indices. For example, with one such index:



    for (i = 0; i < ITERS * BLOCKS; i++) {
        for (j = 0; j < BLOCK_LEN; j++) {
            arr[i + 0 ]++;
            ***
            arr[i + 3 ]++;
            arr[i + 4 ]++;
            arr[i + 100]++;
            arr[i + 6 ]++;
            arr[i + 7 ]++;
            ***
            arr[i + 15]++;
        }
    }


Result:



Non-cacheable:   11s 371ms
Write-back:       4s 551ms


Now the difference with the cache has become more than noticeable! And to top it off, we introduce a second such index:



    for (i = 0; i < ITERS * BLOCKS; i++) {
        for (j = 0; j < BLOCK_LEN; j++) {
            arr[i + 0 ]++;
            ***
            arr[i + 4 ]++;
            arr[i + 100]++;
            arr[i + 6 ]++;
            ***
            arr[i + 9 ]++;
            arr[i + 200]++;
            arr[i + 11]++;
            arr[i + 12]++;
            ***
            arr[i + 15]++;
        }
    }


Result:



Non-cacheable:   12s 62ms
Write-back:       4s 551ms


We see how the time for non-cached memory has grown by almost a second, while for the cache it remains the same.



Write allocate VS. no write allocate



Now let's look at the "write allocate" mode. The difference is even harder to see here: while the difference between non-cacheable memory and "write-back" became clearly visible starting from the fourth test, the difference between "write allocate" and "no write allocate" has not shown up in any of the tests so far. Let's think: when would "write allocate" be faster? For example, when there are many writes to sequential memory locations and few reads from them. In that case, in "no write allocate" mode we get constant write misses, and only the elements actually read end up in the cache. Let's simulate this situation:



    for (i = 0; i < ITERS * BLOCKS; i++) {
        for (j = 0; j < BLOCK_LEN; j++) {
            arr[j + 0 ]  = VALUE;
            ***
            arr[j + 7 ]  = VALUE;
            arr[j + 8 ]  = arr[i % 1024 + (j % 256) * 128];
            arr[j + 9 ]  = VALUE;
            ***
            arr[j + 15 ]  = VALUE;
        }
    }


Here, 15 out of 16 writes store the constant VALUE, while reads come from different elements, unrelated to the writes: arr[i % 1024 + (j % 256) * 128]. So with the "no write allocate" strategy, only those elements get loaded into the cache. The reason for this kind of indexing (i % 1024 + (j % 256) * 128) is to "degrade the speed" of the FMC / SDRAM: accesses at significantly different (non-sequential) addresses can noticeably slow things down.



Result:



Write-back (write allocate):    4s 720ms
Write-back no write allocate:   4s 888ms


Finally we got a difference; it is not huge, but it is visible. That is, our hypothesis was confirmed.



And finally, the most difficult case, in my opinion. We want to understand when "no write allocate" is better than "write allocate". The former wins if we "often" access addresses that we are not going to work with in the near future: such data does not need to be cached.



In the next test, in the "write allocate" case, data will be brought into the cache on both reads and writes. I made the "arr2" array 64 KB, so the cache will constantly be evicted to make room for new data. In the "no write allocate" case, I made the "arr" array 4096 bytes, and only it will end up in the cache, which means the cache contents will not have to be flushed back to memory. Thanks to this, we will try to get at least a small win.



    arr = (uint8_t *) DATA_ADDR;
    arr2 = arr;

    for (i = 0; i < ITERS * BLOCKS; i++) {
        for (j = 0; j < BLOCK_LEN; j++) {
            arr2[i * BLOCK_LEN            ] = arr[j + 0 ];
            arr2[i * BLOCK_LEN + j*32 + 1 ] = arr[j + 1 ];
            arr2[i * BLOCK_LEN + j*64 + 2 ] = arr[j + 2 ];
            arr2[i * BLOCK_LEN + j*128 + 3] = arr[j + 3 ];
            arr2[i * BLOCK_LEN + j*32 + 4 ] = arr[j + 4 ];
            ***
            arr2[i * BLOCK_LEN + j*32 + 15] = arr[j + 15 ];
        }
    }


Result:



Write-back (write allocate):    7s 601ms
Write-back no write allocate:   7s 599ms


It can be seen that the "write-back" "no write allocate" mode is slightly faster. But the main thing is that it is faster at all.



I didn't get a better demonstration, but I'm sure there are practical situations where the difference is more tangible. Readers can suggest their own options!



Practical examples



Let's move from synthetic to real examples.



ping



One of the simplest is ping: it is easy to run, and the times can be viewed directly on the host. Embox was built with -O2 optimization. Here are the results right away:



Non-cacheable:   ~0.246 s
Write-back:      ~0.140 s


OpenCV



Another example of a real task on which we wanted to try the cache subsystem is OpenCV on STM32F7. That article showed that running it is quite possible, but the performance was rather low. For the demonstration we use the standard example that extracts edges with the Canny detector. Let's measure the running time with and without caches (both D-Cache and I-Cache).



    gettimeofday(&tv_start, NULL);

    cedge.create(image.size(), image.type());
    cvtColor(image, gray, COLOR_BGR2GRAY);

    blur(gray, edge, Size(3,3));
    Canny(edge, edge, edgeThresh, edgeThresh*3, 3);
    cedge = Scalar::all(0);

    image.copyTo(cedge, edge);

    gettimeofday(&tv_cur, NULL);
    timersub(&tv_cur, &tv_start, &tv_cur);


Without cache:



> edges fruits.png 20 
Processing time 0s 926ms
Framebuffer: 800x480 32bpp
Image: 512x269; Threshold=20


With cache:



> edges fruits.png 20 
Processing time 0s 134ms
Framebuffer: 800x480 32bpp
Image: 512x269; Threshold=20


That is, 926 ms versus 134 ms: the speedup is almost 7x.



In fact, we are often asked about OpenCV on STM32, in particular about its performance. The resulting FPS is certainly not high, but getting 5 frames per second is quite realistic.



Non-cacheable memory, or cacheable memory with cache clean/invalidate?



Real devices make heavy use of DMA, and it naturally brings complications, because memory has to be synchronized even in "write-through" mode. Hence the natural desire to simply allocate a piece of memory that will never be cached and use it for DMA. A small digression: in Linux this is done via the dma_alloc_coherent() function. And it is a very effective approach; for example, when working with network packets in an OS, user data goes through a long processing path before reaching the driver, and in the driver the prepared data with all its headers is copied into buffers that use non-cached memory.
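
For reference, a minimal sketch of how a Linux driver obtains such a coherent buffer; 'dev' and the buffer size here are placeholders, and the code is assumed to live in a function (such as probe) that returns an int:

    #include <linux/dma-mapping.h>

    dma_addr_t dma_handle;
    void *buf;

    /* 'dev' is the peripheral's struct device; 4096 bytes is an example size */
    buf = dma_alloc_coherent(dev, 4096, &dma_handle, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* ... the CPU fills 'buf', the hardware is programmed with 'dma_handle' ... */

    dma_free_coherent(dev, 4096, buf, dma_handle);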



Are there cases when clean / invalidate is preferable in a driver that uses DMA? Yes, there are. For example video memory, which is what prompted us to take a closer look at how the cache works. In double-buffering mode the system has two buffers into which it draws in turn, and then hands the finished one to the video controller. If you make such memory non-cacheable, there will be a drop in drawing performance. Therefore, it is better to do a cache clean before handing the buffer to the video controller.
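
With the CMSIS helpers for Cortex-M7 this boils down to something like the sketch below (buffer pointers and sizes are placeholders; addresses and sizes should be aligned to the 32-byte cache line):

    /* Before handing a freshly drawn frame buffer to the video controller:
     * push the dirty cache lines out to memory. */
    SCB_CleanDCache_by_Addr((uint32_t *)frame_buf, frame_buf_size);

    /* Conversely, after a DMA transfer has written into a cacheable receive buffer:
     * drop the stale cache lines so the CPU reads the new data. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, rx_buf_size);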



Conclusion



We have figured out a bit about the different cache types in ARMv7-M: write-back, write-through, and the "write allocate" / "no write allocate" settings. We built synthetic tests to find out when one mode is better than another, and looked at practical examples with ping and OpenCV. At Embox we are still working on this topic, so the corresponding subsystem is still being refined. However, the advantages of using caches are definitely noticeable.



All examples can be viewed and reproduced by building Embox from the open repository.



PS



If you are interested in systems programming and OSDev, the OS Day conference takes place tomorrow! This year it is online, so anyone who wishes can join. Embox will present tomorrow at 12:00.


