Understanding the graphics subsystem of microcontrollers

Hello!



In this article, I would like to talk about the peculiarities of implementing a graphical user interface with widgets on a microcontroller, and how to get both a familiar user interface and a decent FPS. I would like to focus not on any specific graphics library, but on general things: memory, processor cache, DMA, and so on. Since I am a developer on the Embox team, the examples and experiments will use this RTOS.





Earlier we already talked about running the Qt library on a microcontroller. The animation turned out quite smooth, but the memory cost, even just for storing the firmware, was significant: the code was executed from external QSPI flash. Of course, when a complex, multifunctional interface is required, perhaps with some animation, the cost in hardware resources can be quite justified (especially if you already have that code developed for Qt).



But what if you don't need all of Qt's functionality? What if you have four buttons, one volume control and a couple of popup menus, but you still want it to “look nice and work fast”? :) Then it makes sense to use more lightweight tools, for example the lvgl library or something similar.



Some time ago we ported Nuklear to our Embox project. Nuklear is a very lightweight, single-header library that makes it easy to create a simple GUI. We decided to use it to build a small application with a widget containing a set of graphic elements, controllable via a touchscreen.



The STM32F7-Discovery board, with a Cortex-M7 core and a touchscreen, was chosen as the platform.



First optimizations. Saving memory



So, the graphics library is chosen, and so is the platform. Now let's figure out what resources we have. It is worth noting that the main SRAM is many times faster than the external SDRAM, so if the screen size allows it, the framebuffer is best placed in SRAM. Our screen has a resolution of 480x272. With 4 bytes per pixel we get about 512 KB, while the internal RAM is only 320 KB, so it is immediately clear that the video memory would have to be external. Another option is to reduce the color depth to 16 bits (i.e. 2 bytes per pixel), bringing memory consumption down to 256 KB, which can already fit into the main RAM.



The first thing to try is to save on everything: make a 256 KB video buffer, place it in internal RAM and draw directly into it. The problem we immediately ran into was the “flickering” of the scene that occurs when drawing straight into video memory. Nuklear redraws the entire scene from scratch, so each time the whole screen is filled first, then the widget is drawn, then a button is placed in it, then text inside the button, and so on. As a result, the naked eye can see the whole scene being redrawn and the picture “blinks”. That is, simply placing the framebuffer in internal memory does not save us.



Intermediate buffer. Compiler optimizations. FPU



After fiddling with the previous method (drawing directly into internal memory) for a bit, memories of X Server and Wayland came to mind. Indeed, window managers essentially handle requests from clients (our application, in this case) and then assemble the elements into the final scene. For example, the Linux kernel sends events from input devices to the server via the evdev driver. The server, in turn, determines which client to deliver the event to. Clients, having received an event (for example, a press on a touchscreen), execute their internal logic: they highlight a button, display a new menu. Then (slightly differently for X and Wayland) either the client itself or the server draws the changes into a buffer, and the compositor puts all the pieces together for drawing to the screen. A simple and schematic explanation can be found here.



It became clear that we need similar logic, but we really don't want to squeeze an X Server into an stm32 for the sake of a small application. So let's simply draw not into video memory, but into ordinary memory, and after the entire scene is rendered, copy the buffer into video memory.



Widget code
        if (nk_begin(&rawfb->ctx, "Demo", nk_rect(50, 50, 200, 200),
            NK_WINDOW_BORDER|NK_WINDOW_MOVABLE|
            NK_WINDOW_CLOSABLE|NK_WINDOW_MINIMIZABLE|NK_WINDOW_TITLE)) {
            enum {EASY, HARD};
            static int op = EASY;
            static int property = 20;
            static float value = 0.6f;

            if (mouse->type == INPUT_DEV_TOUCHSCREEN) {
                /* Do not show cursor when using touchscreen */
                nk_style_hide_cursor(&rawfb->ctx);
            }

            nk_layout_row_static(&rawfb->ctx, 30, 80, 1);
            if (nk_button_label(&rawfb->ctx, "button"))
                fprintf(stdout, "button pressed\n");
            nk_layout_row_dynamic(&rawfb->ctx, 30, 2);
            if (nk_option_label(&rawfb->ctx, "easy", op == EASY)) op = EASY;
            if (nk_option_label(&rawfb->ctx, "hard", op == HARD)) op = HARD;
            nk_layout_row_dynamic(&rawfb->ctx, 25, 1);
            nk_property_int(&rawfb->ctx, "Compression:", 0, &property, 100, 10, 1);

            nk_layout_row_begin(&rawfb->ctx, NK_STATIC, 30, 2);
            {
                nk_layout_row_push(&rawfb->ctx, 50);
                nk_label(&rawfb->ctx, "Volume:", NK_TEXT_LEFT);
                nk_layout_row_push(&rawfb->ctx, 110);
                nk_slider_float(&rawfb->ctx, 0, &value, 1.0f, 0.1f);
            }
            nk_layout_row_end(&rawfb->ctx);
        }
        nk_end(&rawfb->ctx);
        if (nk_window_is_closed(&rawfb->ctx, "Demo")) break;

        /* Draw framebuffer */
        nk_rawfb_render(rawfb, nk_rgb(30,30,30), 1);

        memcpy(fb_info->screen_base, fb_buf, width * height * bpp);




This example creates a 200 x 200 px window and draws graphic elements into it. The final scene is rendered into the fb_buf buffer, which we allocated in SDRAM. Then, in the last line, memcpy is simply called, and the whole thing repeats in an endless loop.



If we just build and run this example, we get about 10-15 FPS, which is certainly not great, since it is noticeable even with the naked eye. Moreover, since the Nuklear rendering code contains a lot of floating-point calculations, we enabled FPU support from the start; without it the FPS would have been even lower. The first and simplest (free) optimization is, of course, the -O2 compiler flag.



Let's build and run the same example: we get 20 FPS. Better, but still not enough for smooth operation.



Enabling processor caches. Write-Through Mode



Before moving on to further optimizations, I will note that we are using the rawfb plugin that comes with Nuklear, which draws directly into memory. Accordingly, memory optimizations look very promising, and the first thing that comes to mind is the cache.



The higher-end Cortex-M cores, such as the Cortex-M7 (our case), have an additional built-in processor cache (an instruction cache and a data cache). It is enabled through the CCR register of the System Control Block. But enabling the cache brings a new problem: inconsistency between the data in the cache and the data in memory. There are several ways to manage the cache, but in this article I will not dwell on them and will move on to one of the simplest, in my opinion. To solve the cache/memory inconsistency problem, we could simply mark all available memory as “non-cacheable”. That would mean all writes to this memory always go to memory and not to the cache, but then there would be no point in the cache at all. There is another option: the write-through mode, in which all writes to memory marked as write-through go both to the cache and to memory. This creates a write overhead, but on the other hand greatly speeds up reads, so the result will depend on the specific application.
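On a Cortex-M7 the memory attributes are set through the MPU. A minimal configuration sketch, assuming CMSIS headers (`core_cm7.h` and its ARMv7-M MPU helpers) are available; the SDRAM base address and region size here are illustrative, not taken from the article:

```c
/* Sketch: mark an SDRAM region as write-through on a Cortex-M7 (ARMv7-M MPU,
 * CMSIS-style). TEX=0, C=1, B=0 encodes write-through, no write-allocate. */
#include "core_cm7.h"   /* assumed provided by the BSP/CMSIS pack */

void sdram_enable_write_through(void) {
    ARM_MPU_Disable();
    /* Region 0: hypothetical 8 MB SDRAM at 0xC0000000 */
    MPU->RBAR = ARM_MPU_RBAR(0, 0xC0000000);
    MPU->RASR = ARM_MPU_RASR(0 /* exec ok */, ARM_MPU_AP_FULL,
                             0 /* TEX */, 0 /* not shareable */,
                             1 /* cacheable */, 0 /* not bufferable */,
                             0x00 /* no subregions disabled */,
                             ARM_MPU_REGION_SIZE_8MB);
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk); /* default map for other memory */
    SCB_EnableICache();
    SCB_EnableDCache();
}
```

This is a hardware configuration fragment, so it only runs on the target; setting B=1 instead of 0 would give the write-back behavior the next paragraph experiments with.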



For Nuklear, write-through mode turned out very well: performance went up from 20 FPS to 45 FPS, which is already quite good and smooth. Out of curiosity, we even tried full caching without write-through, ignoring the data inconsistency, but the FPS rose only to 50, that is, there was no significant gain compared to write-through. From this we concluded that our application performs many read operations rather than writes. Where, exactly? Perhaps it is the number of transformations in the rawfb code, which frequently access memory to read the next coefficient or something similar.



Double buffering (so far with an intermediate buffer). Enabling DMA



I didn't want to stop at 45 FPS, so we decided to experiment further. The next idea was double buffering. The idea is widely known and, in general, simple: draw the scene with one device into one buffer while another device displays from a second buffer. If you look at the previous code, you can clearly see the loop in which the scene is first drawn into a buffer, and then its contents are copied into video memory with memcpy. Clearly, memcpy uses the CPU, that is, rendering and copying happen sequentially. Our idea was that the copying could be done in parallel using DMA: while the processor draws the next scene, the DMA copies the previous one into video memory.



Memcpy is replaced with the following code:



            while (dma_in_progress()) {
            }

            ret = dma_transfer((uint32_t) fb_info->screen_base,
                    (uint32_t) fb_buf[fb_buf_idx], (width * height * bpp) / 4);
            if (ret < 0) {
                printf("DMA transfer failed\n");
            }

            fb_buf_idx = (fb_buf_idx + 1) % 2;


Here the buffer index fb_buf_idx is introduced: fb_buf_idx = 0 is the front buffer, fb_buf_idx = 1 is the back buffer. The dma_transfer() function takes a destination, a source, and a number of 32-bit words. Then the DMA is programmed with the required data, and work continues with the next buffer.



After trying this mechanism, performance increased to about 48 FPS. Slightly better than memcpy(), but only slightly. I don't mean to say that DMA turned out to be useless, but in this particular example the cache had a larger effect on the overall picture.



After some surprise that DMA performed worse than expected, we came up with what seemed at the time like an “excellent” idea: use several DMA channels. What's the point? The amount of data that can be loaded into the DMA at one time on stm32f7xx is 256 KB. At the same time, remember that the screen is 480x272 and the video memory is about 512 KB, so it would seem you could put the first half of the data into one DMA channel and the second half into the other. And everything seems fine... but performance drops from 48 FPS to 25-30 FPS. That is, we are back to the situation before the cache was enabled. What could this be connected with? With the fact that access to SDRAM is synchronized; the memory is even called Synchronous Dynamic Random Access Memory (SDRAM). So this option only adds extra synchronization without making the writes to memory parallel, as we had hoped. After a little reflection, we realized that there is nothing surprising here: the memory is a single chip, the write and read cycles go to it over one bus, and since another source/receiver is added, the arbiter that resolves access to the bus has to interleave command cycles from the different DMA channels.



Double buffering. Working with LTDC



Copying from an intermediate buffer is certainly good, but as we found out, it is not enough. Let's look at another obvious improvement: double buffering proper. In the vast majority of modern display controllers, you can set the address of the video memory in use. Thus, you can avoid copying altogether and simply point the video memory address at the prepared buffer; the display controller will then fetch the data by itself, via DMA, in the way that is optimal for it. This is real double buffering, without the intermediate buffer we had before. There is also the variant where the display controller has two or more buffers, which is essentially the same thing: we write into one buffer while the controller uses the other, and no copying is required.



The LTDC (LCD-TFT display controller) in the stm32f74xx has two hardware overlay layers, Layer 1 and Layer 2, where Layer 2 is overlaid on Layer 1. Each layer is configured independently and can be enabled or disabled separately. We first tried enabling only Layer 1 and switching its video memory address between the front and back buffers. That is, we hand one buffer to the display while drawing into the other at the same time. But we got noticeable jitter when switching.



We then tried the variant with both layers, turning one of them on and off: each layer has its own fixed video memory address, and the buffer is switched by enabling one layer while disabling the other. This variant also resulted in jitter. Finally, we tried the option where the layer is never disabled; instead its alpha channel is set to either 0 or the maximum (255), that is, we controlled transparency, making one of the layers invisible. But this option did not live up to expectations either: the jitter was still present.



The reason was unclear: the documentation says that layer state updates can be performed on the fly. We made a simple test: we turned off the caches and floating point, drew a static picture with a green square in the center of the screen, identical in both Layer 1 and Layer 2, and began switching layers in a loop, hoping to get a static image. But we got the same jitter again.



It became clear that the cause was something else. And then we remembered the alignment of the framebuffer address in memory. Since the buffers were allocated from the heap, their addresses were not aligned. We aligned them to 1 KB and got the expected picture without jitter. Then we found in the documentation that LTDC reads data in bursts of 64 bytes, and that unaligned data causes a significant loss in performance. In that case, both the start address of the framebuffer and its line width must be aligned. To test this, we changed the line width from 480x4 to 470x4 bytes, which is not divisible by 64, and got the same jitter.



As a result, we aligned both buffers to 64 bytes, made sure the line width was also a multiple of 64 bytes, and ran Nuklear: the jitter disappeared. The solution that worked looks like this. Instead of switching between layers by completely disabling either Layer 1 or Layer 2, use transparency: to disable a layer, set its transparency to 0, and to enable it, set it to 255.



        BSP_LCD_SetTransparency_NoReload(fb_buf_idx, 0xff);

        fb_buf_idx = (fb_buf_idx + 1) % 2;

        BSP_LCD_SetTransparency(fb_buf_idx, 0x00);


We got 70-75 FPS! Much better than the original 15.



It should be noted that the working solution is the one based on transparency control; the variants that disable one of the layers or switch the layer address give picture jitter at FPS above 40-50, for a reason currently unknown to us. Also, looking ahead, I will say that this solution is specific to this board.



Hardware scene fill via DMA2D



But this is not the limit. Our last optimization for increasing FPS is hardware scene fill. Before this, we did the fill in software:

nk_rawfb_render(rawfb, nk_rgb(30,30,30), 1);


Let's now tell the rawfb plugin not to fill the scene itself, but only to draw over it:

nk_rawfb_render(rawfb, nk_rgb(30,30,30), 0);


We will fill the scene with the same color, 0xff303030, but in hardware, via the DMA2D controller. One of the main functions of DMA2D is copying or filling a rectangle in RAM. The convenience here is that a rectangle in the framebuffer is not a contiguous piece of memory: it lies in memory with gaps between the lines, so ordinary DMA cannot do it in a single transfer. In Embox we had not worked with this device yet, so let's just use the STM32Cube facilities: the BSP_LCD_Clear(uint32_t Color) function, which programs the fill color and the size of the entire screen into DMA2D.



Vertical Blanking Period (VBLANK)



But even at 80 FPS a noticeable problem remained: parts of the widget “tore” when moving across the screen. That is, the widget seemed to be divided into 3 (or more) pieces that moved side by side but with a slight delay. It turned out the reason was an incorrect video memory update, or more precisely, updates at the wrong moments in time.



The display controller has a property called VBLANK, also known as VBI or the Vertical Blanking Period. It is the time interval between adjacent video frames, or more precisely, the time between the last line of the previous video frame and the first line of the next. During this interval no new data is transferred to the display and the picture is static. For this reason, it is safe to update the video memory inside VBLANK.



In practice, the LTDC controller has an interrupt that can be configured to trigger after a given framebuffer line is processed (the LTDC line interrupt position configuration register, LTDC_LIPCR). Thus, if we configure this interrupt on the last visible line, we catch exactly the beginning of the VBLANK interval, and at that point we perform the necessary buffer switch.



As a result of these changes, the picture returned to normal and the tears were gone, but the FPS dropped from 80 to 60. Let's figure out the reason for this behavior.



The following formula can be found in the documentation:



          LCD_CLK (MHz) = total_screen_size * refresh_rate,


where total_screen_size = total_width x total_height. LCD_CLK is the frequency at which the display controller pushes pixels from video memory to the screen (for example, via the Display Serial Interface (DSI)), while refresh_rate is the refresh rate of the screen itself, its physical characteristic. It turns out that, knowing the refresh rate of the screen and its dimensions, you can configure the clock for the display controller. After checking the registers of the configuration that STM32Cube creates, we found that it tunes the controller for a 60 Hz screen. So it all added up.



A little about input devices in our example



Let's go back to our application and look at how the touchscreen works, because, as you understand, a modern interface implies interactivity, that is, interaction with the user.



Everything is arranged quite simply here: events from input devices are processed in the main program loop, immediately before rendering the scene:



        /* Input */
        nk_input_begin(&rawfb->ctx);
        {
            switch (mouse->type) {
            case INPUT_DEV_MOUSE:
                handle_mouse(mouse, fb_info, rawfb);
                break;
            case INPUT_DEV_TOUCHSCREEN:
                handle_touchscreen(mouse, fb_info, rawfb);
                break;
            default:
                /* Unreachable */
                break;
            }
        }
        nk_input_end(&rawfb->ctx);


The actual handling of touchscreen events happens in the handle_touchscreen() function:



handle_touchscreen
static void handle_touchscreen(struct input_dev *ts, struct fb_info *fb_info,
        struct rawfb_context *rawfb) {
    struct input_event ev;
    int type;
    static int x = 0, y = 0;

    while (0 <= input_dev_event(ts, &ev)) {
        type = ev.type & ~TS_EVENT_NEXT;

        switch (type) {
        case TS_TOUCH_1:
            x = normalize_coord((ev.value >> 16) & 0xffff, 0, fb_info->var.xres);
            y = normalize_coord(ev.value & 0xffff, 0, fb_info->var.yres);
            nk_input_button(&rawfb->ctx, NK_BUTTON_LEFT, x, y, 1);
            nk_input_motion(&rawfb->ctx, x, y);
            break;
        case TS_TOUCH_1_RELEASED:
            nk_input_button(&rawfb->ctx, NK_BUTTON_LEFT, x, y, 0);
            break;
        default:
            break;
        }

    }
}




In essence, this is where input device events are converted into a format that Nuklear understands. That's probably all there is to it.



Launch on another board



Having obtained quite decent results, we decided to reproduce them on another board. We had a similar one, the STM32F769I-DISCO, with the same LTDC controller but a different screen, with a resolution of 800x480. After launching, we got 25 FPS. That is a noticeable drop in performance, easily explained by the size of the framebuffer: it is almost 3 times larger. But the main problem turned out to be different: the image was heavily distorted, with no static picture even when the widget should have been standing still.



The reason was unclear, so we went to look at the standard examples from STM32Cube. There was a double-buffering example for this very board. In it, the developers, unlike our transparency method, simply move the framebuffer pointer on the VBLANK interrupt. We had already tried this method earlier on the first board, where it did not work, but using it on the STM32F769I-DISCO we got a fairly smooth picture at 25 FPS.



Delighted, we tested this method (switching pointers) on the first board again, but it still did not work at high FPS. As a result, the transparency method (60 FPS) works on one board, and the pointer-switching method (25 FPS) on the other. After discussing the situation, we decided to postpone unification until a deeper study of the graphics stack.



Outcome



So, let's summarize. The example shown represents a simple yet common GUI pattern for microcontrollers: a few buttons, a volume control, and the like. The example lacks any logic associated with the events, since the emphasis was placed on the graphics. Performance-wise, we got quite a decent FPS.



The nuances accumulated while optimizing performance lead to the conclusion that graphics are becoming complicated on modern microcontrollers. Now, just as on large platforms, you need to watch the processor cache, place some things in external memory and others in faster memory, use DMA and DMA2D, track VBLANK, and so on. It has all come to resemble the big platforms, which is perhaps why I have already referred to X Server and Wayland several times.



Perhaps one of the most unoptimized parts is the rendering itself: we redraw the entire scene from scratch, in full. I cannot say how this is done in other libraries for microcontrollers; perhaps somewhere this stage is built into the library itself. But judging by the results of working with Nuklear, it seems that an analogue of X Server or Wayland, of course more lightweight, is needed here, which again leads to the idea that small systems repeat the path of the large ones.



UPD1

In the end, the method with changing transparency was not needed: common code, switching the buffer address on v-sync, now works on both boards. The transparency method is also correct, it is simply unnecessary.



UPD2

I want to say a big thank-you to everyone who suggested triple buffering; we have not gotten to it yet. It is clearly the classic method (especially at high frame rates), which, among other things, would let us get rid of lags caused by waiting for v-sync (i.e. when the software runs noticeably ahead of the picture). We have not run into that yet, but it is only a matter of time. And a special thank-you for the discussion of triple buffering goes to besitzeruf and belav!



Our contacts:



Github: https://github.com/embox/embox

Newsletter: embox-ru [at] googlegroups.com

Telegram chat: t.me/embox_chat


