How I wrote a 4K intro in Rust - and it won

I recently wrote my first 4K intro in Rust and presented it at Nova 2020, where it won first place in the New School Intro Competition. Writing a 4K intro is tricky and requires knowledge of many different areas. Here I will focus on techniques for making Rust code as small as possible.





You can watch the demo on YouTube, download the executable on Pouet, or get the source code from GitHub.



A 4K intro is a demo in which the entire program (including any data) must fit in 4096 bytes or less, so the code has to be as space-efficient as possible. Rust has a bit of a reputation for producing bloated executables, so I wanted to see whether it could produce efficient, compact code.



Configuration



The whole intro is written in a combination of Rust and glsl. Glsl handles the rendering, while Rust does everything else: world creation, camera and object control, instrument creation, music playback, and so on.



The code depends on several features that are not yet available in stable Rust, so I use the nightly toolchain. To install nightly and make it the default, run the following rustup commands:



rustup toolchain install nightly
rustup default nightly


I use Crinkler to compress the object file generated by the Rust compiler.
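A typical Crinkler invocation looks roughly like this; the object file and library names here are illustrative rather than the exact ones from this intro's build:

crinkler.exe /OUT:intro.exe /SUBSYSTEM:WINDOWS intro.obj kernel32.lib user32.lib gdi32.lib opengl32.lib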



I also used a shader minifier to preprocess the glsl shader, making it smaller and more Crinkler-friendly. The shader minifier does not support output to .rs, so I took the raw output and manually copied it into my shader.rs file (in hindsight, I clearly should have automated this step somehow, or even written a pull request for the shader minifier).



The starting point was my previous 4K intro in Rust, which already felt fairly minimal at the time. That article also gives more details on configuring the toml file and on using xargo to compile a tiny binary.



Optimizing the program design to reduce code size



Many of the most effective size optimizations are not clever hacks; they are the result of rethinking the design.



In my original design, one part of the code created the world, including placing the spheres, and another part was responsible for moving them. At some point I realized that the placement code and the movement code were doing very similar things, and that they could be combined into a single, more complex function that does both. Unfortunately, optimizations like this tend to make the code less elegant and less readable.



Assembler Code Analysis



At some point you have to look at the compiled assembly to figure out what the code compiles into and which size optimizations are worthwhile. The Rust compiler has a very useful option, --emit=asm, for outputting assembly code. The following command creates a .s assembly file:



xargo rustc --release --target i686-pc-windows-msvc -- --emit=asm


You don't need to be an assembly expert to benefit from studying the output, but a basic understanding of the syntax definitely helps. The opt-level = "z" option forces the compiler to optimize as aggressively as possible for size, which does make it a little harder to figure out which part of the assembly corresponds to which part of the Rust code.
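For reference, these settings live in the [profile.release] section of the toml file. Something along these lines is typical for size-optimized builds; apart from opt-level = "z", the values below are illustrative rather than copied from my configuration:

[profile.release]
opt-level = "z"      # optimize purely for size
lto = true           # link-time optimization across the whole program
panic = "abort"      # drop the panic unwinding machinery
codegen-units = 1    # better optimization at the cost of compile time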



I've found that the Rust compiler can be surprisingly good at shrinking code, removing unused code and unnecessary parameters. It also does some strange things from time to time, so it is important to study the assembly output regularly.



Features



I worked with two versions of the code: one logs progress and lets me move the camera around to create interesting camera paths, while the other is the actual 4K release. Rust lets you declare this kind of optional functionality as features. Cargo.toml has a [features] section where you declare the available features and their dependencies. My 4K intro's Cargo.toml has the following section:



[features]
logger = []
fullscreen = []


Neither feature has any dependencies, so they effectively act as conditional compilation flags. Conditional blocks of code are marked with a #[cfg(feature = "...")] attribute. Using features doesn't by itself make the code smaller, but it makes development much easier when you can easily switch between different sets of features.



        #[cfg(feature = "fullscreen")]
        {
            // code that is only compiled in when the "fullscreen" feature is enabled
        }

        #[cfg(not(feature = "fullscreen"))]
        {
            // code that is only compiled in when the "fullscreen" feature is disabled
        }


Examining the compiled code confirmed that only the selected features are included.



One of the main uses of features was to enable logging and error checking in the debug build. Loading and compiling the glsl shader would often fail, and without helpful error messages it would have been extremely difficult to track down the problems.
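As a sketch of the idea (not the intro's actual logging code), a feature-gated helper might look like this; with the logger feature enabled the message is printed, and without it both the function body and its string disappear from the binary:

#[cfg(feature = "logger")]
fn log_error(msg: &str) {
    // only present in the debug/logger build
    println!("error: {}", msg);
}

#[cfg(not(feature = "logger"))]
fn log_error(_msg: &str) {
    // release build: no body, no strings in the binary
}

The debug build is then made with cargo's --features logger flag, and the release build simply leaves the feature off.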



Using get_unchecked



I had sort of assumed that placing code inside an unsafe{} block would disable all safety checks, but that is not the case. All the usual checks are still performed there, and they cost precious bytes.



By default, Rust bounds-checks every array access. Take the following Rust code:



    delay_counter = sequence[ play_pos ];


Before the table lookup, the compiler inserts code that checks that play_pos does not index past the end of sequence, and panics if it does. This adds significant size to the code, because such checks can occur in many places.



Let's transform the code as follows:



    delay_counter = *sequence.get_unchecked( play_pos );


This tells the compiler not to do any bounds checking and to just perform the lookup. It is obviously a dangerous operation, so it can only be done inside unsafe code.
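In context, the lookup sits inside an unsafe block, and it is up to the surrounding code to guarantee that play_pos stays within bounds:

    // Safety: play_pos is always kept within the bounds of sequence by the caller.
    delay_counter = unsafe { *sequence.get_unchecked( play_pos ) };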



More efficient loops



Initially all of my loops used the idiomatic Rust syntax for x in 0..10. I assumed this would compile to the tightest possible loop, but surprisingly it does not. The simplest case:



for x in 0..10 {
    // do code
}


will be compiled into assembly code that does the following:



    setup loop variable
loop:
    check the loop variable
    if the loop is done, jump to end
    // loop body
    jump to loop
end:


whereas the following code



let mut x = 0;
loop{
    // do code
    x += 1;
    if x == 10 {
        break;
    }
}


compiles directly to:



    setup loop variable
loop:
    // loop body
    increment and check the loop variable
    if the loop is not done, jump to loop
end:


Note that the condition is now checked at the end of each iteration, which makes the unconditional jump unnecessary. This is a small saving for a single loop, but it really adds up when there are 30 loops in the program.



Another, much harder to pin down problem with Rust's idiomatic loops is that in some cases the compiler added extra iterator setup code that really bloated the binary. I still haven't figured out what triggers this extra iterator setup, since it was always trivial to just replace the for {} construct with a loop {} construct.



Using vector instructions



I spent a lot of time optimizing the glsl code, and one of the best optimizations (which usually also makes the code run faster) is to operate on an entire vector at once rather than on each component in turn.



For example, the ray tracing code uses a fast grid traversal algorithm to determine which parts of the map each ray visits. The original algorithm considers each axis separately, but it can be rewritten so that it handles all axes at the same time and needs no branches. Rust doesn't have a built-in vector type of its own like glsl does, but you can use intrinsics to tell it to use SIMD instructions.



To use the intrinsics, I converted the following code



        global_spheres[ CAMERA_ROT_IDX ][ 0 ] += camera_rot_speed[ 0 ]*camera_speed;
        global_spheres[ CAMERA_ROT_IDX ][ 1 ] += camera_rot_speed[ 1 ]*camera_speed;
        global_spheres[ CAMERA_ROT_IDX ][ 2 ] += camera_rot_speed[ 2 ]*camera_speed;


into this:



        let mut dst: x86::__m128 = core::arch::x86::_mm_load_ps(global_spheres[ CAMERA_ROT_IDX ].as_mut_ptr());
        // scale the rotation speeds by camera_speed for all components at once
        let src: x86::__m128 = core::arch::x86::_mm_mul_ps(
            core::arch::x86::_mm_load_ps(camera_rot_speed.as_ptr()),
            core::arch::x86::_mm_set1_ps(camera_speed));
        dst = core::arch::x86::_mm_add_ps( dst, src );
        // store all four components back; _mm_load_ps/_mm_store_ps expect 16-byte aligned data
        core::arch::x86::_mm_store_ps( global_spheres[ CAMERA_ROT_IDX ].as_mut_ptr(), dst );


which is slightly smaller (and much less readable). Unfortunately, for some reason this broke the debug build, although it worked fine in the release build. Clearly the problem here is with my knowledge of Rust intrinsics, not with the language itself. It is worth spending more time on this when preparing the next 4K intro, since the code size reduction was significant.



Using OpenGL



There are several standard Rust crates for loading OpenGL functions, but by default they all load a very large set of them. Each loaded function takes up space, because the loader needs to know its name. Crinkler is very good at compressing this kind of code, but it can't remove the overhead entirely, so I created my own gl.rs that includes only the OpenGL functions I actually needed.
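As a rough sketch of the idea (the declarations below are illustrative and not copied from my gl.rs), a hand-rolled loader declares only the entry points the intro actually calls, so only those name strings end up in the binary:

// Minimal loader sketch for a Windows build; must run after a GL context is current.
pub type GLenum = u32;
pub type GLuint = u32;

#[link(name = "opengl32")]
extern "system" {
    fn wglGetProcAddress(name: *const i8) -> *const core::ffi::c_void;
}

// One function pointer per extension entry point the intro uses.
static mut CREATE_SHADER_PROGRAMV:
    Option<unsafe extern "system" fn(GLenum, i32, *const *const i8) -> GLuint> = None;

pub unsafe fn load_gl() {
    // Each additional entry point would add its name string to the executable.
    let p = wglGetProcAddress(b"glCreateShaderProgramv\0".as_ptr() as *const i8);
    CREATE_SHADER_PROGRAMV = Some(core::mem::transmute(p));
}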



Conclusion



My main goal was to write a competition-ready 4K intro and to show that Rust is viable for the demoscene and for scenarios where every byte counts and you really need low-level control, an area that has traditionally been reserved for assembly and C. A secondary goal was to use idiomatic Rust as much as possible.



I think I succeeded at the first goal. It never felt like Rust was holding me back, or that I was sacrificing performance or features because I was using Rust instead of C.



The second goal was less successful. There is too much unsafe code that really shouldn't need to be there. Unsafe has a corrosive effect: it is very easy to reach for it to get something done quickly (for example, by using mutable static variables), but as soon as unsafe code appears it breeds more unsafe code, and suddenly it is all over the place. In the future I will be much more careful to use unsafe only when there really is no alternative.


