Compiling C/C++ on Apple M1

Intrigued by the M1's impressive benchmarks, I got hold of the latest Mac Mini to measure how fast it compiles C/C++.

The benchmark is a local build of build2 (without a package repository), which consists mainly of C++ code (611 translation units) with some C translation units (29) and link steps in between (19). This benchmark requires only a C++ compiler and is included in the Phoronix Test Suite, so the results can be compared against a large number of processors.

The Phoronix benchmark currently uses build2 0.12.0, while we use 0.13.0 (the current release), which builds about 10% slower.

After setting up macOS and installing the Command Line Tools for Xcode 12.2, we have everything we need:

$ clang++ --version
Apple clang version 12.0.0 (clang-1200.0.32.27)
Target: arm64-apple-darwin20.1.0

Judging by _LIBCPP_VERSION in the __version header of libc++, this version of Apple Clang diverged from vanilla Clang somewhere during the 10.0.0 development cycle.

You may also have noticed that the CPU name in Apple Clang's target triplet differs from the standard aarch64. In fact, config.guess reports the following:

$ ./config.guess
aarch64-apple-darwin20.1.0
    
To avoid using two names for the same thing, build2 canonicalizes arm64 to aarch64, so in buildfiles we always see aarch64.
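
The idea of mapping arm64 to aarch64 can be illustrated with a few lines of shell. This is only a minimal sketch, not build2's actual implementation: the function name canonicalize_triplet is hypothetical, and it assumes that only the leading CPU field of the triplet needs rewriting.

```shell
# Hypothetical helper (not part of build2): rewrite the CPU field of a
# target triplet from arm64 to aarch64 and leave everything else alone.
canonicalize_triplet() {
  case "$1" in
    arm64-*) printf 'aarch64-%s\n' "${1#arm64-}" ;;  # strip the arm64- prefix, add aarch64-
    *)       printf '%s\n' "$1" ;;                   # any other triplet is passed through
  esac
}

canonicalize_triplet arm64-apple-darwin20.1.0    # prints aarch64-apple-darwin20.1.0
canonicalize_triplet x86_64-apple-darwin20.1.0   # printed unchanged
```

With such a mapping, the triplet reported by Apple Clang and the one reported by config.guess end up identical.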

Let's check the number of hardware threads with sysctl:

$ sysctl -n hw.ncpu
8

There are 8 hardware threads: 4 performance cores and 4 energy-efficient ones. In the first run we use all of the cores, which unsurprisingly gives the best result:

$ time sh ./build2-install-0.13.0.sh --local --yes ~/install
163s

It was a pleasant surprise that build2 0.13.0 worked without any problems, even though it was released before the M1. Since ARM has weak memory ordering, this also served as an additional test of build2's multithreaded implementation and its heavy use of atomics.

First, let's compare the M1 to my workstation with an 8-core Intel Xeon E-2288G (essentially an i9-9900K plus ECC). The same build with vanilla Clang takes 131 seconds there. Although that is the better result, the M1's performance is still impressive, especially when you consider that during compilation the workstation literally spews hot air and roars like an airplane, while the M1 hums quietly with a barely noticeable stream of warm air.

A single-threaded run approximates CPU performance in incremental builds:

$ time sh ./build2-install-0.13.0.sh --local --yes -j 1 ~/install
691s

A single E-2288G core takes 826 seconds. So the 5GHz Xeon core is actually slower than the 3.2GHz M1 core.

Another interesting result is a four-thread run, which uses only the M1's performance cores:

$ time sh ./build2-install-0.13.0.sh --local --yes -j 4 ~/install
207s

Although it is somewhat slower than the eight-thread run, it uses less memory. This option therefore makes sense on systems with limited RAM (as on all current M1 machines).

Here is a summary of all the results:

CPU      CORES/THREADS  TIME
----------------------------
E-2288G  8/16           131s
M1       4+4            163s
M1       4              207s
M1       1              691s
E-2288G  1              826s

Clearly, in many respects this is an apples-to-oranges comparison (workstation versus mobile device, old design and process technology versus modern, and so on).
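
To put the measurements into perspective, the ratios between them can be computed with a short shell sketch. The helper name ratio is hypothetical and exists only for this illustration; the inputs are the times quoted in this article.

```shell
# Ratio of two wall-clock times, rounded to two decimals.
# ratio is a hypothetical helper name used only for this illustration.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 691 163   # M1, 1 thread vs all 4+4 cores: prints 4.24
ratio 691 207   # M1, 1 thread vs 4 threads:     prints 3.34
ratio 826 691   # single core, E-2288G vs M1:    prints 1.20
ratio 163 131   # all cores, M1 vs E-2288G:      prints 1.24
```

In other words, a single M1 core is about 20% faster than a single E-2288G core in this build, while with all cores in use the Xeon finishes about 24% sooner.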

Now let's add some interesting results from the Phoronix benchmark. In particular, it makes sense to include the results for the latest workstation and mobile processors from Intel and AMD. Here is my selection (you can make your own, just remember to add an extra 10% to the Phoronix results; also note that most of those tests use GCC instead of Clang):

CPU                        CORES/THREADS  TIME
----------------------------------------------
AMD Threadripper 3990X     64/128         56s
AMD Ryzen 5950X            16/32          71s
Intel Xeon E-2288G         8/16           131s
Apple M1                   4+4            163s
AMD Ryzen 4900HS           8/16           176s*
Apple M1                   4              207s
AMD Ryzen 4700U            8/8            222s
Intel Core 1185G           4/8            281s*
Intel Core 1165G           4/8            295s

* Extrapolated. Results for the best mobile Intel (1185G) and AMD (4900HS) processors are unfortunately not yet available, so the figures quoted are extrapolated from clock speeds and other benchmarks.
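
If you pick your own Phoronix results, the 10% adjustment mentioned above can be applied with a small shell sketch like the one below. The helper name to_0_13 is hypothetical, the 1.10 factor is just the approximate slowdown of build2 0.13.0 relative to 0.12.0 quoted in this article, and the input value is made up for illustration.

```shell
# Scale a Phoronix (build2 0.12.0) time to approximate build2 0.13.0,
# using the roughly 10% slowdown mentioned in the text.
# to_0_13 is a hypothetical helper name.
to_0_13() {
  awk -v t="$1" 'BEGIN { printf "%.0f\n", t * 1.10 }'
}

to_0_13 100   # illustrative input, not a measured result: prints 110
```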

It's easy to see from the table above that the Apple M1 is an impressive processor, especially when power consumption is taken into account. Moreover, it is the first mainstream desktop-grade ARM processor. For comparison, the same build on a Raspberry Pi 4B takes 1724 seconds, which is more than 10 times slower! Although we cannot boot Linux or Windows on it directly, there is some evidence that they run in virtual machines with decent performance. As a result, ARM-based continuous build pipelines may become standard.

Having seen the M1's benchmarks, one cannot help wondering how Apple did it. Although there is a lot of speculation, with some elements of black magic and witchcraft, this AnandTech article about the M1 (and another one linked from it) seemed to me quite a good source of technical information. Highlights:

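TSMC 5nm process: compared with Intel's 10nm (11x5G; 14nm for the E-2288G) and AMD/TSMC's 7nm.

Fast LPDDR4-4266 RAM: compared with the memory used by Intel and AMD processors.

L1 cache: the M1 has unusually large L1 caches.

L2 cache: unlike Intel and AMD, which pair a per-core L2 with a shared L3, the M1 relies on a shared L2.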

The M1 has an unusually wide core that executes many instructions in parallel and/or out of order. There is speculation that ARM's weak memory ordering and fixed-size instruction encoding are what allowed Apple to make the core so much wider.

It would also be interesting to see how Apple can scale this design to more cores.
