Ice Lake AVX-512 Downclocking



This is a short post investigating AVX2 and AVX-512 license-based downclocking on Intel's new Ice Lake chips.



License-based downclocking [1] is a little-known effect in which frequency limits lower than nominal are imposed when certain SIMD instructions are executed, especially heavy floating point instructions or 512-bit wide instructions.



You can read more about this kind of downclocking in this answer on Stack Overflow, and we have already covered the low-level mechanics of these transitions in some detail. You can also find guidelines on how to take advantage of wide SIMD (Single Instruction, Multiple Data: a type or extension of an instruction set architecture, such as Intel AVX or ARM NEON, that performs multiple identical operations on elements packed into a SIMD register) with this problem in mind [2].



The linked material was written in the context of Skylake-SP (SKX: Intel's Skylake server architecture, which includes Skylake-SP, Skylake-X and Skylake-W), the first generation of chips to support AVX-512.



So what is the situation with Ice Lake, the newest chips, which support both the AVX-512 instructions from SKX and a whole set of new AVX-512 instructions? Will we be forced to admire these new instructions from afar, never able to use them because of downclocking?



Read on to find out, or skip straight to the Summary section.



AVX-Turbo



We will use the avx-turbo utility to measure how frequency depends on the number of active cores and on the instruction set in use. The tool is simple: it runs a given mix of instructions on a given number of cores and measures the frequency achieved during the test.



For example, the avx256_fma_t test, which measures the cost of heavy 256-bit instructions with high ILP (instruction-level parallelism: the amount of parallelism available between instructions on a superscalar processor), executes the following FMA sequence:



	vfmadd132pd ymm0,ymm10,ymm11
	vfmadd132pd ymm1,ymm10,ymm11
	vfmadd132pd ymm2,ymm10,ymm11
	vfmadd132pd ymm3,ymm10,ymm11
	vfmadd132pd ymm4,ymm10,ymm11
	vfmadd132pd ymm5,ymm10,ymm11
	vfmadd132pd ymm6,ymm10,ymm11
	vfmadd132pd ymm7,ymm10,ymm11
	vfmadd132pd ymm8,ymm10,ymm11
	vfmadd132pd ymm9,ymm10,ymm11
	; repeat 10x for a total of 100 FMAs


In total, we use five tests to cover each combination of light and heavy 256-bit and 512-bit instructions, plus scalar instructions (128-bit SIMD behaves the same as scalar), using this command line:



./avx-turbo --test=scalar_iadd,avx256_iadd,avx512_iadd,avx256_fma_t,avx512_fma_t


Ice Lake Results



I ran avx-turbo as described above on an Ice Lake i5-1035G4, a mid-range Ice Lake client processor that runs at up to 3.7 GHz. The full results are available in the gist; here I present the most important ones, the frequencies achieved (all values are in GHz):



Instruction set      Active cores
                     1      2      3      4
Scalar / 128-bit     3.7    3.6    3.3    3.3
Light 256-bit        3.7    3.6    3.3    3.3
Heavy 256-bit        3.7    3.6    3.3    3.3
Light 512-bit        3.6    3.6    3.3    3.3
Heavy 512-bit        3.6    3.6    3.3    3.3


As expected, the maximum frequency drops as the number of active cores increases, but scan down each column to see the effect of the instruction category. Along this axis there is almost no downclocking! Only with a single active core is there any decrease for wide instructions, and it is a measly 100 MHz: from 3,700 MHz to 3,600 MHz when any 512-bit instructions are used.



In every other case, including any time more than one core is active and any use of heavy 256-bit instructions, license-based downclocking is zero: everything runs as fast as scalar code.
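The column-by-column comparison described above can be sketched in a few lines of Python. This is only an illustration: the frequencies are copied from the results table, and the names `FREQ_GHZ` and `license_drop_mhz` are hypothetical, not part of avx-turbo.

```python
# Frequencies in GHz from the results table, indexed by active core count (1..4).
FREQ_GHZ = {
    "Scalar / 128-bit": [3.7, 3.6, 3.3, 3.3],
    "Light 256-bit":    [3.7, 3.6, 3.3, 3.3],
    "Heavy 256-bit":    [3.7, 3.6, 3.3, 3.3],
    "Light 512-bit":    [3.6, 3.6, 3.3, 3.3],
    "Heavy 512-bit":    [3.6, 3.6, 3.3, 3.3],
}


def license_drop_mhz(freqs=FREQ_GHZ, baseline="Scalar / 128-bit"):
    """Return, per category, the drop in MHz relative to the scalar row
    for each active core count (hypothetical helper)."""
    base = freqs[baseline]
    return {
        name: [round((b - f) * 1000) for b, f in zip(base, row)]
        for name, row in freqs.items()
    }


if __name__ == "__main__":
    for name, drops in license_drop_mhz().items():
        print(f"{name:18s}", drops)
```

Comparing each row against the scalar row within the same column isolates the effect of the instruction category from the ordinary multi-core turbo drop.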



Types of licenses



There is another change here as well. The SKX architecture has three licenses, or categories of instructions with respect to downclocking: L0, L1 and L2. Here, on client ICL, there are only two [3], and they do not line up exactly with the three categories on SKX.



Licenses in SKX correspond to the width and weight of the instructions as follows:



Width           Light   Heavy
Scalar / 128    L0      L0
256             L0      L1
512             L1      L2


In particular, note that heavy 256-bit instructions fall under the same license as light 512-bit instructions.



On client ICL, the scheme is as follows:



Width           Light   Heavy
Scalar / 128    L0      L0
256             L0      L0
512             L1      L1


Here, heavy 256-bit and light 512-bit instructions fall into different categories! In fact, the distinction between light and heavy instructions does not seem to apply at all: the categorization depends entirely on width [4].
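The two mappings above can be written down as small lookup tables, which makes the difference easy to check mechanically. This is a minimal sketch: the names `SKX_LICENSE`, `ICL_CLIENT_LICENSE` and `depends_on_weight` are hypothetical, and the ICL table reflects only what was observed on this one chip.

```python
# License per (width in bits, weight), transcribed from the two tables above.
# Scalar code is grouped with 128-bit SIMD, as in the tables.
SKX_LICENSE = {
    (128, "light"): "L0", (128, "heavy"): "L0",
    (256, "light"): "L0", (256, "heavy"): "L1",
    (512, "light"): "L1", (512, "heavy"): "L2",
}

ICL_CLIENT_LICENSE = {
    (128, "light"): "L0", (128, "heavy"): "L0",
    (256, "light"): "L0", (256, "heavy"): "L0",
    (512, "light"): "L1", (512, "heavy"): "L1",
}


def depends_on_weight(table):
    """True if, for any width, light and heavy instructions map to different licenses."""
    widths = {width for width, _ in table}
    return any(table[(w, "light")] != table[(w, "heavy")] for w in widths)


if __name__ == "__main__":
    print("SKX depends on weight:", depends_on_weight(SKX_LICENSE))
    print("Client ICL depends on weight:", depends_on_weight(ICL_CLIENT_LICENSE))
```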



So what?



So what does all this mean?



At the very least, this means we need to adjust our mental model of the frequency cost of AVX-512 instructions. Rather than saying that they "usually cause significant downclocking", we can say that on this Ice Lake chip AVX-512 causes little or no license-based downclocking, and I expect this holds for other Ice Lake client chips as well.



However, this revision of our expectations comes with an important caveat: license-based downclocking is not the only source of downclocking. We may also run into power, thermal or current limits. Some configurations can only run wide SIMD instructions on all cores for a short time before exceeding their running power limits. In my case, the $250 laptop I was testing on is extremely poorly cooled, and rather than power limits I hit the thermal limit (100 °C) within seconds of running heavy instructions on all cores.



However, these other limits are qualitatively different from license-based limits. They mostly [5] apply on a pay-for-what-you-use basis: if you use a wide or heavy instruction or two, you incur only a microscopic increase in power or heat, attributable to those instructions alone. This is unlike some license-based transitions, in which a core-wide or chip-wide frequency change occurs and significantly affects subsequent execution that has nothing to do with those instructions.



Since wide operations usually cost less power than the equivalent set of narrower operations [6], you can tell up front that a wide operation is worth it, at least in cases that scale well with width. In any case, the problem is mostly local: it does not depend on the behavior of neighboring code.



Summary



Here are my conclusions.



  • The Ice Lake i5-1035G4 shows only 100 MHz of license-based downclocking, and only with one active core when executing 512-bit instructions.
  • In all other cases, there is no license-based downclocking.
  • The all-core turbo frequency for 512-bit instructions is 3.3 GHz, which is 89% of the maximum single-core scalar frequency (3.7 GHz), so within power and thermal limits this chip has a very "flat" frequency profile.
  • Unlike SKX, this Ice Lake chip does not distinguish between "light" and "heavy" instructions for frequency scaling: FMA operations behave the same as lighter operations.
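The 89% figure in the list above is simply the ratio of the two measured frequencies. A quick check, using a hypothetical helper name:

```python
def all_core_ratio(all_core_ghz, single_core_ghz):
    """Ratio of the all-core frequency to the single-core peak (hypothetical helper)."""
    return all_core_ghz / single_core_ghz


# 512-bit on all four cores (3.3 GHz) versus scalar on one core (3.7 GHz).
ratio = all_core_ratio(3.3, 3.7)
print(f"{ratio:.0%}")  # prints 89%
```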


In other words, there is no need to fear downclocking on client ICL. Only time will tell whether this also holds for server ICL.



Discussions and communication



This post can be discussed on Hacker News.



If you have questions or other feedback, you can leave a comment on the original post. I would also be interested in results from other ICL chips, such as the i3 and i7 variants: let me know if you have one and we can collect the results.






Notes



  1. I am already tired of constantly repeating "license-based downclocking", so I will often just say "downclocking"; it should be understood that I mean the license-based kind and not other types of frequency throttling.
  2. Note that Daniel has written about this at greater length, and more than once.

  3. : , - ( ) , , .
  4. , , ICL FMA : 512- . , 256- : - 2x256- FMA , , 1x512- FMA . , , 512- .
  5. , , , , , . , , , vzeroupper vzeroall.
  6. For example, one 512-bit integer addition will usually cost less energy than the two 256-bit operations needed to compute the same result, because the execution overheads do not grow linearly with width (and those overheads include almost everything except the execution itself).


