The genius of RISC-V microprocessors



The wars between RISC and CISC of the late 1990s have long since died down, and today it is widely believed that the difference between RISC and CISC no longer matters. Many people claim that instruction sets are irrelevant.



However, instruction sets do matter. They place restrictions on the kinds of optimizations that can easily be added to a microprocessor.



I recently took a closer look at the RISC-V instruction-set architecture (ISA), and here are some of the things that really impressed me about it:



  1. It is a small and easy-to-learn set of RISC instructions, very preferable for anyone interested in learning about microprocessors.
  2. The base ISA is minimal, with everything else provided through optional extensions.
  3. The ISA was designed from the start with modern implementation techniques such as compressed instructions and macro-operation fusion in mind.
  4. Nothing in the design prevents building high-performance processors around the RISC-V ISA.


The RISC radicals



As I began to understand RISC-V better, I realized that RISC-V is a radical return to what many believed was a bygone era of computing. In terms of design, RISC-V is like a time-machine trip back to the classical Reduced Instruction Set Computer (RISC) of the early '80s and '90s.



In recent years, many have argued that the division into RISC and CISC no longer makes sense: so many instructions, many of them quite complex, have been added to RISC processors such as ARM that at this point they are more hybrids than pure RISC processors. Similar arguments have been made about other RISC processors such as the PowerPC.



RISC-V, on the other hand, is a truly "hardcore" representative of RISC processors. If you read RISC-V discussions online, you will find people claiming that RISC-V was developed by old-school RISC radicals who refuse to keep up with the times.



Former ARM engineer Erin Shepherd wrote an interesting critique of RISC-V a few years ago:



The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing the encoding, etc. This pursuit of minimalism has resulted in false orthogonality (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions, which impacts code density both in terms of size and number of instructions.


Let me briefly give a little context. Small code size gives a performance advantage because it makes it easier to keep the executable code inside the processor's high-speed cache.



The criticism here is that the RISC-V designers have focused too much on providing a small instruction set. After all, this is one of the original goals of RISC.



According to Erin, the consequence of this was that a real program would need many more instructions to complete tasks, that is, it would take up more memory space.



Traditionally, the accepted wisdom has been that more instructions should be added to a RISC processor over the years to make it more CISC-like. The idea is that more specialized instructions can replace sequences of several general-purpose ones.



Instruction Compression and Macro-Operation Fusion



However, two innovations in processor architecture make this strategy of adding ever more complex instructions largely redundant:



  • Compressed Instructions - Instructions are compressed in memory and decompressed in the first stage of the processor.
  • Macro-operation Fusion - Two or more simple instructions are read by the processor and merged into one more complex instruction.


In fact, ARM already uses both of these strategies, and x86 processors use the latter, so RISC-V doesn't do any new tricks here.



However, there is a subtlety here: RISC-V benefits much more from these two strategies for two important reasons:



  1. Compressed instructions were part of the design from the start. Other architectures, such as ARM, thought of this later and had to bolt them on rather hastily.
  2. This is where RISC's obsession with a small number of unique instructions pays off: there is simply more encoding space left over for adding the compressed instructions.


The second point requires some clarification. In RISC architectures, instructions are typically 32 bits wide. Those bits have to encode various pieces of information. Say we have an instruction like this (comments follow the semicolon):



ADD x1, x4, x8    ; x1 ← x4 + x8
      
      





It adds the contents of registers x4 and x8 and stores the result in x1. The number of bits required to encode this instruction depends on the number of registers available. RISC-V and ARM have 32 registers. The number 32 can be expressed with 5 bits:



2⁵ = 32


Since the instruction has to specify three different registers, a total of 15 bits (3 × 5) is needed to encode the operands (the inputs to the addition operation).
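To make the bit budget concrete, here is a small sketch in C of how a 32-bit RISC-V R-type instruction such as ADD packs its fields. The field layout is the standard R-type format from the RISC-V spec; the helper name encode_rtype is my own:

#include <stdint.h>
#include <stdio.h>

// Standard RISC-V R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode.
// The three register fields take 3 x 5 = 15 bits; the remaining 17 bits
// describe the operation itself.
static uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode) {
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

int main(void) {
    // ADD x1, x4, x8 uses opcode 0x33 with funct3 = 0 and funct7 = 0.
    printf("0x%08X\n", (unsigned)encode_rtype(0, 8, 4, 0, 1, 0x33)); // 0x008200B3
    return 0;
}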



Therefore, the more features we want to support in the instruction set, the more of the available 32 bits we use up. Of course, we could move to 64-bit instructions, but that would consume too much memory, and performance would suffer.



By aggressively keeping the number of instructions small, RISC-V leaves more room in the encoding to signal that compressed instructions are in use. If the processor sees a particular pattern in certain bits of an instruction, it knows the instruction must be interpreted as compressed.
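In the real RISC-V encoding the marker is the two lowest bits of an instruction: a full 32-bit instruction always has both of them set to 1, and any other pattern means a 16-bit compressed instruction. A minimal check might look like this (the function name is mine):

#include <stdbool.h>
#include <stdint.h>

// A 32-bit RISC-V instruction has its two lowest bits equal to 11;
// anything else in those bits marks a 16-bit compressed instruction.
// The decoder only needs the first halfword to make the decision.
static bool is_compressed(uint16_t first_halfword) {
    return (first_halfword & 0x3) != 0x3;
}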



This means that instead of fitting one 32-bit instruction into a word, we can fit two instructions of 16 bits each. Naturally, not all RISC-V instructions can be expressed in a 16-bit format, so a subset of the 32-bit instructions is chosen based on usefulness and frequency of use. Where an uncompressed instruction can take 3 operands (inputs), a compressed instruction can only take 2. So the compressed ADD instruction looks like this:



C.ADD x4, x8     ; x4 ← x4 + x8
      
      





RISC-V assembly code uses the prefix C. to indicate that an instruction should be assembled into its compressed form.



Essentially, compressed instructions reduce the number of operands. Three operand registers would take 15 bits, leaving only 1 bit to indicate the operation! With two operands, we have 6 bits left over to indicate the opcode (the operation to be performed).
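This split matches the actual C.ADD encoding: 4 bits of funct4 plus the 2 low marker bits give 6 bits for the operation, while the two 5-bit register fields take the remaining 10 bits. A small sketch under that layout (the helper name is my own):

#include <stdint.h>
#include <stdio.h>

// C.ADD layout: funct4 (4 bits) | rd/rs1 (5 bits) | rs2 (5 bits) | op (2 bits).
// 6 bits describe the operation, the other 10 bits hold the two operands.
static uint16_t encode_c_add(uint16_t rd_rs1, uint16_t rs2) {
    return (uint16_t)((0x9u << 12) | (rd_rs1 << 7) | (rs2 << 2) | 0x2u);
}

int main(void) {
    // C.ADD x4, x8
    printf("0x%04X\n", (unsigned)encode_c_add(4, 8)); // 0x9222
    return 0;
}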



This is actually close to how x86 assembly works, where not enough bits are reserved to use three register operands. Instead, x86 spends bits on other things, such as allowing an ADD instruction to read its input from memory as well as from registers.



However, the true benefit comes from combining instruction compression with macro-operation fusion. When the processor receives a 32-bit word containing two compressed 16-bit instructions, it can merge them into one more complex internal instruction.



Sounds like nonsense - are we back to where we started?



No, because we bypass the need to fill the ISA specification with lots of complex instructions (the strategy ARM follows). Instead, we express a whole host of complex instructions indirectly, through various combinations of simple ones.



Under normal circumstances, macro-op fusion would create a problem: although two instructions are replaced by one, they still take up twice as much memory. With instruction compression, however, we don't take up any extra space, so we get the benefits of both approaches.



Let's look at one of the examples Erin Shepherd gives. In her critique of the RISC-V ISA she shows a simple function in C. To make it clearer, I have taken the liberty of rewriting it:



int get_index(int *array, int i) { 
    return array[i];
}
      
      





On x86, this will compile to the following assembly code:



mov eax, [rdi+rsi*4]
ret
      
      





When a function is called, its arguments are usually passed in registers according to a convention that depends on the instruction set used. On x86, the first argument is placed in the register rdi and the second in rsi. By convention, the return value must be placed in eax.



The first instruction multiplies the contents of rsi by 4; rsi holds the variable i. Why multiply? Because array consists of integer elements spaced 4 bytes apart. The element at index 3, for example, sits at byte offset 3 × 4 = 12. We then add this to rdi, which holds the base address of array. This gives us the final address of the i-th element of array. We read the contents of the memory cell at that address and store it in eax: job done.



On ARM, everything happens in a similar way:



LDR r0, [r0, r1, lsl #2]
BX  lr                    ;return
      
      





Here we are not multiplying by 4 but shifting the register r1 left by 2 bits, which is equivalent to multiplying by 4. This is probably also a more accurate description of what happens on x86: I doubt the addressing mode can scale by anything that is not a power of 2, since general multiplication is a fairly expensive operation, while shifting is cheap and easy.
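A quick check of that equivalence in C:

#include <stdio.h>

int main(void) {
    int i = 3;
    // Shifting left by 2 bits multiplies by 2^2 = 4, which is why the
    // compiler emits a cheap shift instead of a general multiply here.
    printf("%d %d\n", i << 2, i * 4); // prints: 12 12
    return 0;
}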



The rest you can work out from my description of the x86 version. Now let's get to RISC-V, where the real fun begins! (Comments start with a semicolon.)



SLLI a1, a1, 2     ; a1 ← a1 << 2
ADD  a0, a0, a1    ; a0 ← a0 + a1
LW   a0, 0(a0)     ; a0 ← [a0 + 0]
RET                ; return
      
      





On RISC-V, the registers a0 and a1 are simply aliases for x10 and x11. They are where the first and second arguments of a function call are placed. RET is a pseudo-instruction, shorthand for:



JALR x0, 0(ra)     ; pc ← ra + 0
                   ; x0 ← pc + 4   (return address, discarded since x0 is always zero)
      
      





JALR jumps to the address stored in ra, which holds the return address. ra is an alias for x1.



And it all looks pretty terrible, right? Twice as many instructions for a simple and common operation such as looking up an element in a table by index and returning the result.



It really looks bad. This is why Erin Shepherd was extremely critical of the design decisions made by the RISC-V developers. She writes:



RISC-V's simplifications make the decoder (i.e. the CPU front end) simpler, but at the cost of executing more instructions. However, scaling the width of the pipeline is a hard problem, while decoding slightly (or highly) irregular instructions is well understood (the main difficulties arise when determining the length of an instruction is nontrivial: x86 is a particularly bad case because of its numerous prefixes).


However, thanks to instruction compression and macro-op fusion, we can improve the situation.



C.SLLI a1, 2      ; a1 ← a1 << 2
C.ADD  a0, a1     ; a0 ← a0 + a1
C.LW   a0, 0(a0)  ; a0 ← [a0 + 0]
C.JR   ra         ; return
      
      





Now instructions take up exactly the same amount of memory space as the example for ARM.



Okay, now let's apply macro-op fusion!



One of the conditions under which RISC-V allows two operations to be fused into one is that their destination registers match. That condition is met for the ADD and LW (load word) instructions here, so the processor can merge them into a single internal instruction.
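As a toy illustration of this condition, here is a sketch in C of how a decoder might check whether an ADD followed by an LW that write the same destination register can be fused into a single internal load. The struct and the function name are my own; they do not come from any real RISC-V implementation:

#include <stdbool.h>
#include <stdint.h>

// A deliberately simplified view of a decoded instruction.
typedef enum { OP_ADD, OP_LW, OP_OTHER } Opcode;

typedef struct {
    Opcode  op;
    uint8_t rd;   // destination register
    uint8_t rs1;  // first source register (base address for LW)
    uint8_t rs2;  // second source register (unused by LW)
    int32_t imm;  // immediate offset (used by LW)
} Instr;

// ADD rd, rs1, rs2 followed by LW rd, imm(rd) computes rs1 + rs2 and then
// loads from that address into the same rd. Because the destination
// registers match and the load addresses through the ADD result, the pair
// behaves like one "load from rs1 + rs2 + imm" operation and can be fused.
static bool can_fuse_add_lw(const Instr *first, const Instr *second) {
    return first->op == OP_ADD &&
           second->op == OP_LW &&
           first->rd == second->rd &&
           second->rs1 == first->rd;
}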



If this condition were also met for SLLI, we could merge all three instructions into one. The processor would then see something resembling the more complex ARM instruction:



LDR r0, [r0, r1, lsl #2]
      
      





But why couldn't we write this complex macro operation directly in the code?



Because our ISA doesn't support such an operation! Remember, we have a limited number of bits available. So why not just make the instructions longer? No: that would consume too much memory and overflow the precious processor cache faster.



However, if we instead produce these long, semi-complex instructions inside the processor, there is no problem. A processor never deals with more than a few hundred instructions at a time, so spending, say, 128 bits on each of them causes no difficulties. There is enough silicon for that.



When the decoder receives an ordinary instruction, it usually turns it into one or more micro-operations. These micro-operations are the instructions the processor actually works with. They can be very wide and carry a lot of extra useful information. Calling them "micro" may sound ironic given that they are wider; in reality, "micro" refers to the limited amount of work each one performs.



Macro-operation fusion turns the job of the decoder upside down a bit: instead of turning one instruction into several micro-operations, we take several instructions and turn them into one micro-operation.



That is, what is happening in a modern processor may look rather strange:



  1. First, two instructions are combined into one 32-bit word using compression.
  2. The processor then splits them back into two instructions while decoding.
  3. It then combines them into one operation again using macro-op fusion.


Other instructions, meanwhile, are split into several micro-operations rather than merged. Why are some instructions merged while others are split? Is there method to this madness?



A key aspect of the transition to micro-operations is the desired level of complexity:



  • Not too complex, because otherwise a micro-operation will not be able to complete within the fixed number of clock cycles allotted to it.
  • Not too simple, because otherwise we simply waste processor resources: executing two micro-operations takes twice as long as executing one.


It all started with CISC processors. Intel began splitting its complex CISC instructions into micro-operations so that they could flow through processor pipelines more like RISC instructions. However, in later designs the developers realized that many CISC instructions could instead be fused into a single, moderately complex micro-operation. With fewer operations to execute, the work finishes sooner.



Benefits obtained



We have gone through a lot of details, so by now it may be hard to see what the point of all this work is. What are all this compression and fusion for? It looks like a lot of unnecessary effort.



First, instruction compression is nothing like zip compression. The word "compression" is a little misleading, because compressing or decompressing an instruction is trivial and effectively instantaneous. No time is lost on it.



The same applies to macro-operation fusion. Although the process may sound complicated, similar mechanisms are already used in modern microprocessors, so the cost of this added complexity has already been paid.



However, unlike the designers of ARM, MIPS, and x86, the creators of RISC-V knew about instruction compression and macro-op fusion when they started designing their ISA. By running various tests with their first minimal instruction set, they made two important discoveries:



  1. RISC-V programs typically take up about the same amount of memory as, or less than, programs for any other processor architecture, including x86, which is supposed to be memory-efficient given that it is a CISC ISA.
  2. RISC-V needs to execute fewer micro-operations than other ISAs.


In fact, by designing the base instruction set with fusion in mind, they were able to fuse enough instructions that, for any given program, the processor has to perform fewer micro-operations than competing processors.



This prompted the RISC-V development team to double down on macro-operation fusion as a fundamental RISC-V strategy. The RISC-V manual contains many notes about which operations can be fused. It also includes adjustments made specifically so that instruction sequences occurring in common patterns are easier to fuse.



A small ISA is easier for students to learn, and it makes it easier for a student of processor architecture to design their own processor that runs RISC-V instructions. It is worth remembering that both instruction compression and macro-op fusion are optional.



RISC-V has a small base instruction set that must be implemented; all other instructions are provided as parts of extensions. Compressed instructions, for example, are just an optional extension.



Macro-op fusion is purely an optimization: it does not change observable behavior, so it does not have to be implemented in your own RISC-V processor.



RISC-V design strategy



RISC-V takes everything we know today about modern processors and uses that knowledge in designing the ISA. For example, we know that:



  • Today's processor cores have sophisticated branch prediction.
  • Processor cores are superscalar, meaning they execute multiple instructions in parallel.
  • To keep a superscalar core busy, out-of-order execution is used.
  • They are pipelined.


This means that features such as conditional execution, which ARM supports, are no longer required. ARM's support for this feature eats bits out of the instruction format; RISC-V can save those bits.



Conditional execution was originally introduced to avoid branches, because branches are bad for pipelines. To run fast, a processor usually fetches upcoming instructions in advance, so that as soon as the previous instruction leaves the first stage of the pipeline, the next one can be picked up.



With a conditional branch, we cannot know in advance where the next instruction will come from when we start filling the pipeline. However, a superscalar processor can simply execute both branches in parallel.



This is also why RISC-V has no status register: status flags create dependencies between instructions. The more independent each instruction is, the easier it is to execute it in parallel with others.



Basically, the RISC-V strategy is to make the ISA, and a minimal RISC-V implementation, as simple as possible, while avoiding any design decisions that would make it impossible to build a high-performance processor.





