We are writing a specification for Nvidia Kepler (CUDA binaries, language version sm_30) for Ghidra

For common processor languages, quite a lot of specifications have already been written for Ghidra, but nothing for graphical ones. It is understandable, because it has its own specifics: predicates, constants through which parameters are passed, including other things inherited from shaders. In addition, the format used to store the code is often proprietary, and you need to reverse it yourself.



In this article, we will figure out what's what for two examples.



The first program is the simplest axpy (analog of hello world for GPGPU). The second helps to understand the implementation of conditions and jumps on the GPU, because everything is different there.



All Nvidia languages ​​use little endian encoding, so immediately copy the bytes from the hex editor to some notepad (for example, Notepad ++) in the opposite order of 8 pieces (the length of the instructions here is constant). Then, through the programmer's calculator (the one from Microsoft is suitable), we translate it into binary code. Next, we look for matches, compose the instruction mask, then the operands. To decode and search for a mask, the hex editor and the cuobjdump disassembler were used, sometimes an assembler is required, as in AMDGPU (because the disassembler is not available there, but this is a topic for a separate article). It works like this: try sequentially inverting all the suspicious bits in the calculator, then we get a new hexadecimal value for the bytes, we substitute them in a binary compiled via nvcc or assembler if it exists, which is not always the case.Then through cuobjdump we check.



I post the source code in the format (mostly in C, without pluses and OOP for closer connection with the machine GPU code), then disasm + bytes at once, because it's more convenient, they just don't need to be swapped.



We copy it to axpy.cu and compile it via cmd: nvcc axpy.cu --cubin --gpu-architecture sm_30

The resulting ELF file named axpy.cubin is unassembled in the same place: cuobjdump axpy.cubin -sass



Example 1:



__global__ void axpy(float param_1, float* param_2, float* param_3) {
unsigned int uVar1 = threadIdx.x;
param_2[uVar1] = param_1 * param_3[uVar1];
}


Dump
/*0000*/
/* 0x22c04282c2804307 */
/*0008*/ MOV R1, c[0x0][0x44];
/* 0x2800400110005de4 */
/*0010*/ S2R R0, SR_TID.X;
/* 0x2c00000084001c04 */
/*0018*/ MOV32I R5, 0x4;
/* 0x1800000010015de2 */
/*0020*/ ISCADD R2.CC, R0, c[0x0][0x150], 0x2;
/* 0x4001400540009c43 */
/*0030*/ LD.E R2, [R2];
/* 0x8400000000209c85 */
/*0038*/ ISCADD R4.CC, R0, c[0x0][0x148], 0x2;
/* 0x4001400520011c43 */
/*0040*/
/* 0x20000002e04283f7 */
/*0048*/ IMAD.U32.U32.HI.X R5, R0, R5, c[0x0][0x14c];
/* 0x208a800530015c43 */
/*0050*/ FMUL R0, R2, c[0x0][0x140];
/* 0x5800400500201c00 */
/*0058*/ ST.E [R4], R0;
/* 0x9400000000401c85 */
/*0060*/ EXIT;
/* 0x8000000000001de7 */
/*0068*/ BRA 0x68;
/* 0x4003ffffe0001de7 */
/*0070*/ NOP;
/* 0x4000000000001de4 */
/*0078*/ NOP;
/* 0x4000000000001de4 */




Decompilation result
void axpy(float param_1,float *param_2,float *param_3) {
  uint uVar1;
  
  uVar1 = *&threadIdx.x;
  param_2[uVar1] = param_3[uVar1] * param_1;
  return;
}




Example 2:



__global__ void predicates(float* param_1, float* param_2) {
    unsigned int uVar1 = threadIdx.x + blockIdx.x * blockDim.x;
    if ((uVar1 > 5) & (uVar1 < 10)) param_1[uVar1] = uVar1;
    else param_2[uVar1] = uVar1;
}


Dump
/*0000*/
/* 0x2272028042823307 */
/*0008*/ MOV R1, c[0x0][0x44];
/* 0x2800400110005de4 */
/*0010*/ S2R R0, SR_TID.X;
/* 0x2c00000084001c04 */
/*0018*/ S2R R3, SR_CTAID.X;
/* 0x2c0000009400dc04 */
/*0020*/ IMAD R0, R3, c[0x0][0x28], R0;
/* 0x20004000a0301ca3 */
/*0028*/ MOV32I R3, 0x4;
/* 0x180000001000dde2 */
/*0030*/ IADD32I R2, R0, -0x6;
/* 0x0bffffffe8009c02 */
/*0038*/ I2F.F32.U32 R4, R0;
/* 0x1800000001211c04 */
/*0040*/
/* 0x22c042e04282c2c7 */
/*0048*/ ISETP.GE.U32.AND P0, PT, R2, 0x4, PT;
/* 0x1b0ec0001021dc03 */
/*0050*/ @P0 ISCADD R2.CC, R0, c[0x0][0x148], 0x2;
/* 0x4001400520008043 */
/*0058*/ @P0 IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x14c];
/* 0x208680053000c043 */
/*0060*/ @P0 ST.E [R2], R4;
/* 0x9400000000210085 */
/*0068*/ @P0 EXIT;
/* 0x80000000000001e7 */
/*0070*/ ISCADD R2.CC, R0, c[0x0][0x140], 0x2;
/* 0x4001400500009c43 */
/*0078*/ MOV32I R3, 0x4;
/* 0x180000001000dde2 */
/*0080*/
/* 0x2000000002e04287 */
/*0088*/ IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x144];
/* 0x208680051000dc43 */
/*0090*/ ST.E [R2], R4;
/* 0x9400000000211c85 */
/*0098*/ EXIT;
/* 0x8000000000001de7 */
/*00a0*/ BRA 0xa0;
/* 0x4003ffffe0001de7 */
/*00a8*/ NOP;
/* 0x4000000000001de4 */
/*00b0*/ NOP;
/* 0x4000000000001de4 */
/*00b8*/ NOP;
/* 0x4000000000001de4 */




Decompilation result
void predicates(float *param_1,float *param_2) {
  uint uVar1;
  
  uVar1 = *&blockIdx.x * (int)_DAT_constants_00000028 + *&threadIdx.x;
  if (uVar1 - 6 < 4) {
    param_1[uVar1] = (float)uVar1;
    return;
  }
  param_2[uVar1] = (float)uVar1;
  return;
}




It is not hard to guess that the tests were originally designed for machine code so that the compiler had nothing to optimize. For everything else, you would have to manually cancel the optimizations. In complex examples, this may not be possible at all, so for such cases you will have to rely on the decompiler and the frontend.



In general, the rule is this - for testing the frontend, we take any simple (with a minimum of possible optimizations) first suitable (reproducing errors) example. For the rest, the decompiled code will already have optimizations (or only somehow correct it through refactoring). But for now, the main task is at least just correct code that does the same thing as machine code. This is Software Modeling. "Software modeling" itself does not imply refactoring, translation of C to C ++, restoration of classes, and even more so such things as identification of templates.



Now we are looking for patterns for mnemonics, operands and modifiers.



To do this, compare the bits (in binary representation) between suspicious instructions (or strings, if it is more convenient to call them that way). You can also use what other users post in their questions on stackoverflow like "help me understand binary / sass / machine code", use tutorials (including in Chinese) and other resources. So, the main operation number is stored in bits 58-63, but there are additional bits 0-4 (they distinguish between the instructions “I2F”, “ISETP”, “MOV32I”), somewhere in their place 0-2 (to neglect 3- 4 bits in blank instructions, they are marked as "UNK" in the specification).



For registers and constant numbers, you can experiment with the disassembler in order to find all the bits that affect the dump output, like the one laid out under the spoiler. All the fields that I managed to find are in the specification on Github, file CUDA.slaspec, section token.



Then you need to come up with addresses for the registers, again they are on Github. This is necessary because at the micro level, Sleigh registers registers as global variables in space with the type “register_space”, but since their space is not marked as "inferable" (and most likely it cannot be), then in the decompiler they become either local variables (most often with the "Var" interfix, but sometimes there was also a "local" prefix), or parameters (the " param_ "). SP never came in handy, it is needed mostly formally to ensure the decompiler is working. A PC (something like IP from x86) is needed for emulation.



Then there are predicate registers, something like flags, but more of a “general purpose” than for a predetermined goal, such as overflow, (non) equal to zero, etc.

Then, a lock register for modeling a bunch of instructions ISCADD .CC and IMAD.HI, because the first of them in my implementation calculates for itself and for the second, in order to avoid transferring part of the sum to the high 4 bytes, because this will mess up decompilation. But then you need to lock the next register until the IMAD.HI operation is completed. Something similar, i.e. the discrepancy between the official documentation and the expected output of the decompiler was already in the SPU for the same Ghidra.



Then there are special registers, which are so far implemented through cpool. In the future, I plan to replace them with the default symbols for some "inferable" space. These are the same threadIdx, blockIdx.



Then we bind the variables to the fields dest, par0, par1, par2, res. Then there are sub-tables, and after them - what it was all about - the main (root) tables with the main instructions.



Here it is necessary to strictly follow the “operand mnemonics” format, however, relief is given for modifiers, which, nevertheless, must be attached to the mnemonics or to the section with operands. No other formats are allowed, even the same Hexagon DSP asm will have to be adapted to this syntax, which, however, is not very difficult.



The final step will be to write an implementation for instructions in the Pcode firmware language. The only thing I would like to note from the first example is the ISCADD .CC and IMAD.HI instructions, where the first one takes a pointer to registers and dereferences them as pointers for 8 bytes instead of 4. This is done intentionally in order to better adapt to the decompiler and its behavior, despite what is written in the Nvidia documentation about the transfer of part of the amount.



For the second example, it is better to check the box in the decompiler settings opposite the “Simplify predication” inscription. The point is that predicates are one and the same condition for different instructions, in fact, nothing more than the well-known "SIMD", or its next analogue. Those. if the predicate bit is set, then the instructions are executed, moreover, in a row.



You also need to get in the habit of immediately writing an implementation for each instruction, and not just a prototype (operand mnemonics), because there is also a decompiler, an emulator, and other analyzers.

But in general, writing an implementation in Pcode is an even easier task than writing a grammar for a byte decoder. It quickly turned out to fix the implementation for some complex instructions from x86 (and not only), thanks to a very convenient intermediate language, a single middleland (optimizer), 2 backends (mostly C; as an alternative - Java / C #, more like the latter, i.e. (since a goto appears from time to time, but not a labeled break).

In future articles, there will probably also be frontends for managed languages ​​such as DXBC, SPIR-V, they will use Java / C # backend. But so far only machine codes are in the plans. bytecodes require a special approach. Ghidra Help



Project : Pcode Sleigh














All Articles