😘 📚 👨🏽‍🎓 Radix sort - squeeze everything out ⛄️ 👩🏼‍💼 👲🏻

Greetings to all algorithm lovers. I want to tell you about my research on sorting in general and delve into the consideration of the sorting radix.

As a developer with many years of experience, he increasingly began to face a strange tendency in software development:

Despite the development of the hardware of modern computers and improvements in algorithms, in general, the code performance not only did not increase, but in some places it degraded.

I think this is due to the general idea of giving preference to fast programming using increasingly powerful frameworks and ultra-high-level / scripting languages. Languages like Ruby or Python are incredibly developer friendly. A lot of 'syntactic sugar', I would even say 'Honey', speed up development at times, if not by orders of magnitude, but at what cost. As a user, I am annoyed by the low thermal efficiency of the code, I will simply keep silent about the amount of memory consumed, but the main resource of humanity is time. It disappears without a trace in endless abstractions, is stolen by code analyzers, cleared out by smart garbage collectors. I do not urge to return to the past, abandoning the benefits of modern development, write "expensive" code,I just propose to think about the possible elimination of performance bottlenecks where possible in typical tasks. This can often be achieved by optimizing high-load sections of code.

Sorting can be singled out as one of the basic optimization tasks. The topic is so explored, up and down, that it would seem that it is difficult to find something interesting along the way. However, we will try.

We will not sort small arrays (less than a million elements). Even if it is extremely inefficient to do this, it is quite difficult to feel the drawdowns, since they are leveled by the performance of modern equipment. Large amounts of data (billions of elements) are another matter; the speed of execution varies greatly from a competent selection of an algorithm.

All algorithms based on comparisons generally solve the sorting problem no better than O (n * Log n). For large n, the efficiency quickly decreases and it is not possible to change this situation. This trend can be corrected by abandoning comparison-based methods. The most promising for me is the Radix sort algorithm. Its computational complexity is O (k * n), where k is the number of passes through the array. If n is large enough, but k, on the contrary, is very small, then this algorithm wins over O (n * Log n).

In modern processor architecture, k is almost always reduced to the number of bytes of the sorted number. For example, for DWord (int) k = 4, which is not a lot. This is some kind of "potential pit" of calculations, which is due to several factors:

Processor registers are sharpened for 8-bit operators at the hardware level
The used counting buffer in the algorithm fits into one line of the L1-processor cache. (256 * 4-byte numbers)

You can try to verify the truth of this statement yourself. However, at the current time, dividing by bit is the best option. I do not exclude that when the L1 cache of processors grows to 256 KB, the option to divide along the 2-byte boundary will become more profitable.

An efficient implementation of sorting is not only an algorithm, but also a delicate engineering problem to optimize code.

In this solution, the algorithm consists of several stages:

, ,
,
(LSD),

We apply the LSD algorithm as faster (at least in my version) due to smoother processing with various fluctuations of the input data.

The original fully sorted array is the worst case for the algorithm, since the data will still be fully sorted. On the other hand, Radix sort is extremely effective on random or mixed data.

Sorting a simple array of numbers is rare, usually you need a dictionary of the form: key - value, where the value can be an index or a pointer.

For universalization, we use a structure of the form:

typedef struct TNode {
  //unsigned long long key;
  unsigned int key;
  //unsigned short key;
  //unsigned  char key;
  unsigned int value;
  //unsigned int value1;
  //unsigned int value2;
} TNode;

Naturally, the smaller the bitness of the key, the faster the algorithm works. Since the algorithm does not work with pointers to a structure, but actually drives it in memory. With a minimum of fields, we get high speed. With an increase in the volume of data fields of the structure, the efficiency drops dramatically.

Earlier, I already wrote a note with the implementation of Radix sort in Pascal, but the "segregation" of programming languages is gaining unprecedented rates among the regulars of this resource. Therefore, I decided to partially rewrite the code for this article in 'si' as more ~~'correct'~~a common language in the community of programmers (Although the output of the assembler code generation with Pascal is almost identical in places). When assembling, I recommend using the –O2 or higher key.

In contrast to the previous implementation, using pointers in loop addressing instead of bitwise shift operations, it was possible to get an even larger 1-2% increase in the algorithm speed.

#include <stdio.h>
#include <omp.h>

#include <time.h>
#include <windows.h>
#include <algorithm>
//=============================================================
typedef struct TNode {
  //unsigned long long key;
  unsigned int key;
  //unsigned short key;
  //unsigned  char key;
  unsigned int value;
  //unsigned int value1;
  //unsigned int value2;
} TNode;
//=============================================================
void RSort_step(TNode *source, TNode *dest, unsigned int n, unsigned int *offset, unsigned char sortable_bit)
{
  unsigned char *b = (unsigned char*)&source[n].key + sortable_bit;
  TNode *v = &source[n];
  while (v >= source)
  {
    dest[--offset[*b]] = *v--;
    b -= sizeof(TNode);
  }
}
//=============================================================
void RSort_Node(TNode *m, unsigned int n)
{
  //     
  TNode *m_temp = (TNode*)malloc(sizeof(TNode) * n);

  //   
  unsigned int s[sizeof(m->key) * 256] = {0};
  //      
  unsigned char *b = (unsigned char*)&m[n-1].key;
  while (b >= (unsigned char*)&m[0].key)
  {
    for (unsigned int digit=0; digit< sizeof(m->key); digit++)
    {
      s[*(b+digit)+256*digit]++;
    }
    b -= sizeof(TNode);
  }

  //    
  for (unsigned int i = 1; i < 256; i++)
  {
    for (unsigned int digit=0; digit< sizeof(m->key); digit++)
    {
      s[i+256*digit] += s[i-1+256*digit];
    }
  }

  //         (LSD)
  for (unsigned int digit=0; digit< sizeof(m->key); digit++)
  {
    RSort_step(m, m_temp, n-1, &s[256*digit] ,digit);
    TNode *temp = m;
    m = m_temp;
    m_temp = temp;
  }

  //    ,     
  if (sizeof(m->key)==1)
  {
    TNode *temp = m;
    m = m_temp;
    m_temp = temp;
    memcpy(m, m_temp, n * sizeof(TNode));
  }

  free(m_temp);
}
//=============================================================
int main()
{
  unsigned int n=10000000;

  LARGE_INTEGER frequency;
  LARGE_INTEGER t1, t2, t3, t4;
  double elapsedTime;

  TNode *m1 = (TNode*)malloc(sizeof(TNode) * n);
  TNode *m2 = (TNode*)malloc(sizeof(TNode) * n);
  srand(time(NULL));

  for (unsigned int i=0; i<n; i++)
  {
      m1[i].key = rand()*RAND_MAX+rand();
      m2[i].key = m1[i].key;
  }

  QueryPerformanceFrequency(&frequency);
  QueryPerformanceCounter(&t1);

  RSort_Node(m1, n);

  QueryPerformanceCounter(&t2);
  elapsedTime=(float)(t2.QuadPart-t1.QuadPart)/frequency.QuadPart;
  printf("The RSort:          %.5f seconds\n", elapsedTime);

  QueryPerformanceFrequency(&frequency);
  QueryPerformanceCounter(&t3);

  std::sort(m2, m2+n,[](const TNode &a, const TNode &b){return a.key < b.key;});

  QueryPerformanceCounter(&t4);
  elapsedTime=(float)(t4.QuadPart-t3.QuadPart)/frequency.QuadPart;
  printf("The std::sort:      %.5f seconds\n", elapsedTime);

  for (unsigned int i=0; i<n; i++) 
  {
    if (m1[i].key!=m2[i].key) 
    {
      printf("\n\n!!!!!\n");
      break;
    }
  }

  free(m1);
  free(m2);
  return 0;
}

Pascal

program SORT;
uses
  SysUtils, Windows;
//=============================================================
type TNode = record
     key : Longword;
     //value : Longword;
end;
type ATNode = array of TNode;
//=============================================================
procedure RSort_Node(var m: array of TNode);
//------------------------------------------------------------------------------
    procedure Sort_step(var source, dest: array of TNode; len : Longword; offset: PLongword; const num: Byte);
        var b : ^Byte;
            v : ^TNode;
        begin
          b:=@source[len];
          v:=@source[len];
          inc(b,num);
          while v >= @source do
          begin
            dec(offset[b^]);
            dest[offset[b^]] := v^;
            dec(b,SizeOf(TNode));
            dec(v);
          end;
        end;
//------------------------------------------------------------------------------
var //    ,    
    s: array[0..1023] of Longword =(
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
    i : Longword;
    b : ^Byte;
    p : Pointer;
begin

  GetMem(p, SizeOf(TNode)*Length(m));

  //  
  b:=@m[High(m)];
  while (b >= @m[0]) do
  begin
    Inc(s[(b+3)^+256*3]);
    Inc(s[(b+2)^+256*2]);
    Inc(s[(b+1)^+256*1]);
    Inc(s[(b+0)^+256*0]);
    dec(b,SizeOf(TNode));
  end;

  //    
  for i := 1 to 255 do                             
  begin
    Inc(s[i+256*0], s[i-1+256*0]);
    Inc(s[i+256*1], s[i-1+256*1]);
    Inc(s[i+256*2], s[i-1+256*2]);
    Inc(s[i+256*3], s[i-1+256*3]);
  end;

  //        
  Sort_step(m, ATNode(p), High(m), @s[256*0], 0);                  
  Sort_step(ATNode(p), m, High(m), @s[256*1], 1);
  Sort_step(m, ATNode(p), High(m), @s[256*2], 2);
  Sort_step(ATNode(p), m, High(m), @s[256*3], 3);

  FreeMem(p);
end;
//=============================================================
 procedure test();
  const
    n = 10000000;
  var
    m1: array of TNode;  
    i,j,k,j1,j2: Longword;

    iCounterPerSec: TLargeInteger;
    T1, T2: TLargeInteger; //       
  begin
    SetLength(m1,n);

    for i := 0 to n - 1 do
    begin
      m1[i].key := Random(65536 * 65536);
    end;  

  QueryPerformanceFrequency(iCounterPerSec);//  
  QueryPerformanceCounter(T11); //   

  RSort_Node(m1);
    
  QueryPerformanceCounter(T22);//  
  WRITELN('1='+FormatFloat('0.0000', (T22 - T11)/iCounterPerSec) + ' sec.');//     

    SetLength(m, 0);
  end;
  //------------------------------------------------------------------------------
begin
  test();
  Readln();
  exit;
end.

I meditated on this code for a very long time, but I could not write better. Perhaps you can tell me how to make this code faster.

And if you want a little faster?

The next logical step, as I saw it, is to use a video card. On the Internet, I saw a lot of reasoning that Radix sort is perfectly parallel on many video card cores (almost the best sorting method is not a video card).

Having succumbed to the temptation to get additional performance, I implemented several sorting options using OpenCL, which is familiar to me. Unfortunately, I do not have the opportunity to check the implementation on top-end video cards, but on my GEFORCE GTX 750 TI, the algorithm lost to the single-threaded implementation on the CPU, due to the fact that the data had to be sent to the video card and then taken back. If you didn't run the data back and forth on the bus, the speed would be acceptable, but still not a fountain. There is one more remark. In OpenCL, threads of execution are not synchronous (workgroups are executed in an arbitrary order, as far as I know, this is not the case in CUDA, correct, who knows), which prevents you from writing more efficient code in this case.

Code from the series: is it possible to create a trolleybus ... processing in OpenCL at Delphi ... but why?

I was too lazy to rewrite in 'C', I post it as it is.

program project1;

uses Cl, SysUtils, Windows, Math;
//------------------------------------------------------------------------------
function OCL_Get_Prog(context: cl_context; pdevice_id: Pcl_device_id; name: PChar; S:AnsiString):cl_kernel;
var   Tex:PCHAR;
      Len:QWORD;
      PLen:PQWORD;
      Prog:cl_program;
      kernel:cl_kernel;
      Ret:cl_int;
begin
   Tex:=@S[1];
   Len:=Length(S);
   PLen:=@LEN;
   prog:=nil;
   kernel:=nil;

     //     
     prog:= clCreateProgramWithSource(context, 1, @Tex, @Len, ret);
     if CL_SUCCESS<>ret then writeln('clCreateProgramWithSource Error ',ret);
     //  
     ret:= clBuildProgram(prog, 1, pdevice_id, nil, nil, nil);
     if CL_SUCCESS<>ret then writeln('clBuildProgram            Error ',ret);
     //      
     kernel:= clCreateKernel(prog, name, ret);
     if  ret<>CL_SUCCESS then writeln('clCreateKernel            Error ',ret);
     //  
     clReleaseProgram(prog);
     OCL_Get_Prog:=kernel;
end;
//------------------------------------------------------------------------------
var
   context:cl_context;
   kernel1, kernel2, kernel3, kernel4 :cl_kernel;
   Ret:cl_int;

   valueSize : QWord =0;
   s0 : PChar;

   platform_id:cl_platform_id;
   ret_num_platforms:cl_uint;
   ret_num_devices:cl_uint;
   device_id:cl_device_id;
   command_queue:cl_command_queue;
   S1,S2,S3,S4 : AnsiString;
   memobj1, memobj2, memobj3 :cl_mem;
   mem:Array of LongWord;
   size:LongWord;
   g_works, l_works :LongInt;

   iCounterPerSec: TLargeInteger;
   T1, T2, T3, T4: TLargeInteger; //     
   i,j,step:LongWord;

procedure exchange_memobj(var a,b:cl_mem);
var c:cl_mem;
begin
  c:=a;
  a:=b;
  b:=c;
end;

begin
   //---------------------------------------------------------------------------
   S1 :=
   //   1 (O  )
   '__kernel void sort1(__global uint* m1) {'+
   ' uint g_id = get_global_id(0);'+
   ' m1[g_id] = 0;'+
   '}';
   //---------------------------------------------------------------------------
   S2 :=
   //   2 (   )
   '__kernel void sort2(__global uint* m1, __global uint* m2, const uint len, const uint digit) {'+
   ' uint g_id = get_global_id(0);'+
   ' uint size = get_global_size(0);'+
   ' uint a = g_id / len;'+
   ' uchar key = (m1[g_id] >> digit);'+
   ' atomic_inc(&m2[key]);'+
   ' atomic_inc(&m2[256 * a + key + 256]);'+
   '}';
   //---------------------------------------------------------------------------
   S3 :=
   //   3 ( )
   '__kernel void sort3(__global uint* m1, const uint len) {'+
   ' uint l_id = get_global_id(0);'+

   ' for (uint i = 0; i < 8; i++) {'+
   '  uint offset = 1 << i;'+
   '  uint add = (l_id>=offset) ? m1[l_id - offset] : 0;'+
   '  barrier(CLK_GLOBAL_MEM_FENCE);'+
   '  m1[l_id] += add;'+
   '  barrier(CLK_GLOBAL_MEM_FENCE);'+
   ' }'+

   ' for (uint i = 1; i < 1024; i++) {'+
   '  m1[i*256+l_id + 256] += m1[(i-1)*256+l_id + 256];'+
   ' }'+

   '  barrier(CLK_GLOBAL_MEM_FENCE);'+

   ' if (l_id>0) {'+
   '  for (uint i = 0; i < 1024; i++) {'+
   '   m1[i*256+l_id + 256] += m1[l_id-1];'+
   '  }'+
   ' }'+

   '}';

   //---------------------------------------------------------------------------
   S4 :=
   //   4
   '__kernel void sort4(__global uint* m1, __global uint* m2, __global uint* m3, const uint digit, const uint len) {'+
   ' uint g_id = get_global_id(0);'+
   ' for (int i = len-1; i >= 0; i--) {'+      //    !
   '  uchar key = (m1[g_id*len+i] >> digit);'+
   '  m2[--m3[g_id*256 + key + 256]] = m1[g_id*len+i];'+
   ' }'+
   '}';
   //---------------------------------------------------------------------------

   //   
   ret := clGetPlatformIDs(1,@platform_id,@ret_num_platforms);
   if CL_SUCCESS<>ret then writeln('clGetPlatformIDs          Error ',ret);
   //   
   ret := clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, @device_id, @ret_num_devices);
   if CL_SUCCESS<>ret then writeln('clGetDeviceIDs            Error ',ret);

   clGetDeviceInfo(device_id, CL_DEVICE_NAME, 0, nil, valueSize);
   GetMem(s0, valueSize);
   clGetDeviceInfo(device_id, CL_DEVICE_NAME, valueSize, s0, valueSize);
   Writeln('DEVICE_NAME:  '+s0);
   FreeMem(s0);

   //    
   context:= clCreateContext(nil, 1, @device_id, nil, nil, ret);
   if CL_SUCCESS<>ret then writeln('clCreateContext           Error ',ret);

   //       
   command_queue := clCreateCommandQueue(context, device_id, 0, ret);
   if CL_SUCCESS<>ret then writeln('clCreateContext           Error ',ret);

   //-------------------------------------------------------------
   kernel1 := OCL_Get_Prog(context, @device_id, 'sort1', S1);
   kernel2 := OCL_Get_Prog(context, @device_id, 'sort2', S2);
   kernel3 := OCL_Get_Prog(context, @device_id, 'sort3', S3);
   kernel4 := OCL_Get_Prog(context, @device_id, 'sort4', S4);
   //-------------------------------------------------------------

   size:=256*256*16*10;

   g_works := size;
   l_works := 256;

   Randomize;
   SetLength(mem, size);

   for i:=0 to size-1 do mem[i]:=random(256*256*256*256);

   //      
   memobj1 := clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(cl_uint), nil, ret);
   if  ret<>CL_SUCCESS then writeln('clCreateBuffer1            Error ',ret);
   memobj2 := clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(cl_uint), nil, ret);
   if  ret<>CL_SUCCESS then writeln('clCreateBuffer2            Error ',ret);
   memobj3 := clCreateBuffer(context, CL_MEM_READ_WRITE, (256+256*1024) * sizeof(cl_uint), nil, ret);
   if  ret<>CL_SUCCESS then writeln('clCreateBuffer3            Error ',ret);

   QueryPerformanceFrequency(iCounterPerSec); //  
   QueryPerformanceCounter(T1); //   

   //    
   ret := clEnqueueWriteBuffer(command_queue, memobj1, CL_TRUE, 0, size * sizeof(cl_int), @mem[0], 0, nil, nil);
   if  ret<>CL_SUCCESS then writeln('clEnqueueWriteBuffer      Error ',ret);

   QueryPerformanceCounter(T2);//  
   Writeln('write       '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.');//     

   QueryPerformanceFrequency(iCounterPerSec); //  
   QueryPerformanceCounter(T3); //   


   for step:=0 to 3 do
   begin

   //-------------------------------------------------------------
   QueryPerformanceFrequency(iCounterPerSec); //  
   QueryPerformanceCounter(T1); //   
   //-------------------------------------------------------------
   // 1  ( )
   ret:= clSetKernelArg(kernel1, 0, sizeof(cl_mem), @memobj3 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_1_1            Error ',ret);

   i := 256+256*1024;
   ret:= clEnqueueNDRangeKernel(command_queue, kernel1, 1, nil, @i, nil, 0, nil, nil);
   if  ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_1    Error ',ret);
   //-------------------------------------------------------------

   clFinish(command_queue); //      
   QueryPerformanceCounter(T2);  //  
   Writeln('step 1      '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.');//     

   QueryPerformanceFrequency(iCounterPerSec); //  
   QueryPerformanceCounter(T1);  //   
   //-------------------------------------------------------------
   // 2  ( )

   ret:= clSetKernelArg(kernel2, 0, sizeof(cl_mem), @memobj1 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_2_1            Error ',ret);
   ret:= clSetKernelArg(kernel2, 1, sizeof(cl_mem), @memobj3 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_2_2            Error ',ret);
   j := size div (1024);
   ret:= clSetKernelArg(kernel2, 2, sizeof(j), @j );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_2_3            Error ',ret);

   j:=step*8;
   ret:= clSetKernelArg(kernel2, 3, sizeof(j), @j );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_2_4            Error ',ret);

   i := size;
   ret:= clEnqueueNDRangeKernel(command_queue, kernel2, 1, nil, @i, nil, 0, nil, nil);
   if  ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_2    Error ',ret);

   //-------------------------------------------------------------
   clFinish(command_queue); //      
   QueryPerformanceCounter(T2);//  
   Writeln('step 2      '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.');//     
   QueryPerformanceFrequency(iCounterPerSec);//  
   QueryPerformanceCounter(T1); //   
   //-------------------------------------------------------------
   // 3  ( )

   ret:= clSetKernelArg(kernel3, 0, sizeof(cl_mem), @memobj3 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_3_1            Error ',ret);

   j := size;
   ret:= clSetKernelArg(kernel3, 1, sizeof(j), @j );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_3_3            Error ',ret);

   j := 256;
   ret:= clEnqueueNDRangeKernel(command_queue, kernel3, 1, nil, @j, @j, 0, nil, nil);
   if  ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_3    Error ',ret);

   //-------------------------------------------------------------
   clFinish(command_queue); //      
   QueryPerformanceCounter(T2);//  
   Writeln('step 3      '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.');//     
   QueryPerformanceFrequency(iCounterPerSec);//  
   QueryPerformanceCounter(T1); //   
   //-------------------------------------------------------------

   // 4  ()
   ret:= clSetKernelArg(kernel4, 0, sizeof(cl_mem), @memobj1 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_4_1            Error ',ret);
   ret:= clSetKernelArg(kernel4, 1, sizeof(cl_mem), @memobj2 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_4_2            Error ',ret);
   ret:= clSetKernelArg(kernel4, 2, sizeof(cl_mem), @memobj3 );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_4_3            Error ',ret);

   j:=step*8;
   ret:= clSetKernelArg(kernel4, 3, sizeof(j), @j );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_4_4            Error ',ret);

   j := size div (1024);
   ret:= clSetKernelArg(kernel4, 4, sizeof(j), @j );
   if  ret<>CL_SUCCESS then writeln('clSetKernelArg_4_5            Error ',ret);

   i := (1024);  //  
   ret:= clEnqueueNDRangeKernel(command_queue, kernel4, 1, nil, @i, nil, 0, nil, nil);
   if  ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_4    Error ',ret);
   clFinish(command_queue);
   //        
   //    memobj1
   exchange_memobj(memobj2, memobj1);
   //-------------------------------------------------------------
   clFinish(command_queue); //      
   QueryPerformanceCounter(T2);//  
   Writeln('step 4      '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.');//     
   //-------------------------------------------------------------
   end;

   QueryPerformanceCounter(T4);//  
   Writeln('all not R/W '+FormatFloat('0.0000', (T4 - T3)/iCounterPerSec) + ' second.');//     


   QueryPerformanceFrequency(iCounterPerSec);//  
   QueryPerformanceCounter(T1); //   

   //    
   ret:= clEnqueueReadBuffer(command_queue, memobj1, CL_TRUE, 0, size * sizeof(cl_int), @mem[0], 0, nil, nil);
   if  ret<>CL_SUCCESS then writeln('clEnqueueReadBuffer       Error ',ret);

   QueryPerformanceCounter(T2);//  
   Writeln('Read        '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.');//     


    //  
   clReleaseMemObject(memobj1);           //   
   clReleaseMemObject(memobj2);
   clReleaseMemObject(memobj3);
   clReleaseKernel(kernel1);              //   
   clReleaseKernel(kernel2);
   clReleaseKernel(kernel3);
   clReleaseKernel(kernel4);
   clReleaseCommandQueue(command_queue);  //  
   clReleaseContext(context);             //   
   //-------------------------------------------------------------
   SetLength(mem, 0);
   readln;
end.

There is one more untested option left - multithreading. Already using only bare C and the OpenMP library, I decided to find out what effect the use of multiple CPU cores would have.

Initially, the idea was to split the original array into equal parts, transfer them to separate streams, and then merge them (merge sort). The sorting was not bad, but the merge slowed down the whole structure a lot, each gluing is equivalent to an additional pass through the array. The effect was worse than working in one thread. Rejected the implementation as not practical.

As a result, I applied a parallel sorting very similar to the one used on the GPU. With her, everything turned out much better.

Features of parallel processing:

In the current implementation, he avoided thread synchronization problems as much as possible (atomic functions are also not used due to the fact that different threads read different, albeit sometimes neighboring, memory blocks). Threads are not a free thing; the processor spends precious microseconds on their creation and synchronization. By keeping barriers to a minimum, you can save a little. Unfortunately, since all threads use the same L3 cache memory and RAM, the overall gain of the algorithm is not so significant due to Amdahl's law , with an increase in the number of threads.

#include <stdio.h>
#include <omp.h>
//=============================================================
typedef struct TNode {
  //unsigned long long key;
  unsigned int key;
  //unsigned short key;
  //unsigned  char key;
  unsigned int value;
  //unsigned int value1;
  //unsigned int value2;
} TNode;
//=============================================================
void RSort_Parallel(TNode *m, unsigned int n)
{
  //   
  unsigned char threads = omp_get_num_procs();
  //     
  TNode *m_temp = (TNode*)malloc(sizeof(TNode) * n);
  unsigned int *s = (unsigned int*)malloc(sizeof(unsigned int) * 256 * threads);

  #pragma omp parallel num_threads(threads)
  {
    TNode *source = m;
    TNode *dest = m_temp;
    unsigned int l = omp_get_thread_num();
    unsigned int div = n / omp_get_num_threads();
    unsigned int mod = n % omp_get_num_threads();
    unsigned int left_index  = l < mod ? (div + (mod == 0 ? 0 : 1)) * l : n - (omp_get_num_threads() - l) * div;
    unsigned int right_index = left_index + div - (mod > l ? 0 : 1);

    for (unsigned int digit=0; digit< sizeof(m->key); digit++)
    {
      unsigned int s_sum[256] = {0};
      unsigned int s0[256] = {0};
      unsigned char *b1 = (unsigned char*)&source[right_index].key;
      unsigned char *b2 = (unsigned char*)&source[left_index].key;
      while (b1 >= b2)
      {
        s0[*(b1+digit)]++;
        b1 -= sizeof(TNode);
      }
      for (unsigned int i=0; i<256; i++)
      {
        s[i+256*l] = s0[i];
      }

      #pragma omp barrier
      for (unsigned int j=0; j<threads; j++)
      {
        for (unsigned int i=0; i<256; i++)
        {
          s_sum[i] += s[i+256*j];
          if (j<l)
          {
            s0[i] += s[i+256*j];
          }
        }
      }

      for (unsigned int i=1; i<256; i++)
      {
        s_sum[i] += s_sum[i-1];
        s0[i] += s_sum[i-1];
      }
      unsigned char *b = (unsigned char*)&source[right_index].key + digit;
      TNode *v1 = &source[right_index];
      TNode *v2 = &source[left_index];
      while (v1 >= v2)
      {
        dest[--s0[*b]] = *v1--;
        b -= sizeof(TNode);
      }
      #pragma omp barrier
      TNode *temp = source;
      source = dest;
      dest = temp;
    }
  }

  //    ,     
  if (sizeof(m->key)==1)
  {
    memcpy(m, m_temp, n * sizeof(TNode));
  }

  free(s);
  free(m_temp);
}
//=============================================================
int main()
{
  unsigned int n=10000000;

  TNode *m1 = (TNode*)malloc(sizeof(TNode) * n);
  srand(time(NULL));

  for (unsigned int i=0; i<n; i++)
  {
    m1[i].key = rand()*RAND_MAX+rand();
  }

  RSort_Parallel(m1, n);

  free(m1);
  return 0;
}

I hope it was interesting.

Best regards, Yours truly, Rebuilder.

PS

Comparison of algorithms in terms of speed is deliberately not. Comparison results, suggestions and criticism are welcome.

PPS

Radix sort - squeeze everything out

More articles: