Greetings to all algorithm lovers. I want to share my research on sorting in general and take a closer look at radix sort in particular.
As a developer with many years of experience, I have increasingly run into a strange tendency in software development:
despite advances in computer hardware and improvements in algorithms, overall code performance has not only failed to grow, in places it has even degraded.
I think this is due to the general preference for rapid development using ever more powerful frameworks and ultra-high-level / scripting languages. Languages like Ruby or Python are incredibly developer friendly. So much 'syntactic sugar' (I would even say 'honey') speeds up development several times over, if not by orders of magnitude, but at what cost? As a user, I am annoyed by the low efficiency of such code, and I will simply keep silent about the amount of memory consumed; yet the main resource of humanity is time. It disappears without a trace in endless abstractions, is stolen by code analyzers, and is swept away by smart garbage collectors. I am not urging a return to the past, abandoning the benefits of modern development to write 'expensive' code; I merely propose thinking about eliminating performance bottlenecks where possible in typical tasks. This can often be achieved by optimizing high-load sections of code.
Sorting can be singled out as one of the basic optimization tasks. The topic has been explored up and down, so thoroughly that it would seem hard to find anything interesting left in it. Still, we will try.
We will not bother with small arrays (under a million elements): even if they are sorted extremely inefficiently, the losses are hard to notice, since they are masked by the performance of modern hardware. Large data volumes (billions of elements) are another matter; there, execution speed depends heavily on a competent choice of algorithm.
All comparison-based algorithms solve the sorting problem no better than O(n * log n). For large n their efficiency drops quickly, and within comparison sorting nothing can change that. The situation can be improved by abandoning comparison-based methods altogether. To me, the most promising is the Radix sort algorithm. Its computational complexity is O(k * n), where k is the number of passes over the array. If n is large enough and k, on the contrary, is quite small, this algorithm beats the O(n * log n) family.
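A back-of-the-envelope comparison: for n = 10^9, a comparison sort performs on the order of n * log2(n) ≈ 30 billion elementary operations, while a byte-wise radix sort of 32-bit keys performs about k * n = 4 billion, roughly 7-8 times fewer. Constant factors and memory traffic blur this picture in practice, but the asymptotic gap is real.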
On modern processor architectures, k almost always comes down to the number of bytes in the sorted key. For a DWord (int), for example, k = 4, which is not much. This is a kind of "potential well" for the computation, owing to several factors:
- processor registers handle 8-bit operands natively at the hardware level;
- the counting buffer used by the algorithm (256 four-byte counters, 1 KB per digit) fits comfortably into the processor's L1 cache.
You can try to verify this claim yourself. For now, splitting the key along byte boundaries is the best option. I do not rule out that when processors' L1 caches grow to 256 KB, splitting along 2-byte boundaries will become more profitable.
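The arithmetic behind that guess: a 16-bit digit requires 2^16 = 65536 counters of 4 bytes each, i.e. 256 KB of counting buffer per digit, which overflows today's typical 32-64 KB L1 data caches; the byte-wise variant needs only 256 * 4 B = 1 KB per digit, 4 KB for all four digits of a 32-bit key.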
An efficient sorting implementation is not just an algorithm; it is also a delicate engineering problem of code optimization.
In this solution, the algorithm consists of several stages:
- counting: the occurrences of every byte value of the key are tallied (histograms for all digits are built in a single pass);
- prefix sums: the counters are turned into bucket offsets in the output;
- permutation passes from the least significant byte to the most significant (LSD).
I use the LSD variant as the faster one (at least in my implementation), since it copes more smoothly with various fluctuations of the input data.
An already fully sorted array is the worst case for this algorithm: the data is still pushed through every pass, whereas adaptive comparison sorts would notice the existing order and finish in nearly linear time. On the other hand, Radix sort is extremely effective on random or shuffled data.
Sorting a plain array of numbers is rare; usually you need a dictionary of the form key-value, where the value may be an index or a pointer.
For generality, we use a structure of the form:
typedef struct TNode {
//unsigned long long key; // alternative key widths to experiment with
unsigned int key;
//unsigned short key;
//unsigned char key;
unsigned int value; // payload: an index or a pointer
//unsigned int value1; // optional extra payload fields
//unsigned int value2;
} TNode;
Naturally, the narrower the key, the faster the algorithm works. And since the algorithm works not with pointers to the structures but physically moves the structures around in memory, keeping the fields to a minimum gives high speed, while growing the payload fields makes efficiency drop sharply.
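A rough estimate of why: every one of the k permutation passes reads and writes the entire array, so the memory traffic is about 2 * k * n * sizeof(TNode) bytes. For n = 10^7, sizeof(TNode) = 8 and k = 4 that is roughly 640 MB of traffic; double the structure to 16 bytes and the traffic doubles too, while the useful work stays the same.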
Earlier, I already wrote a note with a Radix sort implementation in Pascal, but the "segregation" of programming languages is gaining unprecedented momentum among the regulars of this resource. I therefore decided to partially rewrite the code for this article in C, as the more widely accepted language.
Compared to the previous implementation, using pointer arithmetic in the loop addressing instead of bitwise shift operations yielded a further 1-2% gain in the algorithm's speed.
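To illustrate the difference, here is a minimal standalone sketch of the two ways to extract the byte being sorted on (it is not part of the sorting code; note that the pointer variant, like all the listings below, assumes a little-endian byte order):
#include <stdio.h>
int main(void)
{
unsigned int key = 0x12345678u;
unsigned int digit = 2; // which byte of the key we want
// conventional variant: shift and mask
unsigned char d1 = (key >> (8 * digit)) & 0xFFu;
// pointer variant: read the byte straight from memory
unsigned char d2 = *((unsigned char*)&key + digit);
printf("%02X %02X\n", d1, d2); // prints "34 34" on little-endian machines
return 0;
}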
C
#include <stdio.h>
#include <stdlib.h>   // malloc/free, rand/srand
#include <string.h>   // memcpy
#include <time.h>
#include <windows.h>
#include <algorithm>  // std::sort for the benchmark (compiled as C++)
//=============================================================
typedef struct TNode {
//unsigned long long key;
unsigned int key;
//unsigned short key;
//unsigned char key;
unsigned int value;
//unsigned int value1;
//unsigned int value2;
} TNode;
//=============================================================
// One LSD pass: scatter the elements into dest by byte 'sortable_bit' of the
// key; the backward walk together with decrementing offsets keeps it stable.
void RSort_step(TNode *source, TNode *dest, unsigned int n, unsigned int *offset, unsigned char sortable_bit)
{
unsigned char *b = (unsigned char*)&source[n].key + sortable_bit;
TNode *v = &source[n];
while (v >= source)
{
dest[--offset[*b]] = *v--;
b -= sizeof(TNode);
}
}
//=============================================================
void RSort_Node(TNode *m, unsigned int n)
{
// temporary buffer for the ping-pong passes
TNode *m_temp = (TNode*)malloc(sizeof(TNode) * n);
// counters: 256 buckets for each byte (digit) of the key
unsigned int s[sizeof(m->key) * 256] = {0};
// counting pass: build the histograms for all digits in one sweep
unsigned char *b = (unsigned char*)&m[n-1].key;
while (b >= (unsigned char*)&m[0].key)
{
for (unsigned int digit=0; digit< sizeof(m->key); digit++)
{
s[*(b+digit)+256*digit]++;
}
b -= sizeof(TNode);
}
// turn the counters into prefix sums (bucket end offsets)
for (unsigned int i = 1; i < 256; i++)
{
for (unsigned int digit=0; digit< sizeof(m->key); digit++)
{
s[i+256*digit] += s[i-1+256*digit];
}
}
// LSD passes, ping-ponging between the two buffers
for (unsigned int digit=0; digit< sizeof(m->key); digit++)
{
RSort_step(m, m_temp, n-1, &s[256*digit] ,digit);
TNode *temp = m;
m = m_temp;
m_temp = temp;
}
// for an odd number of passes (1-byte key) the result is in the temporary buffer; copy it back
if (sizeof(m->key)==1)
{
TNode *temp = m;
m = m_temp;
m_temp = temp;
memcpy(m, m_temp, n * sizeof(TNode));
}
free(m_temp);
}
//=============================================================
int main()
{
unsigned int n=10000000;
LARGE_INTEGER frequency;
LARGE_INTEGER t1, t2, t3, t4;
double elapsedTime;
TNode *m1 = (TNode*)malloc(sizeof(TNode) * n);
TNode *m2 = (TNode*)malloc(sizeof(TNode) * n);
srand(time(NULL));
for (unsigned int i=0; i<n; i++)
{
m1[i].key = (unsigned int)rand()*RAND_MAX + rand(); // compose a wider random key (avoids signed overflow)
m2[i].key = m1[i].key;
}
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&t1);
RSort_Node(m1, n);
QueryPerformanceCounter(&t2);
elapsedTime=(double)(t2.QuadPart-t1.QuadPart)/frequency.QuadPart;
printf("The RSort: %.5f seconds\n", elapsedTime);
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&t3);
std::sort(m2, m2+n,[](const TNode &a, const TNode &b){return a.key < b.key;});
QueryPerformanceCounter(&t4);
elapsedTime=(double)(t4.QuadPart-t3.QuadPart)/frequency.QuadPart;
printf("The std::sort: %.5f seconds\n", elapsedTime);
// verify: both sorts must produce identical key sequences
for (unsigned int i=0; i<n; i++)
{
if (m1[i].key!=m2[i].key)
{
printf("\n\nSorting mismatch!\n");
break;
}
}
free(m1);
free(m2);
return 0;
}
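A build note: despite the C flavor, this listing uses std::sort with a lambda, so it must be compiled as C++ (C++11 or newer), for example with g++ -O2 under MinGW on Windows; windows.h is only needed for the QueryPerformanceCounter timing.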
Pascal
program SORT;
uses
SysUtils, Windows;
//=============================================================
type TNode = record
key : Longword;
//value : Longword;
end;
type ATNode = array of TNode;
//=============================================================
procedure RSort_Node(var m: array of TNode);
//------------------------------------------------------------------------------
// one LSD pass: scatter source into dest by byte number 'num' of the key
procedure Sort_step(var source, dest: array of TNode; len : Longword; offset: PLongword; const num: Byte);
var b : ^Byte;
v : ^TNode;
begin
b:=@source[len];
v:=@source[len];
inc(b,num);
while v >= @source do
begin
dec(offset[b^]);
dest[offset[b^]] := v^;
dec(b,SizeOf(TNode));
dec(v);
end;
end;
//------------------------------------------------------------------------------
var // counter array for all four digits, pre-initialized with zeros
s: array[0..1023] of Longword =(
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
i : Longword;
b : ^Byte;
p : Pointer;
begin
GetMem(p, SizeOf(TNode)*Length(m));
// counting pass: histograms for all four key bytes in one sweep
b:=@m[High(m)];
while (b >= @m[0]) do
begin
Inc(s[(b+3)^+256*3]);
Inc(s[(b+2)^+256*2]);
Inc(s[(b+1)^+256*1]);
Inc(s[(b+0)^+256*0]);
dec(b,SizeOf(TNode));
end;
// prefix sums
for i := 1 to 255 do
begin
Inc(s[i+256*0], s[i-1+256*0]);
Inc(s[i+256*1], s[i-1+256*1]);
Inc(s[i+256*2], s[i-1+256*2]);
Inc(s[i+256*3], s[i-1+256*3]);
end;
// four LSD passes, ping-ponging between m and the temporary buffer
Sort_step(m, ATNode(p), High(m), @s[256*0], 0);
Sort_step(ATNode(p), m, High(m), @s[256*1], 1);
Sort_step(m, ATNode(p), High(m), @s[256*2], 2);
Sort_step(ATNode(p), m, High(m), @s[256*3], 3);
FreeMem(p);
end;
//=============================================================
procedure test();
const
n = 10000000;
var
m1: array of TNode;
i: Longword;
iCounterPerSec: TLargeInteger;
T1, T2: TLargeInteger; // timer values
begin
SetLength(m1,n);
for i := 0 to n - 1 do
begin
m1[i].key := Random(65536 * 65536);
end;
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
RSort_Node(m1);
QueryPerformanceCounter(T2); // stop
WRITELN('1='+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' sec.'); // elapsed time
SetLength(m1, 0);
end;
//------------------------------------------------------------------------------
begin
test();
Readln();
exit;
end.
I meditated over this code for a very long time but could not write it any better. Perhaps you can suggest how to make it faster.
And if you want a little faster?
The next logical step, as I saw it, was to use a video card. On the Internet I had seen plenty of claims that Radix sort parallelizes beautifully across the many cores of a GPU (supposedly it is nearly the best sorting method for a video card).
Succumbing to the temptation of extra performance, I implemented several sorting variants in OpenCL, which I am familiar with. Unfortunately, I have no opportunity to test the implementation on top-end video cards, but on my GeForce GTX 750 Ti the algorithm lost to the single-threaded CPU implementation, because the data had to be shipped to the video card and then brought back. Without driving the data back and forth across the bus the speed would be acceptable, though still nothing to write home about. One more remark: in OpenCL the threads of execution are not synchronous (workgroups are executed in an arbitrary order; as far as I know this is not the case in CUDA - correct me, those who know), which gets in the way of writing more efficient code here.
Code from the series "is it possible to build a trolleybus out of..." : processing in OpenCL from Delphi... but why?
I was too lazy to rewrite it in C, so I post it as is.
program project1;
uses Cl, SysUtils, Windows, Math;
//------------------------------------------------------------------------------
function OCL_Get_Prog(context: cl_context; pdevice_id: Pcl_device_id; name: PChar; S:AnsiString):cl_kernel;
var Tex:PCHAR;
Len:QWORD;
PLen:PQWORD;
Prog:cl_program;
kernel:cl_kernel;
Ret:cl_int;
begin
Tex:=@S[1];
Len:=Length(S);
PLen:=@LEN;
prog:=nil;
kernel:=nil;
// create the program object from the source text
prog:= clCreateProgramWithSource(context, 1, @Tex, @Len, ret);
if CL_SUCCESS<>ret then writeln('clCreateProgramWithSource Error ',ret);
// build it for the device
ret:= clBuildProgram(prog, 1, pdevice_id, nil, nil, nil);
if CL_SUCCESS<>ret then writeln('clBuildProgram Error ',ret);
// get the kernel by name
kernel:= clCreateKernel(prog, name, ret);
if ret<>CL_SUCCESS then writeln('clCreateKernel Error ',ret);
// the program object can be released once the kernel is created
clReleaseProgram(prog);
OCL_Get_Prog:=kernel;
end;
//------------------------------------------------------------------------------
var
context:cl_context;
kernel1, kernel2, kernel3, kernel4 :cl_kernel;
Ret:cl_int;
valueSize : QWord =0;
s0 : PChar;
platform_id:cl_platform_id;
ret_num_platforms:cl_uint;
ret_num_devices:cl_uint;
device_id:cl_device_id;
command_queue:cl_command_queue;
S1,S2,S3,S4 : AnsiString;
memobj1, memobj2, memobj3 :cl_mem;
mem:Array of LongWord;
size:LongWord;
g_works, l_works :LongInt;
iCounterPerSec: TLargeInteger;
T1, T2, T3, T4: TLargeInteger; // timer values
i,j,step:LongWord;
procedure exchange_memobj(var a,b:cl_mem);
var c:cl_mem;
begin
c:=a;
a:=b;
b:=c;
end;
begin
//---------------------------------------------------------------------------
S1 :=
// kernel 1 (zeroing the counter buffer)
'__kernel void sort1(__global uint* m1) {'+
' uint g_id = get_global_id(0);'+
' m1[g_id] = 0;'+
'}';
//---------------------------------------------------------------------------
S2 :=
// kernel 2 (counting byte occurrences with atomics)
'__kernel void sort2(__global uint* m1, __global uint* m2, const uint len, const uint digit) {'+
' uint g_id = get_global_id(0);'+
' uint size = get_global_size(0);'+
' uint a = g_id / len;'+
' uchar key = (m1[g_id] >> digit);'+
' atomic_inc(&m2[key]);'+
' atomic_inc(&m2[256 * a + key + 256]);'+
'}';
//---------------------------------------------------------------------------
S3 :=
// kernel 3 (prefix sums; launched as a single 256-item workgroup so the barriers are valid)
'__kernel void sort3(__global uint* m1, const uint len) {'+
' uint l_id = get_global_id(0);'+
' for (uint i = 0; i < 8; i++) {'+
' uint offset = 1 << i;'+
' uint add = (l_id>=offset) ? m1[l_id - offset] : 0;'+
' barrier(CLK_GLOBAL_MEM_FENCE);'+
' m1[l_id] += add;'+
' barrier(CLK_GLOBAL_MEM_FENCE);'+
' }'+
' for (uint i = 1; i < 1024; i++) {'+
' m1[i*256+l_id + 256] += m1[(i-1)*256+l_id + 256];'+
' }'+
' barrier(CLK_GLOBAL_MEM_FENCE);'+
' if (l_id>0) {'+
' for (uint i = 0; i < 1024; i++) {'+
' m1[i*256+l_id + 256] += m1[l_id-1];'+
' }'+
' }'+
'}';
//---------------------------------------------------------------------------
S4 :=
// kernel 4 (permutation/scatter pass)
'__kernel void sort4(__global uint* m1, __global uint* m2, __global uint* m3, const uint digit, const uint len) {'+
' uint g_id = get_global_id(0);'+
' for (int i = len-1; i >= 0; i--) {'+ // walk backwards: keeps the sort stable!
' uchar key = (m1[g_id*len+i] >> digit);'+
' m2[--m3[g_id*256 + key + 256]] = m1[g_id*len+i];'+
' }'+
'}';
//---------------------------------------------------------------------------
// pick an OpenCL platform
ret := clGetPlatformIDs(1,@platform_id,@ret_num_platforms);
if CL_SUCCESS<>ret then writeln('clGetPlatformIDs Error ',ret);
// pick a GPU device and query its name
ret := clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, @device_id, @ret_num_devices);
if CL_SUCCESS<>ret then writeln('clGetDeviceIDs Error ',ret);
clGetDeviceInfo(device_id, CL_DEVICE_NAME, 0, nil, valueSize);
GetMem(s0, valueSize);
clGetDeviceInfo(device_id, CL_DEVICE_NAME, valueSize, s0, valueSize);
Writeln('DEVICE_NAME: '+s0);
FreeMem(s0);
// create a context on the device
context:= clCreateContext(nil, 1, @device_id, nil, nil, ret);
if CL_SUCCESS<>ret then writeln('clCreateContext Error ',ret);
// create a command queue
command_queue := clCreateCommandQueue(context, device_id, 0, ret);
if CL_SUCCESS<>ret then writeln('clCreateContext Error ',ret);
//-------------------------------------------------------------
kernel1 := OCL_Get_Prog(context, @device_id, 'sort1', S1);
kernel2 := OCL_Get_Prog(context, @device_id, 'sort2', S2);
kernel3 := OCL_Get_Prog(context, @device_id, 'sort3', S3);
kernel4 := OCL_Get_Prog(context, @device_id, 'sort4', S4);
//-------------------------------------------------------------
size:=256*256*16*10;
g_works := size;
l_works := 256;
Randomize;
SetLength(mem, size);
for i:=0 to size-1 do mem[i]:=random(256*256*256*256);
// create the device buffers
memobj1 := clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(cl_uint), nil, ret);
if ret<>CL_SUCCESS then writeln('clCreateBuffer1 Error ',ret);
memobj2 := clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(cl_uint), nil, ret);
if ret<>CL_SUCCESS then writeln('clCreateBuffer2 Error ',ret);
memobj3 := clCreateBuffer(context, CL_MEM_READ_WRITE, (256+256*1024) * sizeof(cl_uint), nil, ret);
if ret<>CL_SUCCESS then writeln('clCreateBuffer3 Error ',ret);
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
// upload the source data to the GPU
ret := clEnqueueWriteBuffer(command_queue, memobj1, CL_TRUE, 0, size * sizeof(cl_int), @mem[0], 0, nil, nil);
if ret<>CL_SUCCESS then writeln('clEnqueueWriteBuffer Error ',ret);
QueryPerformanceCounter(T2); // stop
Writeln('write '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.'); // upload time
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T3); // start of the in-GPU sort
for step:=0 to 3 do
begin
//-------------------------------------------------------------
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
//-------------------------------------------------------------
// stage 1 (zero the counters)
ret:= clSetKernelArg(kernel1, 0, sizeof(cl_mem), @memobj3 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_1_1 Error ',ret);
i := 256+256*1024;
ret:= clEnqueueNDRangeKernel(command_queue, kernel1, 1, nil, @i, nil, 0, nil, nil);
if ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_1 Error ',ret);
//-------------------------------------------------------------
clFinish(command_queue); // wait for the kernel to finish
QueryPerformanceCounter(T2); // stop
Writeln('step 1 '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.'); // stage 1 time
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
//-------------------------------------------------------------
// stage 2 (count byte occurrences)
ret:= clSetKernelArg(kernel2, 0, sizeof(cl_mem), @memobj1 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_2_1 Error ',ret);
ret:= clSetKernelArg(kernel2, 1, sizeof(cl_mem), @memobj3 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_2_2 Error ',ret);
j := size div (1024);
ret:= clSetKernelArg(kernel2, 2, sizeof(j), @j );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_2_3 Error ',ret);
j:=step*8;
ret:= clSetKernelArg(kernel2, 3, sizeof(j), @j );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_2_4 Error ',ret);
i := size;
ret:= clEnqueueNDRangeKernel(command_queue, kernel2, 1, nil, @i, nil, 0, nil, nil);
if ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_2 Error ',ret);
//-------------------------------------------------------------
clFinish(command_queue); // wait for the kernel to finish
QueryPerformanceCounter(T2); // stop
Writeln('step 2 '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.'); // stage 2 time
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
//-------------------------------------------------------------
// stage 3 (prefix sums)
ret:= clSetKernelArg(kernel3, 0, sizeof(cl_mem), @memobj3 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_3_1 Error ',ret);
j := size;
ret:= clSetKernelArg(kernel3, 1, sizeof(j), @j );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_3_3 Error ',ret);
j := 256;
ret:= clEnqueueNDRangeKernel(command_queue, kernel3, 1, nil, @j, @j, 0, nil, nil);
if ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_3 Error ',ret);
//-------------------------------------------------------------
clFinish(command_queue); // wait for the kernel to finish
QueryPerformanceCounter(T2); // stop
Writeln('step 3 '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.'); // stage 3 time
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
//-------------------------------------------------------------
// stage 4 (permutation)
ret:= clSetKernelArg(kernel4, 0, sizeof(cl_mem), @memobj1 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_4_1 Error ',ret);
ret:= clSetKernelArg(kernel4, 1, sizeof(cl_mem), @memobj2 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_4_2 Error ',ret);
ret:= clSetKernelArg(kernel4, 2, sizeof(cl_mem), @memobj3 );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_4_3 Error ',ret);
j:=step*8;
ret:= clSetKernelArg(kernel4, 3, sizeof(j), @j );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_4_4 Error ',ret);
j := size div (1024);
ret:= clSetKernelArg(kernel4, 4, sizeof(j), @j );
if ret<>CL_SUCCESS then writeln('clSetKernelArg_4_5 Error ',ret);
i := (1024); // one work-item per data block
ret:= clEnqueueNDRangeKernel(command_queue, kernel4, 1, nil, @i, nil, 0, nil, nil);
if ret<>CL_SUCCESS then writeln('clEnqueueNDRangeKernel_4 Error ',ret);
clFinish(command_queue);
// after the pass the data lives in memobj2; swap the buffers so that
// memobj1 again refers to the current data
exchange_memobj(memobj2, memobj1);
//-------------------------------------------------------------
clFinish(command_queue); // wait for completion
QueryPerformanceCounter(T2); // stop
Writeln('step 4 '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.'); // stage 4 time
//-------------------------------------------------------------
end;
QueryPerformanceCounter(T4); // stop of the in-GPU sort
Writeln('all not R/W '+FormatFloat('0.0000', (T4 - T3)/iCounterPerSec) + ' second.'); // total sort time, transfers excluded
QueryPerformanceFrequency(iCounterPerSec); // timer frequency
QueryPerformanceCounter(T1); // start
// download the result back from the GPU
ret:= clEnqueueReadBuffer(command_queue, memobj1, CL_TRUE, 0, size * sizeof(cl_int), @mem[0], 0, nil, nil);
if ret<>CL_SUCCESS then writeln('clEnqueueReadBuffer Error ',ret);
QueryPerformanceCounter(T2); // stop
Writeln('Read '+FormatFloat('0.0000', (T2 - T1)/iCounterPerSec) + ' second.'); // download time
// release resources
clReleaseMemObject(memobj1); // buffers
clReleaseMemObject(memobj2);
clReleaseMemObject(memobj3);
clReleaseKernel(kernel1); // kernels
clReleaseKernel(kernel2);
clReleaseKernel(kernel3);
clReleaseKernel(kernel4);
clReleaseCommandQueue(command_queue); // queue
clReleaseContext(context); // context
//-------------------------------------------------------------
SetLength(mem, 0);
readln;
end.
One more untested option remained: multithreading. Using plain C and the OpenMP library, I decided to find out what effect using several CPU cores would have.
Initially the idea was to split the original array into equal parts, hand them to separate threads, and then merge the results (merge sort). The sorting itself was decent, but the merging dragged the whole construction down: each merge is equivalent to an extra pass over the array. The result came out worse than a single thread, so I rejected that implementation as impractical.
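To put a number on it: merging t sorted chunks pairwise costs about log2(t) extra full passes over the data, so with 8 threads that is 3 extra passes on top of the 4 radix passes, nearly doubling the memory traffic.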
As a result, I applied a parallel sorting scheme very similar to the one used on the GPU. With it, things turned out much better.
Features of parallel processing:
In the current implementation I avoided thread-synchronization problems as much as possible (atomic operations are not used either, because different threads work with different, albeit sometimes neighboring, memory blocks). Threads are not free: the processor spends precious microseconds creating and synchronizing them, so keeping the barriers to a minimum saves a little time. Unfortunately, since all the threads share the same L3 cache and RAM, the overall gain grows less than linearly with the number of threads, in line with Amdahl's law.
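For reference, Amdahl's law bounds the speedup at S(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the work and N the number of threads. If, for illustration, the shared memory bus caps p at about 0.8, even 8 threads give at most 1 / (0.2 + 0.1) ≈ 3.3x, which is consistent with the modest gains described above.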
C
#include <stdio.h>
#include <stdlib.h>  // malloc/free, rand/srand
#include <string.h>  // memcpy
#include <time.h>    // time() for seeding
#include <omp.h>
//=============================================================
typedef struct TNode {
//unsigned long long key;
unsigned int key;
//unsigned short key;
//unsigned char key;
unsigned int value;
//unsigned int value1;
//unsigned int value2;
} TNode;
//=============================================================
void RSort_Parallel(TNode *m, unsigned int n)
{
// use as many threads as there are logical processors
unsigned char threads = omp_get_num_procs();
// shared temporary buffer and a 256-counter block per thread
TNode *m_temp = (TNode*)malloc(sizeof(TNode) * n);
unsigned int *s = (unsigned int*)malloc(sizeof(unsigned int) * 256 * threads);
#pragma omp parallel num_threads(threads)
{
TNode *source = m;
TNode *dest = m_temp;
unsigned int l = omp_get_thread_num();
// split [0, n) into nearly equal contiguous slices, one per thread
unsigned int div = n / omp_get_num_threads();
unsigned int mod = n % omp_get_num_threads();
unsigned int left_index = l < mod ? (div + (mod == 0 ? 0 : 1)) * l : n - (omp_get_num_threads() - l) * div;
unsigned int right_index = left_index + div - (mod > l ? 0 : 1);
for (unsigned int digit=0; digit< sizeof(m->key); digit++)
{
// per-thread buffers: s_sum for the global totals, s0 first holds this
// thread's local histogram of the current digit
unsigned int s_sum[256] = {0};
unsigned int s0[256] = {0};
unsigned char *b1 = (unsigned char*)&source[right_index].key;
unsigned char *b2 = (unsigned char*)&source[left_index].key;
while (b1 >= b2)
{
s0[*(b1+digit)]++;
b1 -= sizeof(TNode);
}
for (unsigned int i=0; i<256; i++)
{
s[i+256*l] = s0[i];
}
#pragma omp barrier
// combine the per-thread histograms: s_sum collects the totals, s0 adds
// the counts of the threads to the left (this thread's base offset)
for (unsigned int j=0; j<threads; j++)
{
for (unsigned int i=0; i<256; i++)
{
s_sum[i] += s[i+256*j];
if (j<l)
{
s0[i] += s[i+256*j];
}
}
}
// prefix sums: global bucket boundaries plus this thread's offsets
for (unsigned int i=1; i<256; i++)
{
s_sum[i] += s_sum[i-1];
s0[i] += s_sum[i-1];
}
// scatter this thread's slice into the destination (backwards, stable)
unsigned char *b = (unsigned char*)&source[right_index].key + digit;
TNode *v1 = &source[right_index];
TNode *v2 = &source[left_index];
while (v1 >= v2)
{
dest[--s0[*b]] = *v1--;
b -= sizeof(TNode);
}
#pragma omp barrier
// swap the source and destination buffers for the next digit
TNode *temp = source;
source = dest;
dest = temp;
}
}
// for an odd number of passes (1-byte key) the result is in the temporary buffer
if (sizeof(m->key)==1)
{
memcpy(m, m_temp, n * sizeof(TNode));
}
free(s);
free(m_temp);
}
//=============================================================
int main()
{
unsigned int n=10000000;
TNode *m1 = (TNode*)malloc(sizeof(TNode) * n);
srand(time(NULL));
for (unsigned int i=0; i<n; i++)
{
m1[i].key = (unsigned int)rand()*RAND_MAX + rand(); // compose a wider random key (avoids signed overflow)
}
RSort_Parallel(m1, n);
free(m1);
return 0;
}
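A build note: OpenMP support must be enabled explicitly, e.g. gcc -O2 -fopenmp (or /openmp for MSVC); otherwise the #pragma omp directives are ignored and the code runs in a single thread.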
I hope it was interesting.
Best regards, Yours truly, Rebuilder.
PS
A speed comparison of the algorithms is deliberately omitted. Comparison results, suggestions, and criticism are welcome.