Introduction
This article describes a stable non-recursive adaptive merge sort called quadsort.
Quadruple exchange
Quadsort is based on a quadruple exchange. Traditionally, most sorting algorithms are based on binary exchange, where two variables are sorted using a third temporary variable. It usually looks like this:
if (val[0] > val[1])
{
tmp[0] = val[0];
val[0] = val[1];
val[1] = tmp[0];
}
In a quadruple exchange, sorting occurs using four substitution variables (swap). In the first step, the four variables are partially sorted into four swap variables, in the second step, they are completely sorted back into the four original variables.
This process is shown in the diagram above.
After the first round of sorting, one if check determines if the four swap variables are sorted in order. If so, the exchange ends immediately. It then checks to see if the swap variables are sorted in reverse order. If so, the sort ends immediately. If both tests fail, then the final location is known and two tests remain to determine the final order.
This eliminates one unnecessary comparison for sequences that are in order. At the same time, one additional comparison is created for random sequences. However, in the real world, we rarely compare truly random data. So when the data is ordered rather than messy, this shift in probability is beneficial.
There is also an overall performance improvement due to the elimination of wasteful swap. In C, a basic quadruple swap looks like this:
if (val[0] > val[1])
{
tmp[0] = val[1];
tmp[1] = val[0];
}
else
{
tmp[0] = val[0];
tmp[1] = val[1];
}
if (val[2] > val[3])
{
tmp[2] = val[3];
tmp[3] = val[2];
}
else
{
tmp[2] = val[2];
tmp[3] = val[3];
}
if (tmp[1] <= tmp[2])
{
val[0] = tmp[0];
val[1] = tmp[1];
val[2] = tmp[2];
val[3] = tmp[3];
}
else if (tmp[0] > tmp[3])
{
val[0] = tmp[2];
val[1] = tmp[3];
val[2] = tmp[0];
val[3] = tmp[1];
}
else
{
if (tmp[0] <= tmp[2])
{
val[0] = tmp[0];
val[1] = tmp[2];
}
else
{
val[0] = tmp[2];
val[1] = tmp[0];
}
if (tmp[1] <= tmp[3])
{
val[2] = tmp[1];
val[3] = tmp[3];
}
else
{
val[2] = tmp[3];
val[3] = tmp[1];
}
}
In case the array cannot be perfectly divided into four parts, the tail of 1-3 elements is sorted using the traditional binary exchange.
The quadruple exchange described above is implemented in quadsort.
Quadruple merge
At the first stage, the quadruple exchange pre-sorts the array into four-element blocks, as described above.
The second stage uses a similar approach for detecting ordered and reversed orders, but in blocks of 4, 16, 64, or more elements, this stage is treated like a traditional merge sort.
This can be represented as follows:
main memory: AAAA BBBB CCCC DDDD
swap memory: ABABABAB CDCDCDCD
main memory: ABCDABCDABCDABCD
The first line uses a quadruple swap to create four blocks of four sorted elements each. The second line uses a quad merge to combine into two blocks of eight sorted items each in swap memory. In the last line, the blocks are merged back into main memory, and we are left with one block of 16 sorted elements. Below is the visualization.
These operations actually require doubling the swap memory. More on this later.
Skip
Another difference is that due to the increased cost of merge operations, it is beneficial to check if four blocks are in forward or reverse order.
In case the four blocks are in order, the merge operation is skipped, since it is meaningless. However, this requires an additional if check, and for randomly sorted data, this if check becomes increasingly unlikely as the block size increases. Fortunately, the frequency of this if check is quadrupled each cycle, while the potential benefit quadruples each cycle.
In case the four blocks are in reverse order, a stable in-place swap is performed.
In the event that only two of the four blocks are ordered or are in reverse order, comparisons in the merge itself are unnecessary and are subsequently omitted. The data still needs to be copied into swap memory, but this is a less computational procedure.
This allows the quadsort algorithm to sort sequences in forward and backward order using
n
comparisons instead of n * log n
comparisons.
Boundary checks
Another problem with traditional merge sort is that it wastes resources on unnecessary boundary checks. It looks like this:
while (a <= a_max && b <= b_max)
if (a <= b)
[insert a++]
else
[insert b++]
For optimization, our algorithm compares the last element of sequence A with the last element of sequence B. If the last element of sequence A is less than the last element of sequence B, then we know that the test
b < b_max
will always be false, because sequence A is the first to completely merge.
Likewise, if the last element of sequence A is greater than the last element of sequence B, we know that the test
a < a_max
will always be false. It looks like this:
if (val[a_max] <= val[b_max])
while (a <= a_max)
{
while (a > b)
[insert b++]
[insert a++]
}
else
while (b <= b_max)
{
while (a <= b)
[insert a++]
[insert b++]
}
Fusion tail
When you sort a 65 element array, you end up with a 64 element sorted array and a one element sorted array at the end. This will not result in an additional swap (exchange) operation if the whole sequence is in order. In any case, for this, the program must choose the optimal array size (64, 256 or 1024).
Another problem is that the sub-optimal size of the array leads to unnecessary exchange operations. To work around these two problems, the quadruple merge procedure is aborted when the block size reaches 1/8 of the array size and the rest of the array is sorted using tail merge. The main advantage of tail fusion is that it allows the quadsort swap to be reduced to n / 2 without significantly impacting performance.
Big O
Name | Best | Middle | Worst | Stable | Memory |
---|---|---|---|---|---|
quadsort | n | n log n | n log n | Yes | n |
When the data is already sorted or sorted in reverse order, quadsort makes n comparisons.
Cache
Since quadsort uses n / 2 swaps, cache usage is not as ideal as in-place sort. However, sorting the random data in place results in sub-optimal exchanges. Based on my benchmarks, quadsort is always faster than in-place sort, as long as the array does not overflow the L1 cache, which can be as high as 64KB on modern processors.
Wolfsort
Wolfsort is a hybrid radixsort / quadsort sorting algorithm that improves performance when working with random data. This is basically a proof of concept that only works with unsigned 32 and 64 bit integers.
Visualization
The visualization below runs four tests. The first test is based on a random distribution, the second on an ascending distribution, the third on a descending distribution, and the fourth on an ascending distribution with a random tail.
The top half shows swap and the bottom half shows main memory. Colors vary for skip, swap, merge and copy operations.
Benchmark: quadsort, std :: stable_sort, timsort, pdqsort and wolfsort
The following benchmark was run on WSL gcc configuration version 7.4.0 (Ubuntu 7.4.0-1ubuntu1 ~ 18.04.1). The source code is compiled by the team
g++ -O3 quadsort.cpp
. Each test is run a hundred times and only the best run is reported.
The abscissa is the execution time.
Quadsort datasheet, std :: stable_sort, timsort, pdqsort and wolfsort
quadsort | 1000000 | i32 | 0.070469 | 0.070635 | ||
stablesort | 1000000 | i32 | 0.073865 | 0.074078 | ||
timsort | 1000000 | i32 | 0.089192 | 0.089301 | ||
pdqsort | 1000000 | i32 | 0.029911 | 0.029948 | ||
wolfsort | 1000000 | i32 | 0.017359 | 0.017744 | ||
quadsort | 1000000 | i32 | 0.000485 | 0.000511 | ||
stablesort | 1000000 | i32 | 0.010188 | 0.010261 | ||
timsort | 1000000 | i32 | 0.000331 | 0.000342 | ||
pdqsort | 1000000 | i32 | 0.000863 | 0.000875 | ||
wolfsort | 1000000 | i32 | 0.000539 | 0.000579 | ||
quadsort | 1000000 | i32 | 0.008901 | 0.008934 | () | |
stablesort | 1000000 | i32 | 0.017093 | 0.017275 | () | |
timsort | 1000000 | i32 | 0.008615 | 0.008674 | () | |
pdqsort | 1000000 | i32 | 0.047785 | 0.047921 | () | |
wolfsort | 1000000 | i32 | 0.012490 | 0.012554 | () | |
quadsort | 1000000 | i32 | 0.038260 | 0.038343 | ||
stablesort | 1000000 | i32 | 0.042292 | 0.042388 | ||
timsort | 1000000 | i32 | 0.055855 | 0.055967 | ||
pdqsort | 1000000 | i32 | 0.008093 | 0.008168 | ||
wolfsort | 1000000 | i32 | 0.038320 | 0.038417 | ||
quadsort | 1000000 | i32 | 0.000559 | 0.000576 | ||
stablesort | 1000000 | i32 | 0.010343 | 0.010438 | ||
timsort | 1000000 | i32 | 0.000891 | 0.000900 | ||
pdqsort | 1000000 | i32 | 0.001882 | 0.001897 | ||
wolfsort | 1000000 | i32 | 0.000604 | 0.000625 | ||
quadsort | 1000000 | i32 | 0.006174 | 0.006245 | ||
stablesort | 1000000 | i32 | 0.014679 | 0.014767 | ||
timsort | 1000000 | i32 | 0.006419 | 0.006468 | ||
pdqsort | 1000000 | i32 | 0.016603 | 0.016629 | ||
wolfsort | 1000000 | i32 | 0.006264 | 0.006329 | ||
quadsort | 1000000 | i32 | 0.018675 | 0.018729 | ||
stablesort | 1000000 | i32 | 0.026384 | 0.026508 | ||
timsort | 1000000 | i32 | 0.023226 | 0.023395 | ||
pdqsort | 1000000 | i32 | 0.028599 | 0.028674 | ||
wolfsort | 1000000 | i32 | 0.009517 | 0.009631 | ||
quadsort | 1000000 | i32 | 0.037593 | 0.037679 | ||
stablesort | 1000000 | i32 | 0.043755 | 0.043899 | ||
timsort | 1000000 | i32 | 0.047008 | 0.047112 | ||
pdqsort | 1000000 | i32 | 0.029800 | 0.029847 | ||
wolfsort | 1000000 | i32 | 0.013238 | 0.013355 | ||
quadsort | 1000000 | i32 | 0.009605 | 0.009673 | ||
stablesort | 1000000 | i32 | 0.013667 | 0.013785 | ||
timsort | 1000000 | i32 | 0.007994 | 0.008138 | ||
pdqsort | 1000000 | i32 | 0.024683 | 0.024727 | ||
wolfsort | 1000000 | i32 | 0.009642 | 0.009709 |
It should be noted that pdqsort is not a stable sort, so it works faster with data in normal order (general order).
The following benchmark was run on WSL gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1 ~ 18.04.1). The source code is compiled by the team
g++ -O3 quadsort.cpp
. Each test is run a hundred times and only the best run is reported. It measures performance on arrays ranging from 675 to 100,000 elements.
The X-axis of the graph below is the cardinality, the Y-axis is the execution time.
Quadsort datasheet, std :: stable_sort, timsort, pdqsort and wolfsort
quadsort | 100000 | i32 | 0.005800 | 0.005835 | ||
stablesort | 100000 | i32 | 0.006092 | 0.006122 | ||
timsort | 100000 | i32 | 0.007605 | 0.007656 | ||
pdqsort | 100000 | i32 | 0.002648 | 0.002670 | ||
wolfsort | 100000 | i32 | 0.001148 | 0.001168 | ||
quadsort | 10000 | i32 | 0.004568 | 0.004593 | ||
stablesort | 10000 | i32 | 0.004813 | 0.004923 | ||
timsort | 10000 | i32 | 0.006326 | 0.006376 | ||
pdqsort | 10000 | i32 | 0.002345 | 0.002373 | ||
wolfsort | 10000 | i32 | 0.001256 | 0.001274 | ||
quadsort | 5000 | i32 | 0.004076 | 0.004086 | ||
stablesort | 5000 | i32 | 0.004394 | 0.004420 | ||
timsort | 5000 | i32 | 0.005901 | 0.005938 | ||
pdqsort | 5000 | i32 | 0.002093 | 0.002107 | ||
wolfsort | 5000 | i32 | 0.000968 | 0.001086 | ||
quadsort | 2500 | i32 | 0.003196 | 0.003303 | ||
stablesort | 2500 | i32 | 0.003801 | 0.003942 | ||
timsort | 2500 | i32 | 0.005296 | 0.005322 | ||
pdqsort | 2500 | i32 | 0.001606 | 0.001661 | ||
wolfsort | 2500 | i32 | 0.000509 | 0.000520 | ||
quadsort | 1250 | i32 | 0.001767 | 0.001823 | ||
stablesort | 1250 | i32 | 0.002812 | 0.002887 | ||
timsort | 1250 | i32 | 0.003789 | 0.003865 | ||
pdqsort | 1250 | i32 | 0.001036 | 0.001325 | ||
wolfsort | 1250 | i32 | 0.000534 | 0.000655 | ||
quadsort | 675 | i32 | 0.000368 | 0.000592 | ||
stablesort | 675 | i32 | 0.000974 | 0.001160 | ||
timsort | 675 | i32 | 0.000896 | 0.000948 | ||
pdqsort | 675 | i32 | 0.000489 | 0.000531 | ||
wolfsort | 675 | i32 | 0.000283 | 0.000308 |
Benchmark: quadsort and qsort (mergesort)
The next benchmark was run on WSL gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1 ~ 18.04.1). The source code is compiled by the team
g++ -O3 quadsort.cpp
. Each test is run a hundred times and only the best run is reported.
quadsort: sorted 1000000 i32s in 0.119437 seconds. MO: 19308657 ( )
qsort: sorted 1000000 i32s in 0.133077 seconds. MO: 21083741 ( )
quadsort: sorted 1000000 i32s in 0.002071 seconds. MO: 999999 ()
qsort: sorted 1000000 i32s in 0.007265 seconds. MO: 3000004 ()
quadsort: sorted 1000000 i32s in 0.019239 seconds. MO: 4007580 ( )
qsort: sorted 1000000 i32s in 0.071322 seconds. MO: 20665677 ( )
quadsort: sorted 1000000 i32s in 0.076605 seconds. MO: 19242642 ( )
qsort: sorted 1000000 i32s in 0.038389 seconds. MO: 6221917 ( )
quadsort: sorted 1000000 i32s in 0.002305 seconds. MO: 999999 ()
qsort: sorted 1000000 i32s in 0.009659 seconds. MO: 4000015 ()
quadsort: sorted 1000000 i32s in 0.022940 seconds. MO: 9519209 ( )
qsort: sorted 1000000 i32s in 0.042135 seconds. MO: 13152042 ( )
quadsort: sorted 1000000 i32s in 0.034456 seconds. MO: 6787656 ( )
qsort: sorted 1000000 i32s in 0.098691 seconds. MO: 20584424 ( )
quadsort: sorted 1000000 i32s in 0.066240 seconds. MO: 11383441 ( )
qsort: sorted 1000000 i32s in 0.118086 seconds. MO: 20572142 ( )
quadsort: sorted 1000000 i32s in 0.038116 seconds. MO: 15328606 ( )
qsort: sorted 1000000 i32s in 4.471858 seconds. MO: 1974047339 ( )
quadsort: sorted 1000000 i32s in 0.060989 seconds. MO: 15328606 ()
qsort: sorted 1000000 i32s in 0.043175 seconds. MO: 10333679 ()
quadsort: sorted 1023 i32s in 0.016126 seconds. (random 1-1023)
qsort: sorted 1023 i32s in 0.033132 seconds. (random 1-1023)
MO indicates the number of comparisons performed on 1 million items.
The benchmark above compares quadsort to glibc qsort () through the same general purpose interface and without any known unfair advantages such as inlining.
random order: 635, 202, 47, 229, etc
ascending order: 1, 2, 3, 4, etc
uniform order: 1, 1, 1, 1, etc
descending order: 999, 998, 997, 996, etc
wave order: 100, 1, 102, 2, 103, 3, etc
stable/unstable: 100, 1, 102, 1, 103, 1, etc
random range: time to sort 1000 arrays ranging from size 0 to 999 containing random data
Benchmark: quadsort and qsort (quicksort)
This particular test is done using Cygwin's qsort () implementation, which uses quicksort under the hood. The source code is compiled by the team
gcc -O3 quadsort.c
. Each test is executed a hundred times, only the best run is reported.
quadsort: sorted 1000000 i32s in 0.119437 seconds. MO: 19308657 ( )
qsort: sorted 1000000 i32s in 0.133077 seconds. MO: 21083741 ( )
quadsort: sorted 1000000 i32s in 0.002071 seconds. MO: 999999 ()
qsort: sorted 1000000 i32s in 0.007265 seconds. MO: 3000004 ()
quadsort: sorted 1000000 i32s in 0.019239 seconds. MO: 4007580 ( )
qsort: sorted 1000000 i32s in 0.071322 seconds. MO: 20665677 ( )
quadsort: sorted 1000000 i32s in 0.076605 seconds. MO: 19242642 ( )
qsort: sorted 1000000 i32s in 0.038389 seconds. MO: 6221917 ( )
quadsort: sorted 1000000 i32s in 0.002305 seconds. MO: 999999 ()
qsort: sorted 1000000 i32s in 0.009659 seconds. MO: 4000015 ()
quadsort: sorted 1000000 i32s in 0.022940 seconds. MO: 9519209 ( )
qsort: sorted 1000000 i32s in 0.042135 seconds. MO: 13152042 ( )
quadsort: sorted 1000000 i32s in 0.034456 seconds. MO: 6787656 ( )
qsort: sorted 1000000 i32s in 0.098691 seconds. MO: 20584424 ( )
quadsort: sorted 1000000 i32s in 0.066240 seconds. MO: 11383441 ( )
qsort: sorted 1000000 i32s in 0.118086 seconds. MO: 20572142 ( )
quadsort: sorted 1000000 i32s in 0.038116 seconds. MO: 15328606 ( )
qsort: sorted 1000000 i32s in 4.471858 seconds. MO: 1974047339 ( )
quadsort: sorted 1000000 i32s in 0.060989 seconds. MO: 15328606 ()
qsort: sorted 1000000 i32s in 0.043175 seconds. MO: 10333679 ()
quadsort: sorted 1023 i32s in 0.016126 seconds. (random 1-1023)
qsort: sorted 1023 i32s in 0.033132 seconds. (random 1-1023)
In this benchmark, it becomes clear why quicksort is often preferable to traditional merge, it does fewer comparisons for data in ascending order, uniformly distributed, and for data in descending order. However, in most tests it performed worse than quadsort. Quicksort also exhibits an extremely slow sorting speed for wave-ordered data. The random data test shows that quadsort is more than twice as fast when sorting small arrays.
Quicksort is faster on general purpose and stable tests, but only because it is an unstable algorithm.
See also: