The price of naturalness, or how to overtake QuickSort

Sorting is as "eternal" a topic for algorithmists as love is for poets. It would seem hard to say anything new in this area, and yet new sorting algorithms keep appearing (Timsort, for example). There are, however, basic facts that every decent student knows. It is known, for example, that a universal comparison-based sorting algorithm cannot be faster than O(n*log(n)): it must distinguish between n! possible orderings, which requires at least log2(n!) ≈ n*log2(n) comparisons. This bound is achieved by the famous QuickSort (invented by Hoare in 1960), as well as by merge sort (von Neumann) and heapsort. As for the elementary algorithms (bubble, insertion, and selection sort), their bound is much worse: O(n^2). But is QuickSort always the undisputed champion?



Indeed, besides asymptotic performance there are other characteristics, often just as important. One of them is naturalness. What is it? A sort is called natural if it runs faster when the array is almost sorted. And what array can be considered "almost sorted"? Here are two arrays of the same elements:



{1,2,9,3,4,5,7,6,8,10} and {9,1,6,3,10,5,4,2,8,7}



Even by eye you can see that the first array is more ordered (only 9 and 7 are "out of place"), whereas in the second array the numbers are thoroughly mixed. What is the measure of orderliness? The answer is well known: the number of inversions. A pair of elements A[i] and A[j] with j > i forms an inversion if A[j] < A[i]. By this measure the first array above has 7 inversions, while the second has 22. (In this note, the purpose of sorting is always ascending order.)
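Counting inversions head-on means checking all pairs, which takes O(n^2) comparisons; that is exactly what the NumInv function in the test code at the end of this note does, and for these experiments it is quite sufficient. For really large arrays one can count inversions in O(n*log(n)) as a by-product of merge sort. Here is a minimal VBA sketch of that idea (my illustration only: NumInvFast and MergeCount are made-up names and are not part of the test code):

'::: Counting inversions in O(n*log(n)) via merge sort
'::: (illustration only, not used in the experiments below)

Function NumInvFast(A() As Long) As Double
    Dim B() As Long, T() As Long
    n& = UBound(A, 1)
    ReDim B(1 To n&) As Long
    ReDim T(1 To n&) As Long
    For i& = 1 To n&              ' work on a copy, keep A intact
        B(i&) = A(i&)
    Next i&
    NumInvFast = MergeCount(B, T, 1, n&)
End Function

Function MergeCount(B() As Long, T() As Long, lo As Long, hi As Long) As Double
    If lo >= hi Then Exit Function          ' a single element: 0 inversions
    m& = (lo + hi) \ 2
    c# = MergeCount(B, T, lo, m&) + MergeCount(B, T, m& + 1, hi)
    i& = lo
    j& = m& + 1
    For k& = lo To hi                       ' merge the two sorted halves
        If i& > m& Then                     ' left half exhausted
            T(k&) = B(j&): j& = j& + 1
        ElseIf j& > hi Then                 ' right half exhausted
            T(k&) = B(i&): i& = i& + 1
        ElseIf B(i&) <= B(j&) Then
            T(k&) = B(i&): i& = i& + 1
        Else                                ' B(j&) overtakes every element
            c# = c# + (m& - i& + 1)         ' still in the left half: each
            T(k&) = B(j&): j& = j& + 1      ' such pair is one inversion
        End If
    Next k&
    For k& = lo To hi
        B(k&) = T(k&)
    Next k&
    MergeCount = c#
End Function

The key observation is in the merge step: when an element of the right half is placed before the remaining elements of the left half, it forms exactly one inversion with each of them.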



Now we can give a precise definition: a sort is called natural if its running time decreases as the number of inversions in the input array decreases.

Naturalness is a rather "rare fruit" in the world of sorting: neither QuickSort nor Shell sort has this property, alas. But there is one algorithm that is, one might say, absolutely natural: insertion sort. Although this algorithm is known to every cultured person, let me briefly recall its essence.



The array to be sorted is scanned once from beginning to end. As soon as the i-th and (i-1)-th elements are found to form an inversion, the i-th element is "thrown" back into place (which is achieved by shifting the appropriate segment at the beginning of the array one position to the right). It is clear from this description that if the array contains few inversions, the algorithm "flies" through it very quickly. If there are no inversions at all, the algorithm finishes in O(n) time, while QuickSort in the same case will be tediously selecting pivots, partitioning the already ordered array into segments, and so on. With a large number of inversions the situation is, of course, reversed: insertion sort degrades to O(n^2), and QuickSort is the champion at O(n*log(n)).
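To see it at work on the first example array, here is a tiny driver (my own addition, not part of the test listing; JSort is the insertion sort routine from the full listing at the end of this note):

Sub DemoJSort()
    Dim A(1 To 10) As Long
    Dim V As Variant, i As Long
    V = Array(1, 2, 9, 3, 4, 5, 7, 6, 8, 10)   ' the array with 7 inversions
    For i = 1 To 10
        A(i) = V(LBound(V) + i - 1)            ' copy into a Long array
    Next i
    JSort A
    For i = 1 To 10
        Debug.Print A(i);                      ' prints 1 2 3 4 5 6 7 8 9 10
    Next i
    Debug.Print
End Sub

Note that each one-position shift removes exactly one inversion, so the total number of element shifts equals the number of inversions (7 here). This is precisely why insertion sort is natural.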



This raises what is, to my mind, a natural question: up to how many inversions does naturalness win out, so that insertion sort works faster than QuickSort?



To answer this question, I ran a series of computational experiments. Their essence was as follows. Arrays of integers from 3,000 to 30,000 elements were taken, a certain number of inversions was introduced into them, and then the arrays were sorted by insertion sort and by QuickSort. The sorting time was measured in system ticks. For averaging, each sort was repeated 10 times. (The complete test code is given at the end of this note.) The results were as follows:



[Figure: sorting time versus the number of introduced inversions]



The figure shows how the sorting time depends on the number of introduced inversions. The crimson series is, of course, QuickSort, and the blue series is insertion sort. For an array of 30 thousand elements, the "natural" sort wins up to roughly 400 thousand inversions; for less ordered arrays, QuickSort is already preferable.



The next figure shows the empirical dependence of this critical number of inversions on the size of the array.



[Figure: critical number of inversions versus array size]



It is easy to see that for an array of size n the critical number of inversions is approximately 10*n; with more inversions, QuickSort wins. Note that the maximum possible number of inversions in an array of length n is n*(n-1)/2: for n = 30,000 that is about 450 million, of which 10*n = 300,000 is less than 0.07%. So the critical value is a tiny fraction of the maximum, which is not surprising: insertion sort performs roughly n + I elementary operations on an array with I inversions, so it can only keep up with QuickSort while I stays on the order of n*log(n).



To what has been said, it remains to add that the results of such experiments depend on many factors (the programming language, the type of the data being sorted, and so on). To be honest, I was too lazy to investigate these issues in more detail. My experiments were done in VBA under Microsoft Excel:



Test source code
' Millisecond tick counter from the Windows API
' (on 64-bit Office this must be declared with PtrSafe:
'  Private Declare PtrSafe Function GetTickCount Lib "kernel32" () As Long)
Private Declare Function GetTickCount Lib "kernel32" () As Long

'::: Insertion sort

Sub JSort(A() As Long)
    n& = UBound(A, 1)
    For i& = 2 To n&
        If A(i&) < A(i& - 1) Then      ' inversion at (i-1, i)
           j& = i& - 1
           x& = A(i&)
           Do While (A(j&) > x&)       ' shift larger elements right
              A(j& + 1) = A(j&)
              j& = j& - 1
              If j& = 0 Then Exit Do
           Loop
           A(j& + 1) = x&              ' drop x into its place
        End If
    Next i&
End Sub

'::: Quick sort (Hoare)

Sub QSort(A() As Long, Optional b As Long = 1, Optional e As Long = 0)
    If e = 0 Then e = UBound(A, 1)     ' default: sort the whole array
    If e <= b Then Exit Sub
    i& = b
    j& = e
    w& = A((i& + j&) \ 2)              ' middle element as pivot
    Do
      Do While (A(i&) < w&)
         i& = i& + 1
      Loop
      Do While (A(j&) > w&)
         j& = j& - 1
      Loop
      If i& <= j& Then
         tmp& = A(i&)
         A(i&) = A(j&)
         A(j&) = tmp&
         i& = i& + 1
         j& = j& - 1
      End If
    Loop While (i& <= j&)
    If j& > b Then QSort A, b, j&      ' left segment
    If i& < e Then QSort A, i&, e      ' right segment
End Sub

'::: Introduce inversions (random swaps)

Sub InsInv(A() As Long, k As Long)
    n& = UBound(A, 1)
    For i& = 1 To k
        Do  ' pick two distinct valid indices (assigning n& * Rnd
            ' to a Long rounds, so the raw value may be 0)
           k1& = n& * Rnd
           k2& = n& * Rnd
           If (k1& <> k2&) And (k1& >= 1) And (k2& >= 1) Then Exit Do
        Loop
        tmp& = A(k1&)       ' swap the two elements; note that one
        A(k1&) = A(k2&)     ' swap may create many inversions, which
        A(k2&) = tmp&       ' is why NumInv recounts them afterwards
    Next i&
End Sub

'::: Count the number of inversions

Function NumInv(A() As Long) As Long
    ' straightforward O(n^2) pairwise count
    n& = UBound(A, 1)
    For i& = 1 To n& - 1
        For j& = i& + 1 To n&
            If A(j&) < A(i&) Then NumInv = NumInv + 1
        Next j&
    Next i&
End Function
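
'::: Ascending-order check (Gtest_1 below calls check, but the
'::: original listing omitted it; this is a minimal reconstruction)

Function check(A() As Long) As Boolean
    n& = UBound(A, 1)
    For i& = 2 To n&
        If A(i&) < A(i& - 1) Then
            MsgBox "Sorting error at position " & i&
            Exit Function       ' returns False
        End If
    Next i&
    check = True
End Function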

'::: Main test

Sub Gtest_1()
Dim Eta() As Long
Dim Arr() As Long
    Size& = CLng(InputBox("Sz="))
    ReDim Eta(1 To Size&) As Long
    ReDim Arr(1 To Size&) As Long
    Randomize
    For i& = 1 To Size&                   ' start from a fully sorted array
        Eta(i&) = i&
    Next i&
    q# = Size& * (Size& - 1) / 2          ' maximum possible inversions
    For iii% = 1 To 10
        InsInv Eta, CLng(iii%)            ' swaps accumulate from pass to pass,
        ni# = CDbl(NumInv(Eta))           ' so count the actual inversions
        Cells(iii%, 1).Value = ni#
        DoEvents
        S# = 0
        For l% = 1 To 10                  ' 10 runs of insertion sort
            For i& = 1 To Size&           ' fresh copy each time
                Arr(i&) = Eta(i&)
            Next i&
            TBeg& = GetTickCount
            JSort Arr
            TEnd& = GetTickCount
            S# = S# + TEnd& - TBeg&
        Next l%
        Cells(iii%, 2).Value = S#
        DoEvents
        S# = 0
        For l% = 1 To 10                  ' 10 runs of quick sort
            For i& = 1 To Size&
                Arr(i&) = Eta(i&)
            Next i&
            TBeg& = GetTickCount
            QSort Arr, 1, Size&
            TEnd& = GetTickCount
            S# = S# + TEnd& - TBeg&
            If Not check(Arr) Then Exit Sub
        Next l%
        Cells(iii%, 3).Value = S#
        DoEvents
    Next iii%
    MsgBox "OK"
End Sub




Thanks for your attention!



PS

And thanks to everyone who pointed out my mistakes (the incorrect spelling of Timsort - the first edition had "TeamSort" - and the missing "i" in QuickSort). And yes, especially for the perfectionists: QuickSort can "degrade" to quadratic time. But in this post I am considering not the worst case for QuickSort, but the best one.


