# Fast incremental sort

I recently came across the need for an incremental sorting algorithm, and started to wonder how to do it optimally.

The incremental sorting problem is described here , and is an “online” version of partial sort. That is, if you have already sorted

elements, you should be able to quickly sort

elements, and so on.

#### Use Cases

Incremental sorts can be useful for a number of cases:

• You want sorted items, but you don’t know how many elements you’ll need.
• This could often happen when you are filtering the resulting sequence, and you don’t know how many items will be filtered out.
• You are streaming the sequence, so even though you want the whole sequence, you want the first elements as quickly as possible.

We’ll see how branch misprediction and other constant factors can make the naive asymptotically optimal version far slower than a practical implementation.

#### Measuring Performance

Measuring the performance of an incremental sort is a bit more involved than full sort, because for a given

we care about how fast we reach eachelement. To keep it simple, I’ll keepfixed at 10 million, and measure the time taken to reach each

element. The input range is a vector of random integers.

#### Pre-Sorting

Perhaps the simplest possible solution is pre-sorting the entire range, and then just using plain iterators to iterate over the sorted range.

``template<typename I> class pre_sorter { public:   pre_sorter(I i1, I i2) : first(i1), last(i2) {     std::sort(first, last);   }    I begin() { return first; }   I end() { return last; }  private:   I first;   I last; };``

Unsurprisingly, the performance of this algorithm is virtually independent of

, as all the work is done at the start. This is obviously, not a good solution, but it’s a nice reference point, especially whenapproaches

.

#### Partial Sorting

To avoid doing all the sorting work up front, the next logical step is trying to use something like `std::partial_sort` to do the work. The idea is to to start by partially sorting a constant small part of the range and remember how much of the range is sorted. When you run out of sorted elements, do another partial sort.

Since a partial sort must look at all the remaining elements of the range, we can’t keep partially sorting small ranges, that would lead to quadratic complexity. Instead, we multiply the number of elements to sort each time by some constant, giving us a logarithmic number of partial sorts.

``template<typename I> class partial_sorter { public:   partial_sorter(I i1, I i2) : first(i1), last(i2) {}    class iterator {   public:     ...     iterator& operator++() {       ensure_sorted_at_current();       ++current;       return *this;     }      value_type& operator*() {       ensure_sorted_at_current();       return *current;     }    private:     void ensure_sorted_at_current() {       if (current == sort_end) {         sort_end =              sort_size < end - sort_end ?             sort_end + sort_size :             end;         std::partial_sort(current, sort_end, end);         sort_size *= 2;       }     }     I current;     I sort_end;     I end;     int sort_size = 100;   };   ... };``

This method is a huge improvement for small

, but it performs worse for

1 million.

Since the `std::partial_sort` algortithm costs

, thecalls to `std::partial_sort` has a total worst case run time of

. This means that in total we are doing more work than we should, time to fix this.

#### Incremental Quicksort

In a paper from 2006, Paredes and Navarro describes an algorithm which uses an incremental version of quicksort to achieve the optimal bound of

for incremental sorting.

This method keeps a stack of partition points for the range. Each partition point is an iterator to a position in the range where everything before the position is smaller than everything after. The stack is sorted such that the top of the stack is the position closest to the start of the range.

This stack has the wonderful property that when you are looking for the next element in the sorted range, and your current position is not equal to the top of the stack, the next element must be between the current position and the top of the stack.

If the distance to the next partition point is large, you cut it in half with `std::partition` , and push the new partition point on the stack. If the distance is small, we can sort the small range with `std::sort` , and pop the element from the stack. A simple version is shown here:

``template<typename I> class simple_quick_sorter { public:   simple_quick_sorter(I i1, I i2) : first(i1), last(i2) {}    class iterator {   public:     ...     iterator& operator++() {       ensure_sorted_at_current();       ++current;       return *this;     }      value_type& operator*() {       ensure_sorted_at_current();       return *current;     }    private:     void ensure_sorted_at_current() {       if (current == sort_end) {         while (stack.back() - sort_end > sort_limit) {           auto range_size = stack.back() - current;           value_type pivot = *(current + (mt() % range_size));           auto it = std::partition(             current,             stack.back(),             [=](const value_type& v) { return v < pivot; });           while (it == current) {             pivot = *(current + (mt() % range_size));             it = std::partition(               current,               stack.back(),               [=](const value_type& v) { return v < pivot; });           }           stack.push_back(it);         }          std::sort(sort_end, stack.back());         sort_end = stack.back();         stack.pop_back();       }     }     I begin;     I current;     I sort_end;     std::vector<I> stack;     static const int sort_limit = 100;     std::mt19937 mt{ std::random_device{}() };   };   ... };``

Looking at the results for this algorithm, it’s clear that the total amount of work done when

is now optimal, because the time to sort the whole range is now as fast as a single call to `std::sort` . This has the benefit that you don’t need to worry about switching over to `std::sort`

if you know that you’ll need a large part of the result, you can just use the incremental version anyway.

The downside of this algorithm, though, is that it does spend considerably more time than `std::partial_sort` coming up with the first few elements. If I need between 1 and 5000 elements, partial sorting is still the way to go by the looks of it. Why is that?

Number of comparisons

Even though `std::partial_sort` is only guaranteed to have

, a typical implementation will have closer tocomparisons for smalland random data. This is because the impementation typically makes a max heap of the first

elements, and then walks through the rest of the range, inserting elements if they are smaller than the current max.

When

is small, the elements in the heap will quickly become much smaller on average than most elements in the rest of the range, so insertions into the heap become less frequent as the range is processed. That means that for most of the elements in the range, the algorithm will just do a single comparison to verify that the element should not be in heap.

The incremental quick sort implementation, on the other hand, does a few more comparisons to find the first element. The first partition operation does

comparisons to cut the range in two, the next partition takescomparisons to cut the first half down to a quarter, and so on. In total, it does about

comparisons to find the first element, twice that of a good partial sort.

The difference in performance is much more than the difference in comparisons can explain, however, which brings us to the next point.

Branch Mispredictions

Few comparisons is not the only benefit of partial sort, it also has the benefit that most of the comparisons gives the same result. This gives a very fast inner loop, because it avoids branch misprediction .

For a partition operation that cuts the range in half, the number of branch mispredictions is at a peak, about every other comparison will be a misprediction. To verify that this was a big contributing factor, I tested `std::partition` with various pivot elements, that cut the range at various percentiles, to see the effect. I also compared with two hand-written loops (one normal branching and one branchless) to see that effect added up.

We see that partition speed is heavily impacted by the branch mispredictions, it seems to account for up to

difference in speed.

#### Skewed Incremental Quick Sort

After looking at the branch misprediction measurements, I realized that a balanced partition operation could not be the basis for an optimal incremental sort, but a skewed one could. If we skew the first partitions towards the beginning of the range, we get both fewer comparisons and branch mispredictions. The skew comes with a cost, however, as there is more work to do in total if we consistently get very skewed partitions. To counteract this effect, only the first partitions should be skewed. As we iterate further into the range, the partitions should be made less skewed, so we get about the right amount of work.

In the implementation of skewed incremental quicksort, I skewed the partitions by sampling a large number of pivots, sorting them, and choosing the pivot corresponding to the desired amount of skew.

``value_type get_pivot(int num_pivots, int pivot_idx) {   assert(num_pivots <= 100);   std::array<value_type, 100> pivots;   for (int i = 0; i < num_pivots; ++i) {  pivots[i] = *(current + (mt() % (stack.back() - current)));   }    std::sort(pivots.begin(), pivots.begin() + num_pivots);   return pivots[pivot_idx]; }  value_type get_pivot() {   auto range_size = stack.back() - current;    if (range_size < 1000) {  return get_pivot(3, 1);   }    double ratio =      (current - begin) / static_cast<double>(stack.back() - begin);    if (range_size < 10'000) {  int idx = ratio * 10;  idx = std::min(idx, 5);  idx = std::max(idx, 1);  return get_pivot(10, idx);   }      int idx = ratio * 10;   idx = std::min(idx, 50);   idx = std::max(idx, 1);   return get_pivot(100, idx); }``

#### Summary

By combining the incremental quick sort algorithm with smart pivot selection, we can get an incremental sorting algorithm that is as fast as partial sort for small

, and as fast as full sort whenapproaches

.

This is my first real blog post, so comments and suggestions are welcome!