Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3357

Sorter's quicksort implementation is very suboptimal for duplicate keys

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Later
    • Impala 2.6.0
    • None
    • Backend

    Description

      void Sorter::TupleSorter::SortHelper(TupleIterator first, TupleIterator last) {
        if (UNLIKELY(state_->is_cancelled())) return;
        // Use insertion sort for smaller sequences.
        while (last.index_ - first.index_ > INSERTION_THRESHOLD) {
          TupleIterator iter(this, first.index_ + (last.index_ - first.index_) / 2);
          DCHECK(iter.current_tuple_ != NULL);
          // Partition() splits the tuples in [first, last) into two groups (<= pivot
          // and >= pivot) in-place. 'cut' is the index of the first tuple in the second group.
          TupleIterator cut = Partition(first, last,
              reinterpret_cast<Tuple*>(iter.current_tuple_));
          SortHelper(cut, last);
          last = cut;
          if (UNLIKELY(state_->is_cancelled())) return;
        }
      
        InsertionSort(first, last);
      }
      

      The quicksort implementation in the sorter is based on dividing the input into two partitions: <= pivot and >= pivot.

      If all of the input values in a partition are equal, then it will still recursively divide and do insertion sort on the values. We could change the sorter to partition the input into three partitions: <, == and >. Then it doesn't need to recurse on the middle partition. This would mean it could sort a partition full of duplicate values in a single pass over the input.

      Attachments

        Issue Links

          Activity

            People

              norbertluksa Norbert Luksa
              tarmstrong Tim Armstrong
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: