Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-5308

Prefer MergeSort over QuickSort and try native TimSort first (with explanation)

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 2.0.24
    • 2.0.25, 3.0.0 PDFBox
    • Text extraction
    • None

    Description

      Hello!

      I have 2 related proposals for improvement (so I am logging as one ticket, I am happy to split them if requested, however).

      1)

      Propose to use an iterative MergeSort implementation over an iterative QuickSort implementation in TextPosition sorting during text extraction.

      The reason: while QuickSort and MergeSort generally perform O(NlogN), and while MergeSort uses slightly more space (O(N)), QuickSort's worst-case performance is O(n2) and this case will, unfortunately, happen in (arguably) very common cases during text extraction: when the list is already sorted (or nearly sorted). Many PDFs will already stream text in the order of sorting. In many use-cases, QuickSort is among the worst-performing sorting algorithms for this particular task.

      This is why java's native sort on Collections (using TimSort - which assumes a high frequency of pre-sorted lists in real-world data) is ideal for many PDF text extraction sorting scenarios, and is a shame it is not being used. This brings me to the next part of the improvement proposal:

      2) 

      As the TextPositionComparator is not transitive, the current implementation to avoid JDK7+ enforcing transitivity, and throwing the "Comparison method violates its general contract" IllegalArgumentException is to check for java version, and if lower than 7, then ALWAYS use QuickSort.

      As TimSort often "approaches" O(N) for many PDF extraction use-cases, and as TextPositionComparator will not always cause an IllegalArgumentException during many sorting tasks, I propose to use java's native sort (which is relatively cheap), and catch the IllegalArgumentException (in many cases will be thrown early - not late), and then use whichever preferred "legacy" sorting algorithm on that failure.

      Here is my implementation to show what I mean:

       

       

      import java.util.Collections;
      import java.util.Comparator;
      import java.util.List;
      
      public class SortUtils {
       public static <T> void sort(List<T> list, Comparator<T> comparator) {
         try{    
          Collections.sort(list, comparator);    
         }catch(java.lang.IllegalArgumentException e) {    
           MergeSort.sort(list, comparator);    
        } 
       }
      }
       
      

      This approach  (even when used on jdk7+) has a significant impact on sorting costs on real-world PDFs (even with the occasional "double-attempt").

      I am also happy to share the MergeSort implementation I am using (which follows the sort contract for generics). It is iterative (not recursive).  It is borrowed from here:

      https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java

       

      It has some slight tweaks (uses System.arraycopy, etc,) and is "generic-ified". Please just let me know. 

      I am also curious to know any reasons (which I may have missed), why QuickSort is the preferred sort (if the fact that many lists are mostly pre-sorted is already known and considered)?

      Thanks!

       

       

       

      Attachments

        1. sort-benchmark-02.txt
          80 kB
          Alistair Oldfield
        2. sort-benchmark.txt
          120 kB
          Alistair Oldfield

        Activity

          People

            tilman Tilman Hausherr
            alistairo Alistair Oldfield
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: