Lucene - Core
  1. Lucene - Core
  2. LUCENE-4839

Sorter API: Use TimSort to sort doc IDs and postings lists

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used by python and Java's Arrays.sort(Object[]) in particular) is a sorting algorithm that performs very well on partially-sorted data. Indeed, with TimSort, sorting an array which is in reverse order or a finite concatenation of sorted arrays is a linear operation (instead of O(n ln)).

      The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS or merging several sorted readers for example.

      1. LUCENE-4839.patch
        17 kB
        Adrien Grand
      2. LUCENE-4839.patch
        30 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          If you want to port TimSort to Lucene, would it be possible to do it based on SorterTemplate? So you could implement SorterTemplate and use it like: new SorterTemplate()

          { ... }

          .timSort(from,to)

          Show
          Uwe Schindler added a comment - If you want to port TimSort to Lucene, would it be possible to do it based on SorterTemplate? So you could implement SorterTemplate and use it like: new SorterTemplate() { ... } .timSort(from,to)
          Hide
          Adrien Grand added a comment -

          If you want to port TimSort to Lucene, would it be possible to do it based on SorterTemplate?

          Here it is.

          Show
          Adrien Grand added a comment - If you want to port TimSort to Lucene, would it be possible to do it based on SorterTemplate? Here it is.
          Hide
          Adrien Grand added a comment -

          One major difference with the original impl is that I reused the merge routine used by mergeSort instead of porting the original one which has a few optimizations to merge runs which have different lengths and/or some patterns (look for "galloping" in listsort.txt) but requires extra memory. This doesn't change the fact that this impl performs extremely well when data is partially sorted.

          Show
          Adrien Grand added a comment - One major difference with the original impl is that I reused the merge routine used by mergeSort instead of porting the original one which has a few optimizations to merge runs which have different lengths and/or some patterns (look for "galloping" in listsort.txt) but requires extra memory. This doesn't change the fact that this impl performs extremely well when data is partially sorted.
          Hide
          Uwe Schindler added a comment - - edited

          Nice! Why do we need the private inner class TimSort?

          I would be happy to also add the timSort algorithm to ArrayUtils and CollectionUtils. Its just adding some convenience methods using the already existing SorterTemplates for arrays and collections.
          The bonus would be: The extensive random tests in TestArrayUtils and TestCollectionUtils could be used for timSort, too (their existence is the reason why there is no TestSorterTemplate class in current code).

          Show
          Uwe Schindler added a comment - - edited Nice! Why do we need the private inner class TimSort? I would be happy to also add the timSort algorithm to ArrayUtils and CollectionUtils. Its just adding some convenience methods using the already existing SorterTemplates for arrays and collections. The bonus would be: The extensive random tests in TestArrayUtils and TestCollectionUtils could be used for timSort, too (their existence is the reason why there is no TestSorterTemplate class in current code).
          Hide
          Yonik Seeley added a comment -

          Sweet!
          Extra nostalgia points - I remember when timsort was put into python. I hadn't realized that Java7's object sorting was based on it though.

          Show
          Yonik Seeley added a comment - Sweet! Extra nostalgia points - I remember when timsort was put into python. I hadn't realized that Java7's object sorting was based on it though.
          Hide
          Adrien Grand added a comment -

          Nice! Why do we need the private inner class TimSort?

          It's no needed but my first patch (not uploaded) did not use a helper class and was hard to read, so I think this is better this way?

          I would be happy to also add the timSort algorithm to ArrayUtils and CollectionUtils.

          Done in the patch.

          The bonus would be: The extensive random tests in TestArrayUtils and TestCollectionUtils could be used for timSort, too (their existence is the reason why there is no TestSorterTemplate class in current code).

          Done.

          Show
          Adrien Grand added a comment - Nice! Why do we need the private inner class TimSort? It's no needed but my first patch (not uploaded) did not use a helper class and was hard to read, so I think this is better this way? I would be happy to also add the timSort algorithm to ArrayUtils and CollectionUtils. Done in the patch. The bonus would be: The extensive random tests in TestArrayUtils and TestCollectionUtils could be used for timSort, too (their existence is the reason why there is no TestSorterTemplate class in current code). Done.
          Hide
          Uwe Schindler added a comment -

          Looks good!

          There is one copypaste error in TestArrayUtil#testTimSort. The first test still uses mergeSort!

          Show
          Uwe Schindler added a comment - Looks good! There is one copypaste error in TestArrayUtil#testTimSort. The first test still uses mergeSort!
          Hide
          Adrien Grand added a comment -

          Thanks UWe, I'll fix it before committing!

          Show
          Adrien Grand added a comment - Thanks UWe, I'll fix it before committing!
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1457272

          LUCENE-4839: Sorter API: Use TimSort to sort doc IDs and postings lists.

          Show
          Commit Tag Bot added a comment - [trunk commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1457272 LUCENE-4839 : Sorter API: Use TimSort to sort doc IDs and postings lists.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1457274

          LUCENE-4839: Sorter API: Use TimSort to sort doc IDs and postings lists (merged from r1457272).

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1457274 LUCENE-4839 : Sorter API: Use TimSort to sort doc IDs and postings lists (merged from r1457272).
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1457285

          Merged revision(s) 1457284 from lucene/dev/trunk:
          LUCENE-4839: Javadocs links, prevent synthetic accessor method access$0() for merge()

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1457285 Merged revision(s) 1457284 from lucene/dev/trunk: LUCENE-4839 : Javadocs links, prevent synthetic accessor method access$0() for merge()
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1457284

          LUCENE-4839: Javadocs links, prevent synthetic accessor method access$0() for merge()

          Show
          Commit Tag Bot added a comment - [trunk commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1457284 LUCENE-4839 : Javadocs links, prevent synthetic accessor method access$0() for merge()
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1457315

          LUCENE-4839: Fix bug in SorterTemplate.timSort().

          Show
          Commit Tag Bot added a comment - [trunk commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1457315 LUCENE-4839 : Fix bug in SorterTemplate.timSort().
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1457321

          LUCENE-4839: Fix bug in SorterTemplate.timSort() (merged from r1457315).

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1457321 LUCENE-4839 : Fix bug in SorterTemplate.timSort() (merged from r1457315).

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development