Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When working on TimSort (LUCENE-4839), I was a little frustrated of not being able to add galloping support because it would have required to add new primitive operations in addition to compare and swap.

      I started working on a prototype that uses inheritance to allow some sorting algorithms to rely on additional primitive operations. You can have a look at https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but beware it is a prototype and still misses proper documentation and good tests).

      I think it would offer several advantages:

      • no more need to implement setPivot and comparePivot when using in-place merge sort or insertion sort,
      • the ability to use faster stable sorting algorithms at the cost of some memory overhead (our in-place merge sort is very slow),
      • the ability to implement properly algorithms that are useful on specific datasets but require different primitive operations (such as TimSort for partially-sorted data).

      If you are interested in comparing these implementations with Arrays.sort, there is a Benchmark class in src/examples.

      What do you think?

      1. LUCENE-4946.patch
        130 kB
        Adrien Grand
      2. LUCENE-4946.patch
        130 kB
        Adrien Grand
      3. LUCENE-4946.patch
        130 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Dawid Weiss added a comment -

          Looks cool to me!

          Show
          Dawid Weiss added a comment - Looks cool to me!
          Hide
          Adrien Grand added a comment -

          This patch contains one base class Sorter and 3 implementations:

          • IntroSorter (improved quicksort like we had before but I think the name is better since it makes it clear that the worst case complexity is O(n ln) instead of O(n^2) as with traditional quicksort
          • InPlaceMergeSort, the merge sort we had before.
          • TimSort, an improved version of the previous implementation that can gallop to make sorting even faster on partially-sorted data.

          One major difference is that the end offsets are now exclusive. I tend to find it less confusing since you would now call sort(0, array.length) instead of sort(0, array.length - 1).

          Please let me know if you would like to review the patch!

          Show
          Adrien Grand added a comment - This patch contains one base class Sorter and 3 implementations: IntroSorter (improved quicksort like we had before but I think the name is better since it makes it clear that the worst case complexity is O(n ln ) instead of O(n^2) as with traditional quicksort InPlaceMergeSort, the merge sort we had before. TimSort, an improved version of the previous implementation that can gallop to make sorting even faster on partially-sorted data. One major difference is that the end offsets are now exclusive. I tend to find it less confusing since you would now call sort(0, array.length) instead of sort(0, array.length - 1) . Please let me know if you would like to review the patch!
          Hide
          Adrien Grand added a comment -

          Add missing @lucene.internal.

          Show
          Adrien Grand added a comment - Add missing @lucene.internal.
          Hide
          Uwe Schindler added a comment - - edited

          Hi Adrien,
          thansk for the refactoring. The history of the SorterTemplate class going back to CGLIB is long and this is a really good idea. Its also useful for other projects, so its maybe a good idea to make a Apache Commons projects out of it

          I scanned the patch, looks good. The from...to semantics are better now for the user. I think the original implementation used inclusive end because most implementations on the web were based on this. For me it always looked wrong, but I did not want to change it.

          I found some code duplication: To me it looks like ArrayUtil has a private re-implementation of ArrayIntroSorter which is a top-level class in oal.util. Could ArrayUtil not simply use that public impl instead? I know there are 2 implementations with Comparators and without comparators, just an idea! Maybe add a static final singleton NaturalComparator<T extends Comparable<? super T>> that calls compareTo, so we dont need 2 implementations.

          I also like that you used timsort at places were the lists are already sorted in the common case (like Automatons).

          Show
          Uwe Schindler added a comment - - edited Hi Adrien, thansk for the refactoring. The history of the SorterTemplate class going back to CGLIB is long and this is a really good idea. Its also useful for other projects, so its maybe a good idea to make a Apache Commons projects out of it I scanned the patch, looks good. The from...to semantics are better now for the user. I think the original implementation used inclusive end because most implementations on the web were based on this. For me it always looked wrong, but I did not want to change it. I found some code duplication: To me it looks like ArrayUtil has a private re-implementation of ArrayIntroSorter which is a top-level class in oal.util. Could ArrayUtil not simply use that public impl instead? I know there are 2 implementations with Comparators and without comparators, just an idea! Maybe add a static final singleton NaturalComparator<T extends Comparable<? super T>> that calls compareTo, so we dont need 2 implementations. I also like that you used timsort at places were the lists are already sorted in the common case (like Automatons).
          Hide
          Uwe Schindler added a comment -

          We should remove the following from NOTICE.txt:

          The class org.apache.lucene.util.SorterTemplate was inspired by CGLIB's class
          with the same name. The implementation part is mainly done using pre-existing
          Lucene sorting code. In-place stable mergesort was borrowed from CGLIB,
          which is Apache-licensed.

          The new code has no similarity anymore to the original code - its a complete reimplementation. Only the "pattern" stayed alive (you have abstract class, where you have to implement the compare and swap ops).

          Show
          Uwe Schindler added a comment - We should remove the following from NOTICE.txt: The class org.apache.lucene.util.SorterTemplate was inspired by CGLIB's class with the same name. The implementation part is mainly done using pre-existing Lucene sorting code. In-place stable mergesort was borrowed from CGLIB, which is Apache-licensed. The new code has no similarity anymore to the original code - its a complete reimplementation. Only the "pattern" stayed alive (you have abstract class, where you have to implement the compare and swap ops).
          Hide
          Dawid Weiss added a comment -

          I think the original implementation used inclusive end because most implementations on the web were based on this. For me it always looked wrong, but I did not want to change it.

          I admit I am on the 'inclusive' side of things. To me sort(0..5) means sort elements between indexes 0 and 5, simple. There is also a side-effect of making it exclusive – you can't sort the full array because an exclusive index on any end would overflow into negative values. I guess it's really a matter of taste in most cases. Perhaps the best way to change it would be to give (startIndex, elementsCount) which still reads (0, array.length) in most cases and does not have the problems mentioned above...

          Show
          Dawid Weiss added a comment - I think the original implementation used inclusive end because most implementations on the web were based on this. For me it always looked wrong, but I did not want to change it. I admit I am on the 'inclusive' side of things. To me sort(0..5) means sort elements between indexes 0 and 5, simple. There is also a side-effect of making it exclusive – you can't sort the full array because an exclusive index on any end would overflow into negative values. I guess it's really a matter of taste in most cases. Perhaps the best way to change it would be to give (startIndex, elementsCount) which still reads (0, array.length) in most cases and does not have the problems mentioned above...
          Hide
          Adrien Grand added a comment -

          Its also useful for other projects, so its maybe a good idea to make a Apache Commons projects out of it.

          Why not. Or maybe use an already existing commons project such as commons collections? I'll dig that...

          I found some code duplication

          I'll fix that. The reason is that I modified ArrayUtil and CollectionUtil which have their own private Sorter implementations and then I added tests which required me to have concrete implementations in src/test. I'll merge them.

          We should remove the following from NOTICE.txt

          I'll fix that too.

          Perhaps the best way to change it would be to give (startIndex, elementsCount) which still reads (0, array.length) in most cases and does not have the problems mentioned above...

          I have no strong opinion about that. I think the reason I like the (from,to) option better is that List.subList and Arrays.copyOfRange have the same arguments. For example someone who wants to sort a sub-list with the JDK would do Collections.sort(list.subList(from,to)). So I think it'd be nice to make directly translatable to new InPlaceMergeSorter() { compare/swap }.sort(from, to).

          Show
          Adrien Grand added a comment - Its also useful for other projects, so its maybe a good idea to make a Apache Commons projects out of it. Why not. Or maybe use an already existing commons project such as commons collections? I'll dig that... I found some code duplication I'll fix that. The reason is that I modified ArrayUtil and CollectionUtil which have their own private Sorter implementations and then I added tests which required me to have concrete implementations in src/test. I'll merge them. We should remove the following from NOTICE.txt I'll fix that too. Perhaps the best way to change it would be to give (startIndex, elementsCount) which still reads (0, array.length) in most cases and does not have the problems mentioned above... I have no strong opinion about that. I think the reason I like the (from,to) option better is that List.subList and Arrays.copyOfRange have the same arguments. For example someone who wants to sort a sub-list with the JDK would do Collections.sort(list.subList(from,to)) . So I think it'd be nice to make directly translatable to new InPlaceMergeSorter() { compare/swap }.sort(from, to) .
          Hide
          Adrien Grand added a comment -

          New Patch:

          • no more code duplication between ArrayUtil and the test classes
          • ArrayUtil exposes a NATURAL_COMPARATOR to sort arrays based on the natural order (for objects that implement Comparable)
          • Removed references to CGlib in the NOTICE.
          Show
          Adrien Grand added a comment - New Patch: no more code duplication between ArrayUtil and the test classes ArrayUtil exposes a NATURAL_COMPARATOR to sort arrays based on the natural order (for objects that implement Comparable) Removed references to CGlib in the NOTICE.
          Hide
          Uwe Schindler added a comment -

          +1, looks good.

          About the from/to issue: The whole JDK collections API used from and exclusive to, so I agree with Adrien, we should do it in the same way. The overflow issue is no real issue as the meximum array size is limited, too new byte[Integer.MAX_VALUE] does not work on most JDKs.

          Show
          Uwe Schindler added a comment - +1, looks good. About the from/to issue: The whole JDK collections API used from and exclusive to, so I agree with Adrien, we should do it in the same way. The overflow issue is no real issue as the meximum array size is limited, too new byte [Integer.MAX_VALUE] does not work on most JDKs.
          Hide
          Dawid Weiss added a comment -

          I still think inclusive ranges are more logical . For JDK subList and others the argument probably was that specifying inclusive zero-elements range becomes problematic with inclusive values.... so there's always something. I'm not objecting to choosing the "exclusive" option either, I'm just saying both options have their pros and cons.

          Show
          Dawid Weiss added a comment - I still think inclusive ranges are more logical . For JDK subList and others the argument probably was that specifying inclusive zero-elements range becomes problematic with inclusive values.... so there's always something. I'm not objecting to choosing the "exclusive" option either, I'm just saying both options have their pros and cons.
          Hide
          Adrien Grand added a comment -

          make a Apache Commons projects out of it

          I just left an email on their dev@ mailing-list to get their opinion about it: http://markmail.org/message/if5cgarhavzuy45j.

          Show
          Adrien Grand added a comment - make a Apache Commons projects out of it I just left an email on their dev@ mailing-list to get their opinion about it: http://markmail.org/message/if5cgarhavzuy45j .
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] jpountz
          http://svn.apache.org/viewvc?view=revision&revision=1478785

          LUCENE-4946: Refactor SorterTemplate (now Sorter).

          Show
          Commit Tag Bot added a comment - [trunk commit] jpountz http://svn.apache.org/viewvc?view=revision&revision=1478785 LUCENE-4946 : Refactor SorterTemplate (now Sorter).
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] jpountz
          http://svn.apache.org/viewvc?view=revision&revision=1478801

          LUCENE-4946: Re-add the random-access checks that have been lost during refactoring.

          Show
          Commit Tag Bot added a comment - [trunk commit] jpountz http://svn.apache.org/viewvc?view=revision&revision=1478801 LUCENE-4946 : Re-add the random-access checks that have been lost during refactoring.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] jpountz
          http://svn.apache.org/viewvc?view=revision&revision=1478802

          LUCENE-4946: Refactor SorterTemplate (now Sorter) (merged from r1478785 and r1478801).

          Show
          Commit Tag Bot added a comment - [branch_4x commit] jpountz http://svn.apache.org/viewvc?view=revision&revision=1478802 LUCENE-4946 : Refactor SorterTemplate (now Sorter) (merged from r1478785 and r1478801).
          Hide
          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues

          Show
          Steve Rowe added a comment - Bulk close resolved 4.4 issues
          Hide
          ASF subversion and git services added a comment -

          Commit 1508757 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1508757 ]

          LUCENE-5140: Fixed performance regression of span queries caused by LUCENE-4946.

          Show
          ASF subversion and git services added a comment - Commit 1508757 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1508757 ] LUCENE-5140 : Fixed performance regression of span queries caused by LUCENE-4946 .

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development