Lucene - Core
  1. Lucene - Core
  2. LUCENE-3054

SorterTemplate.quickSort stack overflows on broken comparators that produce only few disticnt values in large arrays

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Looking at Otis's sort problem on the mailing list, he said:

      * looked for other places where this call is made - found it in
      MultiPhraseQuery$MultiPhraseWeight and changed that call from
      ArrayUtil.quickSort to ArrayUtil.mergeSort
      * now we no longer see SorterTemplate.quickSort in deep recursion when we do a
      thread dump
      

      I thought this was interesting because PostingsAndFreq's comparator
      looks like it needs a tiebreaker.

      I think in our sorts we should add some asserts to try to catch some of these broken comparators.

      1. LUCENE-3054.patch
        6 kB
        Uwe Schindler
      2. LUCENE-3054.patch
        6 kB
        Uwe Schindler
      3. LUCENE-3054.patch
        6 kB
        Uwe Schindler
      4. LUCENE-3054.patch
        6 kB
        Uwe Schindler
      5. LUCENE-3054.patch
        6 kB
        Robert Muir
      6. LUCENE-3054.patch
        2 kB
        Robert Muir
      7. LUCENE-3054-dynamic.patch
        6 kB
        Uwe Schindler
      8. LUCENE-3054-stackoverflow.patch
        0.9 kB
        Uwe Schindler

        Activity

        Hide
        Robert Muir added a comment -

        really ugly prototype... i expect the generics/sort policeman will want to jump in here anyway

        but it does catch that problem:

            [junit] Testsuite: org.apache.lucene.index.TestCodecs
            [junit] Testcase: testSepPositionAfterMerge(org.apache.lucene.index.TestCodecs):    FAILED
            [junit] insane comparator for: org.apache.lucene.search.PhraseQuery$PostingsAndFreq
        
        Show
        Robert Muir added a comment - really ugly prototype... i expect the generics/sort policeman will want to jump in here anyway but it does catch that problem: [junit] Testsuite: org.apache.lucene.index.TestCodecs [junit] Testcase: testSepPositionAfterMerge(org.apache.lucene.index.TestCodecs): FAILED [junit] insane comparator for: org.apache.lucene.search.PhraseQuery$PostingsAndFreq
        Hide
        Otis Gospodnetic added a comment -

        Btw. this is with Lucene 3.1
        For full thread: http://search-lucene.com/m/ytANA59Q9G1

        Show
        Otis Gospodnetic added a comment - Btw. this is with Lucene 3.1 For full thread: http://search-lucene.com/m/ytANA59Q9G1
        Hide
        Robert Muir added a comment -

        i expanded the patch to all the sorts, just to find all the wierd sorting/comparators going on.

        it also finds some false positives, ones that are documented as inconsistent with equals, ones in tests, etc.

        but we can at least look into the ones it finds.

        Show
        Robert Muir added a comment - i expanded the patch to all the sorts, just to find all the wierd sorting/comparators going on. it also finds some false positives, ones that are documented as inconsistent with equals, ones in tests, etc. but we can at least look into the ones it finds.
        Hide
        Uwe Schindler added a comment -

        I investigated what happens here:

        The problem is indeed quickSort, but not undernormal circumstances. The problem with quickSort (just google for stack overflow and quicksort) is that it only works fine for arrays with many values. Once you only have few distinct values and a large array, depending on the oreder it may happen that it splits into two subarrays for next iteration, where one is very large and the other only contains few items.

        Attached is a patch, that shows the problem. It almost every time stack overflows. Also quicksort is very slow for this case.

        This is exactly what happens on PhraseQuery: we only have very few distinct items and possibly a very huge array. To fix this, we should change PhraseQuery to use mergeSort instead. Mergesort is also much faster in this case, as it always splits the array in the center. So the number of iterations is limited.

        For TermsHash/BytesRefHash its mostly also not a problem, as the values (the terms are 100% distict, as only the hash is sorted).

        But there may still be the slight chance this messes up. I propose to change SorterTemplate to fall back to mergeSort once it checks that number of iterations grows e.g. > 20 (have to test a little bit).

        I will change that issue to higher priority and we also need to backport to 3.1.

        Show
        Uwe Schindler added a comment - I investigated what happens here: The problem is indeed quickSort, but not undernormal circumstances. The problem with quickSort (just google for stack overflow and quicksort) is that it only works fine for arrays with many values. Once you only have few distinct values and a large array, depending on the oreder it may happen that it splits into two subarrays for next iteration, where one is very large and the other only contains few items. Attached is a patch, that shows the problem. It almost every time stack overflows. Also quicksort is very slow for this case. This is exactly what happens on PhraseQuery: we only have very few distinct items and possibly a very huge array. To fix this, we should change PhraseQuery to use mergeSort instead. Mergesort is also much faster in this case, as it always splits the array in the center. So the number of iterations is limited. For TermsHash/BytesRefHash its mostly also not a problem, as the values (the terms are 100% distict, as only the hash is sorted). But there may still be the slight chance this messes up. I propose to change SorterTemplate to fall back to mergeSort once it checks that number of iterations grows e.g. > 20 (have to test a little bit). I will change that issue to higher priority and we also need to backport to 3.1.
        Hide
        Robert Muir added a comment -

        I propose to change SorterTemplate to fall back to mergeSort once it checks that number of iterations grows e.g. > 20 (have to test a little bit).

        I like the idea of some "guard" here to prevent the stack overflow, and hopefully keep the quickSort performance for the places where we know its better than mergesort.

        Show
        Robert Muir added a comment - I propose to change SorterTemplate to fall back to mergeSort once it checks that number of iterations grows e.g. > 20 (have to test a little bit). I like the idea of some "guard" here to prevent the stack overflow, and hopefully keep the quickSort performance for the places where we know its better than mergesort.
        Hide
        Uwe Schindler added a comment -

        Patch that shows the issue.

        Show
        Uwe Schindler added a comment - Patch that shows the issue.
        Hide
        Uwe Schindler added a comment -

        As quicksort gets insanely slow when these type of data gets sorted, this also explains Otis' slowdown.

        Show
        Uwe Schindler added a comment - As quicksort gets insanely slow when these type of data gets sorted, this also explains Otis' slowdown.
        Hide
        Uwe Schindler added a comment -

        Due to the realtime merge (LUCENE-3023), suddenly DocFieldProcessor got a reincarnation of quicksort again... will remove, too

        Show
        Uwe Schindler added a comment - Due to the realtime merge ( LUCENE-3023 ), suddenly DocFieldProcessor got a reincarnation of quicksort again... will remove, too
        Hide
        Uwe Schindler added a comment -

        Here the patch that combines Robert's optimization for PhraseQuery (term with lower docFreq will also have less positions) and the safety for quickSort at all.

        Show
        Uwe Schindler added a comment - Here the patch that combines Robert's optimization for PhraseQuery (term with lower docFreq will also have less positions) and the safety for quickSort at all.
        Hide
        Uwe Schindler added a comment -

        Set fix versions (also backport to 3.1.1, as its serious for some large PhraseQueries and a serious slowdown then).

        Show
        Uwe Schindler added a comment - Set fix versions (also backport to 3.1.1, as its serious for some large PhraseQueries and a serious slowdown then).
        Hide
        Uwe Schindler added a comment -

        Sorry, the safety net is only needed at 40 (from my tests), before it may affect BytesRefHash performance.

        I will commit later!

        Show
        Uwe Schindler added a comment - Sorry, the safety net is only needed at 40 (from my tests), before it may affect BytesRefHash performance. I will commit later!
        Hide
        Uwe Schindler added a comment -

        Better test that fails faster in case of quickSort bug

        Show
        Uwe Schindler added a comment - Better test that fails faster in case of quickSort bug
        Hide
        Uwe Schindler added a comment -

        Final patch.

        After some discussion with robert: The use of QuickSort is fine after the comparator was fixed to not only sort by docFreq.

        Show
        Uwe Schindler added a comment - Final patch. After some discussion with robert: The use of QuickSort is fine after the comparator was fixed to not only sort by docFreq.
        Hide
        Uwe Schindler added a comment -

        Committed trunk revision: 1098633

        Now merging...

        Show
        Uwe Schindler added a comment - Committed trunk revision: 1098633 Now merging...
        Hide
        Uwe Schindler added a comment -

        Merged 3.x revision: 1098639
        Merged 3.1 revision: 1098641

        Show
        Uwe Schindler added a comment - Merged 3.x revision: 1098639 Merged 3.1 revision: 1098641
        Hide
        Michael McCandless added a comment -

        Reopening so we can discuss things further...:

        QuickSort is dangerous! Yet, it's definitely faster than MergeSort
        for some cases (~20% faster when sorting terms for writing segment, in
        quick test I ran on Wikipedia content).

        So the core issue is we should not use QS when there's a risk of any
        ties, because in that case it can run really slowly or hit infinite
        recursion.

        And we (well, Otis; thank you!) found one such place today (where
        MultiPhraseQuery sorts its terms) where we could have many ties and
        thus run very slowly / hit stack overflow.

        I appreciate the motivation for the "safety net", but, it makes me
        nervous... because, say we had done this a few months back... then
        Otis likely would not have reported the issue? Ie, the
        MultiPhraseQuery would run slowly... which could evade detection
        (people may just think it's slow).

        I prefer brittle fails over silent slowdowns because the brittle fail
        gets your attention and you get a real fix in. Silent slowdowns evade
        detection. Sort of like the difference between a virus and
        spyware...

        Also, what's preventing us from accidentally using QS somewhere in the
        future, where we shouldn't? What's going to catch us?

        Robert's first patch would catch this and protect us going forward?

        Or, maybe we could strengthen that approach and "assert cmp != 0"
        inside QS (ie, no ties are allowed to be passed to QS)?

        Though, using asserts only is risky, because it could be the
        comparator may return 0, but it's just that none of our test cases
        tickled it.

        Maybe instead we could do this in a type-safe way: make a new
        NoTiesComparator whose compare method can only return LESS_THAN or
        GREATER_THAN? And then QS would require NoTiesComparator. Could that
        work?

        Show
        Michael McCandless added a comment - Reopening so we can discuss things further...: QuickSort is dangerous! Yet, it's definitely faster than MergeSort for some cases (~20% faster when sorting terms for writing segment, in quick test I ran on Wikipedia content). So the core issue is we should not use QS when there's a risk of any ties, because in that case it can run really slowly or hit infinite recursion. And we (well, Otis; thank you!) found one such place today (where MultiPhraseQuery sorts its terms) where we could have many ties and thus run very slowly / hit stack overflow. I appreciate the motivation for the "safety net", but, it makes me nervous... because, say we had done this a few months back... then Otis likely would not have reported the issue? Ie, the MultiPhraseQuery would run slowly... which could evade detection (people may just think it's slow). I prefer brittle fails over silent slowdowns because the brittle fail gets your attention and you get a real fix in. Silent slowdowns evade detection. Sort of like the difference between a virus and spyware... Also, what's preventing us from accidentally using QS somewhere in the future, where we shouldn't? What's going to catch us? Robert's first patch would catch this and protect us going forward? Or, maybe we could strengthen that approach and "assert cmp != 0" inside QS (ie, no ties are allowed to be passed to QS)? Though, using asserts only is risky, because it could be the comparator may return 0, but it's just that none of our test cases tickled it. Maybe instead we could do this in a type-safe way: make a new NoTiesComparator whose compare method can only return LESS_THAN or GREATER_THAN? And then QS would require NoTiesComparator. Could that work?
        Hide
        Michael McCandless added a comment -

        Also, I think PQ.PostingsAndFreq.compare is still able to return ties, if the app puts the same term at the same position (which is a silly thing to do... but, still possible).

        I think instead of disambiguating by Term, we should disambiguate by ord (ie, position of this term in the array of the query itself), since that can never be the same for entries in the array?

        Show
        Michael McCandless added a comment - Also, I think PQ.PostingsAndFreq.compare is still able to return ties, if the app puts the same term at the same position (which is a silly thing to do... but, still possible). I think instead of disambiguating by Term, we should disambiguate by ord (ie, position of this term in the array of the query itself), since that can never be the same for entries in the array?
        Hide
        Dawid Weiss added a comment -

        I'm sure many of you know this, but there is a new implementation of mergesort in java.util.Collections – it is based on a few clever heuristics (so it is a merge sort, only a finely tuned one) and has been ported/ partially inspired by the sort in Python as far as I recall.

        Maybe it'd be sensible to compare against this and see what happens. I know Lucene/Solr would rather have its own implementation so that it doesn't rely on the standard library, but in my benchmarks the implementation in Collections.sort() was hard to beat...

        Show
        Dawid Weiss added a comment - I'm sure many of you know this, but there is a new implementation of mergesort in java.util.Collections – it is based on a few clever heuristics (so it is a merge sort, only a finely tuned one) and has been ported/ partially inspired by the sort in Python as far as I recall. Maybe it'd be sensible to compare against this and see what happens. I know Lucene/Solr would rather have its own implementation so that it doesn't rely on the standard library, but in my benchmarks the implementation in Collections.sort() was hard to beat...
        Hide
        Uwe Schindler added a comment -

        Dawid:
        There are two problems we have seen with native sort:

        • it copies the array/collection always first, this caused slowdown for lots of places especiall in automaton - so it never sorts in plcace
        • we sometimes need to sort multiple arrays in parallel, one as sort "key" -> especially in TermsHash/BytesRefHash. This is where SorterTemplate comes into the game: it supports separate swap(i,j) and compare(i,j) operations.

        Uwe

        Show
        Uwe Schindler added a comment - Dawid: There are two problems we have seen with native sort: it copies the array/collection always first, this caused slowdown for lots of places especiall in automaton - so it never sorts in plcace we sometimes need to sort multiple arrays in parallel, one as sort "key" -> especially in TermsHash/BytesRefHash. This is where SorterTemplate comes into the game: it supports separate swap(i,j) and compare(i,j) operations. Uwe
        Hide
        Dawid Weiss added a comment -

        Thanks Uwe, I didn't know about it. Still, the algorithm folks developing OpenJDK have implemented is public, so an improvement can be filed – maybe somebody will find the time to implement it in a version suitable for Lucene.

        http://en.wikipedia.org/wiki/Timsort

        Show
        Dawid Weiss added a comment - Thanks Uwe, I didn't know about it. Still, the algorithm folks developing OpenJDK have implemented is public, so an improvement can be filed – maybe somebody will find the time to implement it in a version suitable for Lucene. http://en.wikipedia.org/wiki/Timsort
        Hide
        Michael McCandless added a comment -

        So, there are two known improvements to our QS, to try to avoid the O(N^2)
        worst-case, both from Robert Sedgewick.

        First, it's better to select median of low/mid/high as the pivot
        (http://en.wikipedia.org/wiki/Quicksort#Choice_of_pivot). Second, we
        should handle "equal" values better
        (http://www.angelfire.com/pq/jamesbarbetti/articles/sorting/001_QuicksortIsBroken.htm#Duplicates).

        See also Lucy's nice QS impl:

        http://svn.apache.org/viewvc/incubator/lucy/trunk/core/Lucy/Util/SortUtils.c?revision=1098445&view=markup#l331

        which I think addresses the above two issues, and goes even further
        (eq-to-pivot values are explicitly "moved to the middle" and then not
        recursed on).

        The thing is, fixing these will make our QS more "general", at the
        expense of some added cost for the cases we know work fine today (eg
        sorting terms before flushing a segment).

        Maybe we leave our QS as is (except, changing the 40 to be dynamic
        depending on input length), noting that you should not use it if your
        comparator does not break ties, and even if it does there are still
        risks because of potentially bad pivot selection?

        Or, maybe we remove QS always use MS? Yes, there's a hit to the sort
        when flushing the segment, but this is a tiny cost compared to the
        rest of segment flushing...

        Separately we can look into whether the tool timsort is faster for
        sorting terms for flush....

        Show
        Michael McCandless added a comment - So, there are two known improvements to our QS, to try to avoid the O(N^2) worst-case, both from Robert Sedgewick. First, it's better to select median of low/mid/high as the pivot ( http://en.wikipedia.org/wiki/Quicksort#Choice_of_pivot ). Second, we should handle "equal" values better ( http://www.angelfire.com/pq/jamesbarbetti/articles/sorting/001_QuicksortIsBroken.htm#Duplicates ). See also Lucy's nice QS impl: http://svn.apache.org/viewvc/incubator/lucy/trunk/core/Lucy/Util/SortUtils.c?revision=1098445&view=markup#l331 which I think addresses the above two issues, and goes even further (eq-to-pivot values are explicitly "moved to the middle" and then not recursed on). The thing is, fixing these will make our QS more "general", at the expense of some added cost for the cases we know work fine today (eg sorting terms before flushing a segment). Maybe we leave our QS as is (except, changing the 40 to be dynamic depending on input length), noting that you should not use it if your comparator does not break ties, and even if it does there are still risks because of potentially bad pivot selection? Or, maybe we remove QS always use MS? Yes, there's a hit to the sort when flushing the segment, but this is a tiny cost compared to the rest of segment flushing... Separately we can look into whether the tool timsort is faster for sorting terms for flush....
        Hide
        Uwe Schindler added a comment -

        Maybe we leave our QS as is (except, changing the 40 to be dynamic
        depending on input length), noting that you should not use it if your
        comparator does not break ties, and even if it does there are still
        risks because of potentially bad pivot selection?

        That looks like this: http://en.wikipedia.org/wiki/Introsort

        We only need a good recursion depth where to switch!

        Show
        Uwe Schindler added a comment - Maybe we leave our QS as is (except, changing the 40 to be dynamic depending on input length), noting that you should not use it if your comparator does not break ties, and even if it does there are still risks because of potentially bad pivot selection? That looks like this: http://en.wikipedia.org/wiki/Introsort We only need a good recursion depth where to switch!
        Hide
        Uwe Schindler added a comment -

        Here a patch which implements what introsort does: if the depth of recursion is >75% of log2, switch to mergeSort.

        Also this patch moves all remaining quickSort calls to mergeSort on search side, where the comparators are not good. A few remaining ones in indexer keep alive, but those are all unique sets of terms or field names (needs some more review tomorrow).

        Mike: What do you think, maybe you can do some benchmarking?

        Show
        Uwe Schindler added a comment - Here a patch which implements what introsort does: if the depth of recursion is >75% of log2 , switch to mergeSort. Also this patch moves all remaining quickSort calls to mergeSort on search side, where the comparators are not good. A few remaining ones in indexer keep alive, but those are all unique sets of terms or field names (needs some more review tomorrow). Mike: What do you think, maybe you can do some benchmarking?
        Hide
        Uwe Schindler added a comment -

        Studying the C++ STL code showed that they use 2 * log2 as depth limit. I implemented that. It showed that for the most cases in Lucene (BytesRefHash), it uses quicksort (so no change to performance). The other cases use already mergeSort and the "bad" test in TestArrayUtil switches sucessfully to mergeSort.

        Show
        Uwe Schindler added a comment - Studying the C++ STL code showed that they use 2 * log2 as depth limit. I implemented that. It showed that for the most cases in Lucene (BytesRefHash), it uses quicksort (so no change to performance). The other cases use already mergeSort and the "bad" test in TestArrayUtil switches sucessfully to mergeSort.
        Hide
        Michael McCandless added a comment -

        Patch looks good! I like the 2*log_2(N) dynamic cutover; this means we can tolerate somewhat lopsided QS recursion and remain using QS.

        Show
        Michael McCandless added a comment - Patch looks good! I like the 2*log_2(N) dynamic cutover; this means we can tolerate somewhat lopsided QS recursion and remain using QS.
        Hide
        Uwe Schindler added a comment -

        Committed trunk revision: 1099041
        Merged 3.x revision: 1099045
        Merged 3.1 revision: 1099046

        Show
        Uwe Schindler added a comment - Committed trunk revision: 1099041 Merged 3.x revision: 1099045 Merged 3.1 revision: 1099046
        Hide
        Robert Muir added a comment -

        Bulk closing for 3.2

        Show
        Robert Muir added a comment - Bulk closing for 3.2

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development