Lucene - Core
  1. Lucene - Core
  2. LUCENE-3191

Add TopDocs.merge to merge multiple TopDocs

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It's not easy today to merge TopDocs, eg produced by multiple shards,
      supporting arbitrary Sort.

      1. LUCENE-3191-SlowCollatorCompareFix.patch
        1 kB
        Uwe Schindler
      2. LUCENE-3191-SlowCollatorCompareFix.patch
        3 kB
        Uwe Schindler
      3. LUCENE-3191-3x.patch
        82 kB
        Michael McCandless
      4. LUCENE-3191.patch
        39 kB
        Michael McCandless
      5. LUCENE-3191.patch
        52 kB
        Michael McCandless
      6. LUCENE-3191.patch
        49 kB
        Michael McCandless
      7. LUCENE-3191.patch
        82 kB
        Michael McCandless
      8. LUCENE-3191.patch
        14 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch.

        The basic idea is simple (use PQ to find top N across all shards),
        but, I had to add FieldComparator.compare(Comparable, Comparable).
        Ie, the FieldComparator should be able to compare the Comparables
        returned by its value method.

        Show
        Michael McCandless added a comment - Patch. The basic idea is simple (use PQ to find top N across all shards), but, I had to add FieldComparator.compare(Comparable, Comparable). Ie, the FieldComparator should be able to compare the Comparables returned by its value method.
        Hide
        Uwe Schindler added a comment -

        The basic idea is simple (use PQ to find top N across all shards), but, I had to add FieldComparator.compare(Comparable, Comparable).

        That makes no sense to me, because Comparable<?> can always compare against each other without a separate comparator. The old MultiSearcher did exactly do this. This is why it returns Comparable<?>. So instead FieldComparator.compare(a, b) just use a.compareTo(b). It's in the responsibility of the Comparator to return a correctly wrapped Comparable.

        There might only be a bug in RelevanceComparator: Its getValue() method returns a comparable that sorts in wrong order. We have no test for this, so it might never cause a test failure.

        Show
        Uwe Schindler added a comment - The basic idea is simple (use PQ to find top N across all shards), but, I had to add FieldComparator.compare(Comparable, Comparable). That makes no sense to me, because Comparable<?> can always compare against each other without a separate comparator. The old MultiSearcher did exactly do this. This is why it returns Comparable<?>. So instead FieldComparator.compare(a, b) just use a.compareTo(b). It's in the responsibility of the Comparator to return a correctly wrapped Comparable. There might only be a bug in RelevanceComparator: Its getValue() method returns a comparable that sorts in wrong order. We have no test for this, so it might never cause a test failure.
        Hide
        Michael McCandless added a comment -

        Uwe, you are right! Now why didn't I think of that...

        The returned Comparable should be expected to properly compare itself to any other Comparable returned from FieldComparator.value... so I'll do that and then the patch is nice and small. And no API change for 3.x.

        Show
        Michael McCandless added a comment - Uwe, you are right! Now why didn't I think of that... The returned Comparable should be expected to properly compare itself to any other Comparable returned from FieldComparator.value... so I'll do that and then the patch is nice and small. And no API change for 3.x.
        Hide
        Michael McCandless added a comment -

        So... I started down this path (relying on the returned Comparable
        from .value to .compareTo themselves, instead of adding new .compare
        method to FieldComp), but I'm not sure I like it...

        I had to add a ReverseFloatComparable inside RelevanceComp, since it
        sorts opposite natural Float sort order by default.

        But then what this means, for an app that wants to do some sharding,
        suddenly a TopDocs might contain an instance of this class, whereas
        now it contains plain Java objects (Float, Integer, etc.).

        I also don't like that this is splitting up the logic of how relevacne
        scores compare to one another across two places (RelevanceComp and
        this new ReverseFloatComparable).

        I think it'd be better if we keep simple objects in the TopDocs, to
        keep it easy for apps to serialize themselves (since we don't impl
        Serializable anymore), and then the front end would invoke
        RelevanceComparator locally to properly compare the floats.

        Ie, really FieldComp.value should never have returned Comparable, I
        think?

        Show
        Michael McCandless added a comment - So... I started down this path (relying on the returned Comparable from .value to .compareTo themselves, instead of adding new .compare method to FieldComp), but I'm not sure I like it... I had to add a ReverseFloatComparable inside RelevanceComp, since it sorts opposite natural Float sort order by default. But then what this means, for an app that wants to do some sharding, suddenly a TopDocs might contain an instance of this class, whereas now it contains plain Java objects (Float, Integer, etc.). I also don't like that this is splitting up the logic of how relevacne scores compare to one another across two places (RelevanceComp and this new ReverseFloatComparable). I think it'd be better if we keep simple objects in the TopDocs, to keep it easy for apps to serialize themselves (since we don't impl Serializable anymore), and then the front end would invoke RelevanceComparator locally to properly compare the floats. Ie, really FieldComp.value should never have returned Comparable, I think?
        Hide
        Uwe Schindler added a comment -

        This still confuses me:

        There might only be a bug in RelevanceComparator: Its getValue() method returns a comparable that sorts in wrong order. We have no test for this, so it might never cause a test failure.

        In my opinion, it should return a negative Float object. But as far as I know, there was/is already some special case in the collectors merge code used to merge segment's results (FieldValueHitQueue.fillFields copys the values into the collected docs, but I am not sure if this is still used.

        The good old deprecated FieldDocSortedHitQueue in 3.x (what's the replacement?) contains this special case:

        } else {
          c = docA.fields[i].compareTo(docB.fields[i]);
          if (type == SortField.SCORE) {
            c = -c;
          }
        }
        

        In trunk it's gone, so we can maybe fix this stupidness. The Comparable returned by RelevanceComparator (used with SortField.SCORE) should simply be negative? Else we have to add this special case in your TopDocs.merge, too.

        Show
        Uwe Schindler added a comment - This still confuses me: There might only be a bug in RelevanceComparator: Its getValue() method returns a comparable that sorts in wrong order. We have no test for this, so it might never cause a test failure. In my opinion, it should return a negative Float object. But as far as I know, there was/is already some special case in the collectors merge code used to merge segment's results (FieldValueHitQueue.fillFields copys the values into the collected docs, but I am not sure if this is still used. The good old deprecated FieldDocSortedHitQueue in 3.x (what's the replacement?) contains this special case: } else { c = docA.fields[i].compareTo(docB.fields[i]); if (type == SortField.SCORE) { c = -c; } } In trunk it's gone, so we can maybe fix this stupidness. The Comparable returned by RelevanceComparator (used with SortField.SCORE) should simply be negative? Else we have to add this special case in your TopDocs.merge, too.
        Hide
        Uwe Schindler added a comment -

        By the way, in current trunk the value() method in FieldComparator is obsolete and slows down search, if the field values are not needed. But of course, this patch makes use of it again, but we should correct it.

        Show
        Uwe Schindler added a comment - By the way, in current trunk the value() method in FieldComparator is obsolete and slows down search, if the field values are not needed. But of course, this patch makes use of it again, but we should correct it.
        Hide
        Uwe Schindler added a comment -

        We had some discussions about cleaning this up in IRC: http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2011-06-10#l235

        Show
        Uwe Schindler added a comment - We had some discussions about cleaning this up in IRC: http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2011-06-10#l235
        Hide
        Michael McCandless added a comment -

        New patch:

        • Changes .value from Comparator (which is trappy because you think
          you're free to .compareTo them) to parameterized type passed to
          FieldComparator.
        • Renames .compare -> .compareValues, which are now type checked w/
          generic.
        • Changes FieldDoc.fields from Comparable[] to Object[]

        Will need to work out how we backport this to 3.x; the change from
        Comparable to Object is an API break, though... maybe not many apps
        are using FieldDoc.field.

        Show
        Michael McCandless added a comment - New patch: Changes .value from Comparator (which is trappy because you think you're free to .compareTo them) to parameterized type passed to FieldComparator. Renames .compare -> .compareValues, which are now type checked w/ generic. Changes FieldDoc.fields from Comparable[] to Object[] Will need to work out how we backport this to 3.x; the change from Comparable to Object is an API break, though... maybe not many apps are using FieldDoc.field.
        Hide
        Michael McCandless added a comment -

        New patch, adds default impl for FC.compareValues to just cast to Comparable<T> and call .compareTo. All but 2 places just use this default...

        Show
        Michael McCandless added a comment - New patch, adds default impl for FC.compareValues to just cast to Comparable<T> and call .compareTo. All but 2 places just use this default...
        Hide
        Uwe Schindler added a comment -

        Looks fine, I am happy now

        The RelevanceComparator should use simply to minimize unboxing:

        +    public int compareValues(Float first, Float second) {
        +      return second.compareTo(first); // reverse!
        +    }
        

        Will review more closely tomorrow!

        Show
        Uwe Schindler added a comment - Looks fine, I am happy now The RelevanceComparator should use simply to minimize unboxing: + public int compareValues( Float first, Float second) { + return second.compareTo(first); // reverse! + } Will review more closely tomorrow!
        Hide
        Michael McCandless added a comment -

        Thanks Uwe; I'll fix that.

        Show
        Michael McCandless added a comment - Thanks Uwe; I'll fix that.
        Hide
        Michael McCandless added a comment -

        New patch, addresses feedback from Uwe (thanks!), and also adds
        merging for TopGroups and SearchGroup to make distributed grouping
        easy(ier).

        I think it's ready to commit!

        Show
        Michael McCandless added a comment - New patch, addresses feedback from Uwe (thanks!), and also adds merging for TopGroups and SearchGroup to make distributed grouping easy(ier). I think it's ready to commit!
        Hide
        Michael McCandless added a comment -

        For 3.x, I think we should make an exception to back-compat and break the API (changing FieldComp.value(..) to return <T> not Comparable; changing FieldDoc.fields from Comparable[] to Object[]). I'll advertise the break in CHANGES.

        Show
        Michael McCandless added a comment - For 3.x, I think we should make an exception to back-compat and break the API (changing FieldComp.value(..) to return <T> not Comparable; changing FieldDoc.fields from Comparable[] to Object[]). I'll advertise the break in CHANGES.
        Hide
        Uwe Schindler added a comment -

        I think this has less impact on users. Two user types:

        • People using FieldDoc.fields[] would always cast the return type, so a simple recompile should be fine
        • People writing own FieldComparators must change return value of getValue() and maybe add generics (not required)
        • People that dont implement compareValue() will be also fine, as the default impl casts to Comparable and that will have the same behaviour

        The 3.x impl just have to fix FieldDocSortedHitQueue to use compareValue() and remove the negation for scores.

        Show
        Uwe Schindler added a comment - I think this has less impact on users. Two user types: People using FieldDoc.fields[] would always cast the return type, so a simple recompile should be fine People writing own FieldComparators must change return value of getValue() and maybe add generics (not required) People that dont implement compareValue() will be also fine, as the default impl casts to Comparable and that will have the same behaviour The 3.x impl just have to fix FieldDocSortedHitQueue to use compareValue() and remove the negation for scores.
        Hide
        Michael McCandless added a comment -

        Patch for merging back to 3.x.

        Show
        Michael McCandless added a comment - Patch for merging back to 3.x.
        Hide
        Uwe Schindler added a comment -

        Patch looks good, let the BackwardsPoliceman think about some possibilities to lower the risk of breaking code. Of course nothing sophisticated...

        Show
        Uwe Schindler added a comment - Patch looks good, let the BackwardsPoliceman think about some possibilities to lower the risk of breaking code. Of course nothing sophisticated...
        Hide
        Michael McCandless added a comment -

        Small further patch for trunk:

        • Simplifies the API by moving shardIndex onto ScoreDoc
        • Fixes TopDocs.merge to return TopFieldDocs if the Sort != null
        • A couple FieldComparators must override compareValue because the
          values may be null.
        Show
        Michael McCandless added a comment - Small further patch for trunk: Simplifies the API by moving shardIndex onto ScoreDoc Fixes TopDocs.merge to return TopFieldDocs if the Sort != null A couple FieldComparators must override compareValue because the values may be null.
        Hide
        Michael McCandless added a comment -

        Thanks Uwe!

        Show
        Michael McCandless added a comment - Thanks Uwe!
        Hide
        Robert Muir added a comment -

        Reopening: this code in SlowCollatedStringComparator is totally broken:

          @Override
          public int compareValues(BytesRef first, BytesRef second) {
            if (first == null) {
              if (second == null) {
                return 0;
              }
              return -1;
            } else if (second == null) {
              return 1;
            } else {
              return collator.compare(first, second);
            }
          }
        

        I haven't tracked this issue to understand whats going on here, but you cannot pass BytesRefs to collator.compare. If this code is ever reached (and looking at the test i wrote for this damn thing, its unclear if this code is even necessary?!), it will throw ClassCastException:
        http://download.oracle.com/javase/1,5.0/docs/api/java/text/Collator.html#compare(java.lang.Object, java.lang.Object)

        Show
        Robert Muir added a comment - Reopening: this code in SlowCollatedStringComparator is totally broken: @Override public int compareValues(BytesRef first, BytesRef second) { if (first == null) { if (second == null) { return 0; } return -1; } else if (second == null) { return 1; } else { return collator.compare(first, second); } } I haven't tracked this issue to understand whats going on here, but you cannot pass BytesRefs to collator.compare. If this code is ever reached (and looking at the test i wrote for this damn thing, its unclear if this code is even necessary?!), it will throw ClassCastException: http://download.oracle.com/javase/1,5.0/docs/api/java/text/Collator.html#compare(java.lang.Object , java.lang.Object)
        Hide
        Uwe Schindler added a comment -

        I looked at the code, it was ununderstandable why this thing was generified to BytesRef at all. As it compares String, getValue() and compareValue() should take Strings.

        And thats done in the patch, its even a little bit no so slow

        Show
        Uwe Schindler added a comment - I looked at the code, it was ununderstandable why this thing was generified to BytesRef at all. As it compares String, getValue() and compareValue() should take Strings. And thats done in the patch, its even a little bit no so slow
        Hide
        Uwe Schindler added a comment - - edited

        I haven't tracked this issue to understand whats going on here, but you cannot pass BytesRefs to collator.compare. If this code is ever reached (and looking at the test i wrote for this damn thing, its unclear if this code is even necessary?!)

        The code is necessary once you cross segments during compare merge TopDocs. Maybe the index in the tests is so small that it never has multiple segments.

        The clover report shows that this is not the case, it even shows that compare does not use compareBottom():
        https://builds.apache.org/job/Lucene-trunk/clover/org/apache/lucene/search/SlowCollatedStringComparator.html

        It never ever calls even compareBottom() so this index is really small. We should make it bigger to test correctly. The getValue() method is only called because the actual String (previously: BytesRef) values are placed in TopFieldDocs.

        Show
        Uwe Schindler added a comment - - edited I haven't tracked this issue to understand whats going on here, but you cannot pass BytesRefs to collator.compare. If this code is ever reached (and looking at the test i wrote for this damn thing, its unclear if this code is even necessary?!) The code is necessary once you cross segments during compare merge TopDocs. Maybe the index in the tests is so small that it never has multiple segments. The clover report shows that this is not the case, it even shows that compare does not use compareBottom(): https://builds.apache.org/job/Lucene-trunk/clover/org/apache/lucene/search/SlowCollatedStringComparator.html It never ever calls even compareBottom() so this index is really small. We should make it bigger to test correctly. The getValue() method is only called because the actual String (previously: BytesRef) values are placed in TopFieldDocs.
        Hide
        Uwe Schindler added a comment -

        New patch that also checks the getValue/compareValues methods in the comparator.

        Sorry, my earlier comment about too small indexes was wrong. The indexes in fact have segments. The problem was that I was irritated by the former MultiSearcher code that used these compareValues() code. In trunk, only TopDocs.merge() does this.

        So I changed the test:

        • It does not use MatchAllDocsQuery, instead two TermRangeQueries with a random center point, so it splits the index into two parts. Each result is sorted and checked that its valid.
        • With the two result sets it calls TopDocs.merge() and again checks the result. This call will invoke the value methods and here the test fails as Robert said.
        • For each search call it does not always set the maximum number of docs as PQ size (else compareBottom/setBottom) are not tested. It reduces the PQ size to a fraction of numDocs for all searches.

        In general we should add similar tests to all custom comparators in Lucene's code, because the comparator can only be tested completely if you also check merging TopDocs.

        Show
        Uwe Schindler added a comment - New patch that also checks the getValue/compareValues methods in the comparator. Sorry, my earlier comment about too small indexes was wrong. The indexes in fact have segments. The problem was that I was irritated by the former MultiSearcher code that used these compareValues() code. In trunk, only TopDocs.merge() does this. So I changed the test: It does not use MatchAllDocsQuery, instead two TermRangeQueries with a random center point, so it splits the index into two parts. Each result is sorted and checked that its valid. With the two result sets it calls TopDocs.merge() and again checks the result. This call will invoke the value methods and here the test fails as Robert said. For each search call it does not always set the maximum number of docs as PQ size (else compareBottom/setBottom) are not tested. It reduces the PQ size to a fraction of numDocs for all searches. In general we should add similar tests to all custom comparators in Lucene's code, because the comparator can only be tested completely if you also check merging TopDocs.
        Hide
        Robert Muir added a comment -

        ahh thank you!!! I was very confused at why this method was untested!

        Show
        Robert Muir added a comment - ahh thank you!!! I was very confused at why this method was untested!
        Hide
        Michael McCandless added a comment -

        Thanks Uwe and Robert – good catch! Scary.

        Maybe, to better test TopDocs.merge, we could fix newSearcher (AssertingIndexSearcher) to, sometimes, break into N searches on the sub-searchers and then merge the results, from its search methods?

        Show
        Michael McCandless added a comment - Thanks Uwe and Robert – good catch! Scary. Maybe, to better test TopDocs.merge, we could fix newSearcher (AssertingIndexSearcher) to, sometimes, break into N searches on the sub-searchers and then merge the results, from its search methods?
        Hide
        Uwe Schindler added a comment -

        Committed fix trunk rev 1140713

        Show
        Uwe Schindler added a comment - Committed fix trunk rev 1140713
        Hide
        Uwe Schindler added a comment -

        Mike: This is a good idea, not sure if this could break scores, ranking and other things

        Show
        Uwe Schindler added a comment - Mike: This is a good idea, not sure if this could break scores, ranking and other things
        Hide
        Robert Muir added a comment -

        also remember, when you call newSearcher, it optionally takes a default boolean parameter (maybeWrap, default=true).

        So tests sensitive to stuff like this can just pass false here... in general I think beefing up newSearcher with stuff like this is a good idea, so far adding logic here (randomly use executor, randomly use slowwrapper, etc) has always found some bugs... eventually

        Show
        Robert Muir added a comment - also remember, when you call newSearcher, it optionally takes a default boolean parameter (maybeWrap, default=true). So tests sensitive to stuff like this can just pass false here... in general I think beefing up newSearcher with stuff like this is a good idea, so far adding logic here (randomly use executor, randomly use slowwrapper, etc) has always found some bugs... eventually
        Hide
        Michael McCandless added a comment -

        I added this to the TestIdeas wiki page.

        Show
        Michael McCandless added a comment - I added this to the TestIdeas wiki page.
        Hide
        Michael McCandless added a comment -

        This can be done w/o breaking scores – we can create a top-level weight, and then search against the subs (this is how TestTopDocs works).

        Show
        Michael McCandless added a comment - This can be done w/o breaking scores – we can create a top-level weight, and then search against the subs (this is how TestTopDocs works).
        Hide
        Robert Muir added a comment -

        bulk close for 3.3

        Show
        Robert Muir added a comment - bulk close for 3.3

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development