Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As a next step of LUCENE-5527, it would be nice to have per-segment comparators, and maybe even change the default behavior of our top* comparators so that they merge top hits in the end.

      1. LUCENE-5702.patch
        191 kB
        Adrien Grand
      2. LUCENE-5702.patch
        180 kB
        Adrien Grand
      3. SortBench.java
        6 kB
        Adrien Grand

        Activity

        Hide
        Michael McCandless added a comment -

        +1, it would be great to massively simplify the scary FieldComparator API that we have today.

        Show
        Michael McCandless added a comment - +1, it would be great to massively simplify the scary FieldComparator API that we have today.
        Hide
        Adrien Grand added a comment -

        Here is a patch. It splits the FieldComparator API into FieldComparator and LeafFieldComparator (like Collector and LeafCollector).

        As a consequence our TopDocsCollectors do not extend SimpleCollector anymore and return a new LeafCollector instance per leaf.

        TopFieldCollector was a bit of a pain to migrate because of all the specialization, so I replaced the OneComparator/MultiComparator specialization with some wrapping of the LeafFieldComparator API so that it always seems like there is a single comparator to the TopFieldCollector (only the leaf comparators are wrapped, not the top-level ones so that the situation is not weird because of the value(int slot) method that can return a single value). This helped remove a significant amount of code (7 collectors instead of 13).

        Some grouping/join collectors were a bit hard to migrate so for now they still extend SimpleCollector and keep a reference to the currently wrapped leaf comparator/collector. I think this can be addressed later?

        For now the patch only changes the API but we could imagine doing funny things now that the top docs collectors and comparators are fully per-segment, like only counting hits on segments that are not competitive (because of their min/max value or because they don't have a value for a given field – if we removed the constraint that TopDocsCollector needs to return the total hit count, we could even completely skip such segments) or not worrying about the reader generation of the first segment when using ordinals to sort.

        Show
        Adrien Grand added a comment - Here is a patch. It splits the FieldComparator API into FieldComparator and LeafFieldComparator (like Collector and LeafCollector). As a consequence our TopDocsCollectors do not extend SimpleCollector anymore and return a new LeafCollector instance per leaf. TopFieldCollector was a bit of a pain to migrate because of all the specialization, so I replaced the OneComparator/MultiComparator specialization with some wrapping of the LeafFieldComparator API so that it always seems like there is a single comparator to the TopFieldCollector (only the leaf comparators are wrapped, not the top-level ones so that the situation is not weird because of the value(int slot) method that can return a single value). This helped remove a significant amount of code (7 collectors instead of 13). Some grouping/join collectors were a bit hard to migrate so for now they still extend SimpleCollector and keep a reference to the currently wrapped leaf comparator/collector. I think this can be addressed later? For now the patch only changes the API but we could imagine doing funny things now that the top docs collectors and comparators are fully per-segment, like only counting hits on segments that are not competitive (because of their min/max value or because they don't have a value for a given field – if we removed the constraint that TopDocsCollector needs to return the total hit count, we could even completely skip such segments) or not worrying about the reader generation of the first segment when using ordinals to sort.
        Hide
        David Smiley added a comment -

        +1, it would be great to massively simplify the scary FieldComparator API that we have today.

        +1 to per-segment. I took a quick look at the patch. I don't see a simplification, which is not a complaint but just wondering why you feel things are simpler Mike.

        I had to implement my own FieldComparator yesterday and it was a bit of a pain, particularly because of the primitive type'ing inhibits a generic useful base class, and so I had to write my own. It would be nice if there was an int, long, float, and double set of base classes. There is kind of a set of these but they assume they fetch from DocValues. I think with some simple changes they could be generic, and only one abstract method to fetch the primitive value for the current document.

        Show
        David Smiley added a comment - +1, it would be great to massively simplify the scary FieldComparator API that we have today. +1 to per-segment. I took a quick look at the patch. I don't see a simplification , which is not a complaint but just wondering why you feel things are simpler Mike. I had to implement my own FieldComparator yesterday and it was a bit of a pain, particularly because of the primitive type'ing inhibits a generic useful base class, and so I had to write my own. It would be nice if there was an int, long, float, and double set of base classes. There is kind of a set of these but they assume they fetch from DocValues. I think with some simple changes they could be generic, and only one abstract method to fetch the primitive value for the current document.
        Hide
        Adrien Grand added a comment -

        I don't see a simplification, which is not a complaint but just wondering why you feel things are simpler Mike.

        Indeed there is no real simplification here, it just splits the API into two classes.

        I had to implement my own FieldComparator yesterday and it was a bit of a pain, particularly because of the primitive type'ing inhibits a generic useful base class, and so I had to write my own. It would be nice if there was an int, long, float, and double set of base classes. There is kind of a set of these but they assume they fetch from DocValues. I think with some simple changes they could be generic, and only one abstract method to fetch the primitive value for the current document.

        Maybe you could extend an existing FieldComparator to pull a NumericDocValues instance differently (or IntValues/LongValues/... on 4.x) like SortedNumericSortField does?

        Show
        Adrien Grand added a comment - I don't see a simplification, which is not a complaint but just wondering why you feel things are simpler Mike. Indeed there is no real simplification here, it just splits the API into two classes. I had to implement my own FieldComparator yesterday and it was a bit of a pain, particularly because of the primitive type'ing inhibits a generic useful base class, and so I had to write my own. It would be nice if there was an int, long, float, and double set of base classes. There is kind of a set of these but they assume they fetch from DocValues. I think with some simple changes they could be generic, and only one abstract method to fetch the primitive value for the current document. Maybe you could extend an existing FieldComparator to pull a NumericDocValues instance differently (or IntValues/LongValues/... on 4.x) like SortedNumericSortField does?
        Hide
        David Smiley added a comment -

        Maybe you could extend an existing FieldComparator to pull a NumericDocValues instance differently (or IntValues/LongValues/... on 4.x) like SortedNumericSortField does?

        I considered that but FieldComparator.LongComparator's constructor is package private, so that's a deal-breaker. In trunk (and 5x?) it's public.

        In addition, to handle a more generic case, it's problematic because there's a FieldCache/DV lookup and a Bits docsWithField lookup (and this logic is annoyingly repeated several times). See the snippet below. In my case, I'm sorting indirectly by looking up values in another document and so I need to map the current doc id to a target id. I certainly don't want that to happen twice for the same document (both the FieldCache/DV lookup and the docsWithField call). The following logic could be extracted into a method that a subclass could override, like hypothetical method getValue(doc):

              long v2 = currentReaderValues.get(doc);
              // Test for v2 == 0 to save Bits.get method call for
              // the common case (doc has value and value is non-zero):
              if (docsWithField != null && v2 == 0 && !docsWithField.get(doc)) {
                v2 = missingValue;
              }
        

        Two examples of a FieldComparator instances in Lucene that could benefit from some re-usable base classes are ExpressionComparator and ValueSourceComparator. There are probably others.

        Show
        David Smiley added a comment - Maybe you could extend an existing FieldComparator to pull a NumericDocValues instance differently (or IntValues/LongValues/... on 4.x) like SortedNumericSortField does? I considered that but FieldComparator.LongComparator's constructor is package private, so that's a deal-breaker. In trunk (and 5x?) it's public. In addition, to handle a more generic case, it's problematic because there's a FieldCache/DV lookup and a Bits docsWithField lookup (and this logic is annoyingly repeated several times). See the snippet below. In my case, I'm sorting indirectly by looking up values in another document and so I need to map the current doc id to a target id. I certainly don't want that to happen twice for the same document (both the FieldCache/DV lookup and the docsWithField call). The following logic could be extracted into a method that a subclass could override, like hypothetical method getValue(doc): long v2 = currentReaderValues.get(doc); // Test for v2 == 0 to save Bits.get method call for // the common case (doc has value and value is non-zero): if (docsWithField != null && v2 == 0 && !docsWithField.get(doc)) { v2 = missingValue; } Two examples of a FieldComparator instances in Lucene that could benefit from some re-usable base classes are ExpressionComparator and ValueSourceComparator. There are probably others.
        Hide
        Adrien Grand added a comment -

        I considered that but FieldComparator.LongComparator's constructor is package private

        What version are you using? It seems to be public in the latest release.

        I'm sorting indirectly by looking up values in another document

        Elasticsearch does something similar by allowing to sort eg. by the minimum value of all documents contained in a given block, maybe there is some logic you can reuse? https://github.com/elasticsearch/elasticsearch/blob/1.4/src/main/java/org/elasticsearch/index/fielddata/fieldcomparator/LongValuesComparatorSource.java#L64

        Two examples of a FieldComparator instances in Lucene that could benefit from some re-usable base classes are ExpressionComparator and ValueSourceComparator. There are probably others.

        +1 to improve code reuse across comparators!

        Show
        Adrien Grand added a comment - I considered that but FieldComparator.LongComparator's constructor is package private What version are you using? It seems to be public in the latest release. I'm sorting indirectly by looking up values in another document Elasticsearch does something similar by allowing to sort eg. by the minimum value of all documents contained in a given block, maybe there is some logic you can reuse? https://github.com/elasticsearch/elasticsearch/blob/1.4/src/main/java/org/elasticsearch/index/fielddata/fieldcomparator/LongValuesComparatorSource.java#L64 Two examples of a FieldComparator instances in Lucene that could benefit from some re-usable base classes are ExpressionComparator and ValueSourceComparator. There are probably others. +1 to improve code reuse across comparators!
        Hide
        David Smiley added a comment -

        What version are you using? It seems to be public in the latest release.

        4.8.1. I see it's now public in 4.10, perhaps it was 4.9.

        Elasticsearch does something similar by allowing to sort eg. by the minimum value of all documents contained in a given block, maybe there is some logic you can reuse? https://github.com/elasticsearch/elasticsearch/blob/1.4/src/main/java/org/elasticsearch/index/fielddata/fieldcomparator/LongValuesComparatorSource.java#L64

        Thanks for the pointer. But my situation can't assume a given block and must look in other segments. I realize it could be much faster if it looked within the block but that would require re-indexing all documents in a parent-child set when any change, and I'd rather not make that trade-off for this app.

        +1 to improve code reuse across comparators!

        Ok, maybe I'll add something when I get time. Time... the ultimate luxury!

        Show
        David Smiley added a comment - What version are you using? It seems to be public in the latest release. 4.8.1. I see it's now public in 4.10, perhaps it was 4.9. Elasticsearch does something similar by allowing to sort eg. by the minimum value of all documents contained in a given block, maybe there is some logic you can reuse? https://github.com/elasticsearch/elasticsearch/blob/1.4/src/main/java/org/elasticsearch/index/fielddata/fieldcomparator/LongValuesComparatorSource.java#L64 Thanks for the pointer. But my situation can't assume a given block and must look in other segments. I realize it could be much faster if it looked within the block but that would require re-indexing all documents in a parent-child set when any change, and I'd rather not make that trade-off for this app. +1 to improve code reuse across comparators! Ok, maybe I'll add something when I get time. Time... the ultimate luxury!
        Hide
        Adrien Grand added a comment -

        Updated patch to current trunk. I also did some benchmarking and the removal of the one-comparator specialization had a bad impact on performance so I added it back, we could discuss the over-specialization of top-field collectors in a different issue...

        You can find attached the (dummy) benchmark that I used to check the performance impact of this patch. Times are in milliseconds (the smaller the better).

        sort trunk patch difference
        long asc 100 108 +8%
        long desc 101 110 +9%
        double asc 107 114 +7%
        double desc 113 118 +4%
        string asc 119 123 +3%
        string desc 120 124 +3%
        long asc, double asc 98 87 -11%
        long desc, double desc 102 89 -13%

        Some cases are slightly faster, others are slightly slower. This benchmark only runs a sort to find the top 50 hits on a MatchAllDocsQuery, so differences would be even smaller if you run an actual query and/or have other collectors (eg. if you also want to compute facets).

        This patch is *only* about API. It just splits FieldComparator into

        • FieldComparator:
          • compare(int slot1, int slot2)
          • void setTopValue(T value)
          • T value(int slot)
          • LeafFieldComparator getLeafComparator(LeafReaderContext context)
        • and LeafFieldComparator:
          • int compareBottom(int doc)
          • int compareTop(int doc)
          • void copy(int slot, int doc)
          • void setScorer(Scorer scorer)

        All the logic about top-field collection is left unchanged. So there is still a single top-level priority queue that all leaf collectors update. I think changing the API is important for several reasons:

        • it makes the FieldComparator API aligned with the Collector API (LeafCollector <-> LeafFieldComparator)
        • it makes the workflow easier to understand: you need to get a LeafFieldComparator before you can call setScorer
        • Even if the patch does not contain any optimization, it would make per-segment optimizations easier. For instance, if all documents in a segment have the same value, we could ignore this sort field in comparisons. Or if an index has a single segment, we could decide to only use ordinals for comparisons and avoid copying values on each competitive hit.
        Show
        Adrien Grand added a comment - Updated patch to current trunk. I also did some benchmarking and the removal of the one-comparator specialization had a bad impact on performance so I added it back, we could discuss the over-specialization of top-field collectors in a different issue... You can find attached the (dummy) benchmark that I used to check the performance impact of this patch. Times are in milliseconds (the smaller the better). sort trunk patch difference long asc 100 108 +8% long desc 101 110 +9% double asc 107 114 +7% double desc 113 118 +4% string asc 119 123 +3% string desc 120 124 +3% long asc, double asc 98 87 -11% long desc, double desc 102 89 -13% Some cases are slightly faster, others are slightly slower. This benchmark only runs a sort to find the top 50 hits on a MatchAllDocsQuery , so differences would be even smaller if you run an actual query and/or have other collectors (eg. if you also want to compute facets). This patch is * only * about API. It just splits FieldComparator into FieldComparator: compare(int slot1, int slot2) void setTopValue(T value) T value(int slot) LeafFieldComparator getLeafComparator(LeafReaderContext context) and LeafFieldComparator: int compareBottom(int doc) int compareTop(int doc) void copy(int slot, int doc) void setScorer(Scorer scorer) All the logic about top-field collection is left unchanged. So there is still a single top-level priority queue that all leaf collectors update. I think changing the API is important for several reasons: it makes the FieldComparator API aligned with the Collector API (LeafCollector <-> LeafFieldComparator) it makes the workflow easier to understand: you need to get a LeafFieldComparator before you can call setScorer Even if the patch does not contain any optimization, it would make per-segment optimizations easier. For instance, if all documents in a segment have the same value, we could ignore this sort field in comparisons. Or if an index has a single segment, we could decide to only use ordinals for comparisons and avoid copying values on each competitive hit.
        Hide
        ASF subversion and git services added a comment -

        Commit 1649818 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1649818 ]

        LUCENE-5702: Move comparators to a per-leaf API.

        Show
        ASF subversion and git services added a comment - Commit 1649818 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1649818 ] LUCENE-5702 : Move comparators to a per-leaf API.
        Hide
        ASF subversion and git services added a comment -

        Commit 1649824 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1649824 ]

        LUCENE-5702: Move comparators to a per-leaf API.

        Show
        ASF subversion and git services added a comment - Commit 1649824 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1649824 ] LUCENE-5702 : Move comparators to a per-leaf API.
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development