Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7537

Add multi valued field support to index sorting

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0), 6.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today index sorting can be done on single valued field through the NumericDocValues (for numerics) and SortedDocValues (for strings).
      I'd like to add the ability to sort on multi valued fields. Since index sorting does not accept custom comparator we could just take the minimum value of each document for an ascending sort and the maximum value for a descending sort.
      This way we could handle all cases instead of throwing an exception during a merge when we encounter a multi valued DVs.

      1. LUCENE-7537.patch
        100 kB
        Jim Ferenczi
      2. LUCENE-7537.patch
        99 kB
        Jim Ferenczi
      3. LUCENE-7537.patch
        99 kB
        Jim Ferenczi
      4. LUCENE-7537.patch
        89 kB
        Jim Ferenczi
      5. LUCENE-7537.patch
        139 kB
        Jim Ferenczi

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        I think this makes sense.

        It seems like we just need to handle serializing of SortedSetSortField and SortedNumericSortField, with their selectors, as well?

        Show
        mikemccand Michael McCandless added a comment - I think this makes sense. It seems like we just need to handle serializing of SortedSetSortField and SortedNumericSortField , with their selectors, as well?
        Hide
        mikemccand Michael McCandless added a comment -

        I'll work on this ...

        Show
        mikemccand Michael McCandless added a comment - I'll work on this ...
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Oh I've already started to work on a patch with the logic described above
        I'll post it shortly. Thanks Michael McCandless.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Oh I've already started to work on a patch with the logic described above I'll post it shortly. Thanks Michael McCandless .
        Hide
        mikemccand Michael McCandless added a comment -

        Oh, no worries Jim Ferenczi! Thank you for tackling this.

        Show
        mikemccand Michael McCandless added a comment - Oh, no worries Jim Ferenczi ! Thank you for tackling this.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Here is a simple patch that adds the support for multi valued sort directly in SortField. It defines 5 new sort types: sorted_string, sorted_long, sorted_double, sorted_float, sorted_int and uses the Sorted

        {Set|Numeric}

        Selector for sorting. The natural order picks the minimum value of the list for each document and the reverse order picks the maximum.

        This patch also fixes a small bug which showed up in unit tests when using an index sorting with reverse sort and a missing value.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Here is a simple patch that adds the support for multi valued sort directly in SortField. It defines 5 new sort types: sorted_string, sorted_long, sorted_double, sorted_float, sorted_int and uses the Sorted {Set|Numeric} Selector for sorting. The natural order picks the minimum value of the list for each document and the reverse order picks the maximum. This patch also fixes a small bug which showed up in unit tests when using an index sorting with reverse sort and a missing value.
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jim Ferenczi, I'll have a look.

        Show
        mikemccand Michael McCandless added a comment - Thanks Jim Ferenczi , I'll have a look.
        Hide
        jpountz Adrien Grand added a comment -

        It defines 5 new sort types: sorted_string, sorted_long, sorted_double, sorted_float, sorted_int and uses the Sorted{Set|Numeric}Selector for sorting.

        The new types do not look useful to me? For instance, DocValues.getSortedSet falls back to LeafReader.getSortedDocValues if the reader does not have SORTED_SET doc values, so all the code that you protected under eg. if (sortField.getType() == SortField.Type.SORTED_STRING) would also work with single-valued (SORTED) doc values (same for SORTED_NUMERIC and NUMERIC doc values).

        Show
        jpountz Adrien Grand added a comment - It defines 5 new sort types: sorted_string, sorted_long, sorted_double, sorted_float, sorted_int and uses the Sorted{Set|Numeric}Selector for sorting. The new types do not look useful to me? For instance, DocValues.getSortedSet falls back to LeafReader.getSortedDocValues if the reader does not have SORTED_SET doc values, so all the code that you protected under eg. if (sortField.getType() == SortField.Type.SORTED_STRING) would also work with single-valued ( SORTED ) doc values (same for SORTED_NUMERIC and NUMERIC doc values).
        Hide
        mikemccand Michael McCandless added a comment -

        Hmm, I think we can take an alternate approach that does not add need to add new SortField.Type enum values, by extending IWC.setIndexSort]} to also allow {{SortedSetSortField and SortedNumericSortField?

        When serializing, we would need to check for those specific classes, and also serialize/deserialize their selectors.

        Show
        mikemccand Michael McCandless added a comment - Hmm, I think we can take an alternate approach that does not add need to add new SortField.Type enum values, by extending IWC.setIndexSort]} to also allow {{SortedSetSortField and SortedNumericSortField ? When serializing, we would need to check for those specific classes, and also serialize/deserialize their selectors.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        > The new types do not look useful to me?

        It's to differentiate the underlying DVs and also because I didn't want to change the expectation of the native sort. Though I am totally for a single type that accepts both DVs if changing the SortField native types is ok.

        > For instance, DocValues.getSortedSet falls back to LeafReader.getSortedDocValues if the reader does not have SORTED_SET doc values, so all the code that you protected under eg. if (sortField.getType() == SortField.Type.SORTED_STRING) would also work with single-valued (SORTED) doc values (same for SORTED_NUMERIC and NUMERIC doc values).

        The leniency is here to catch SortedSetDocValues that ends up with a single value per field. But yes it's another point for the merged type.

        Show
        jim.ferenczi Jim Ferenczi added a comment - > The new types do not look useful to me? It's to differentiate the underlying DVs and also because I didn't want to change the expectation of the native sort. Though I am totally for a single type that accepts both DVs if changing the SortField native types is ok. > For instance, DocValues.getSortedSet falls back to LeafReader.getSortedDocValues if the reader does not have SORTED_SET doc values, so all the code that you protected under eg. if (sortField.getType() == SortField.Type.SORTED_STRING) would also work with single-valued (SORTED) doc values (same for SORTED_NUMERIC and NUMERIC doc values). The leniency is here to catch SortedSetDocValues that ends up with a single value per field. But yes it's another point for the merged type.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks Michael McCandless. I tried this approach and then added the types to clean up the serialization and the index sorting check I can totally revert to the first version which does what you say.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks Michael McCandless . I tried this approach and then added the types to clean up the serialization and the index sorting check I can totally revert to the first version which does what you say.
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jim Ferenczi, I think it's better to have the codec take on a bit more complexity for the serialization in exchange for keeping our public sorting APIs simpler...

        Show
        mikemccand Michael McCandless added a comment - Thanks Jim Ferenczi , I think it's better to have the codec take on a bit more complexity for the serialization in exchange for keeping our public sorting APIs simpler...
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        I published a new patch which adds index sort support for SortedSetSortField and SortedNumericSortField.
        Michael McCandless can you take a look ?

        Show
        jim.ferenczi Jim Ferenczi added a comment - I published a new patch which adds index sort support for SortedSetSortField and SortedNumericSortField. Michael McCandless can you take a look ?
        Hide
        mikemccand Michael McCandless added a comment -

        Thank Jim Ferenczi, I'll have a look.

        Show
        mikemccand Michael McCandless added a comment - Thank Jim Ferenczi , I'll have a look.
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jim Ferenczi this looks cleaner!

        In Lucene62SegmentInfoFormat.java, when we throw
        IllegalArgumentException, can we change it to include the sortField,
        not just its .getType()?

        I think you need to fix SimpleTextCodec too? I hit this failure:

           [junit4] Suite: org.apache.lucene.codecs.simpletext.TestSimpleTextSegmentInfoFormat
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestSimpleTextSegmentInfoFormat -Dtests.method=testSort -Dtests.seed=61D2298FBC9DEB3E -Dtests.locale=el-CY -Dtests.timezone=America/Indiana/Petersburg -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] ERROR   0.01s J0 | TestSimpleTextSegmentInfoFormat.testSort <<<
           [junit4]    > Throwable #1: java.lang.IllegalStateException: Unexpected sort type: CUSTOM
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([61D2298FBC9DEB3E:303701A677D0F12E]:0)
           [junit4]    > 	at org.apache.lucene.codecs.simpletext.SimpleTextSegmentInfoFormat.write(SimpleTextSegmentInfoFormat.java:373)
           [junit4]    > 	at org.apache.lucene.index.BaseSegmentInfoFormatTestCase.testSort(BaseSegmentInfoFormatTestCase.java:268)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
           [junit4]   2> NOTE: leaving temporary files on disk at: /l/jim/lucene/build/codecs/test/J0/temp/lucene.codecs.simpletext.TestSimpleTextSegmentInfoFormat_61D22
        

        The new CorruptIndexException s thrown in
        Lucene62SegmentInfoFormat.java have the wrong message I think?
        Shouldn't it be something like "invalid SortedSetSelector type: " + type ?

        Can you bump the version value in Lucene62SegmentInfoFormat.java,
        and set VERSION_CURRENT to the new version? We need to do this
        when we make any index format change so that if e.g. and old Lucene
        version tries to read a newer index (written with this change) they
        see an IndexFormatTooNewException and not
        CorruptIndexException.

        Show
        mikemccand Michael McCandless added a comment - Thanks Jim Ferenczi this looks cleaner! In Lucene62SegmentInfoFormat.java , when we throw IllegalArgumentException , can we change it to include the sortField, not just its .getType()? I think you need to fix SimpleTextCodec too? I hit this failure: [junit4] Suite: org.apache.lucene.codecs.simpletext.TestSimpleTextSegmentInfoFormat [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestSimpleTextSegmentInfoFormat -Dtests.method=testSort -Dtests.seed=61D2298FBC9DEB3E -Dtests.locale=el-CY -Dtests.timezone=America/Indiana/Petersburg -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.01s J0 | TestSimpleTextSegmentInfoFormat.testSort <<< [junit4] > Throwable #1: java.lang.IllegalStateException: Unexpected sort type: CUSTOM [junit4] > at __randomizedtesting.SeedInfo.seed([61D2298FBC9DEB3E:303701A677D0F12E]:0) [junit4] > at org.apache.lucene.codecs.simpletext.SimpleTextSegmentInfoFormat.write(SimpleTextSegmentInfoFormat.java:373) [junit4] > at org.apache.lucene.index.BaseSegmentInfoFormatTestCase.testSort(BaseSegmentInfoFormatTestCase.java:268) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: leaving temporary files on disk at: /l/jim/lucene/build/codecs/test/J0/temp/lucene.codecs.simpletext.TestSimpleTextSegmentInfoFormat_61D22 The new CorruptIndexException s thrown in Lucene62SegmentInfoFormat.java have the wrong message I think? Shouldn't it be something like "invalid SortedSetSelector type: " + type ? Can you bump the version value in Lucene62SegmentInfoFormat.java , and set VERSION_CURRENT to the new version? We need to do this when we make any index format change so that if e.g. and old Lucene version tries to read a newer index (written with this change) they see an IndexFormatTooNewException and not CorruptIndexException .
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks Michael McCandless. Sorry I didn't ran the full tests for the last patch.
        I've attached a new one which passes all the tests. I fixed the exceptions and the SimpleText codec. Could you take another look ?

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks Michael McCandless . Sorry I didn't ran the full tests for the last patch. I've attached a new one which passes all the tests. I fixed the exceptions and the SimpleText codec. Could you take another look ?
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jim Ferenczi, this looks very close!

        For SimpleText, could we use "multi_valued_string" instead of "sorted_string" to describe it in the segment info format?

        Can we also name the new tests with "multi valued" instead of "sorted", e.g. testBasicSortedLong --> testMultiValuedLong?

        Can we change this:

        throw new IllegalStateException("Unexpected SortedNumericSortField type: " + sortField.getType());
        

        to this?:

        throw new IllegalStateException("Unexpected SortedNumericSortField " + sortField);
        

        Just so we get more information if we ever do hit these exceptions...

        Instead of:

          static final int VERSION_CURRENT = 1;
        

        Can you introduce a separate constant for the value 1? E.g.:

          static final int VERSION_MULTI_VALUED_SORT = 1;
          static final int VERSION_CURRENT = VERSION_MULTI_VALUED_SORT;
        

        It just gives us a bit more info about why the format change happened.

        Hmm, I see we have no back-compat testing of sorted indices ... I'll
        open an issue to fix that. Once this change is released to the wild
        (6.4.0), we have to improve that to include multi-valued sort ... I'll
        do that too.

        Show
        mikemccand Michael McCandless added a comment - Thanks Jim Ferenczi , this looks very close! For SimpleText , could we use "multi_valued_string" instead of "sorted_string" to describe it in the segment info format? Can we also name the new tests with "multi valued" instead of "sorted", e.g. testBasicSortedLong --> testMultiValuedLong ? Can we change this: throw new IllegalStateException("Unexpected SortedNumericSortField type: " + sortField.getType()); to this?: throw new IllegalStateException("Unexpected SortedNumericSortField " + sortField); Just so we get more information if we ever do hit these exceptions... Instead of: static final int VERSION_CURRENT = 1; Can you introduce a separate constant for the value 1 ? E.g.: static final int VERSION_MULTI_VALUED_SORT = 1; static final int VERSION_CURRENT = VERSION_MULTI_VALUED_SORT; It just gives us a bit more info about why the format change happened. Hmm, I see we have no back-compat testing of sorted indices ... I'll open an issue to fix that. Once this change is released to the wild (6.4.0), we have to improve that to include multi-valued sort ... I'll do that too.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks Michael McCandless, I attached a new patch that addresses your comments.
        I can also make another path for 6.4 if needed.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks Michael McCandless , I attached a new patch that addresses your comments. I can also make another path for 6.4 if needed.
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jim Ferenczi; I still see e.g.:

        +          case "sorted_string":
        +            type = SortField.Type.STRING;
        +            selectorSet = readSetSelector(input, scratch);
        +            break;
        

        in SimpleText ... can we maybe rename that to:

              case "multi_valued_string":
                ...
        

        Otherwise I think this is ready!

        Show
        mikemccand Michael McCandless added a comment - Thanks Jim Ferenczi ; I still see e.g.: + case "sorted_string": + type = SortField.Type.STRING; + selectorSet = readSetSelector(input, scratch); + break; in SimpleText ... can we maybe rename that to: case "multi_valued_string": ... Otherwise I think this is ready!
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Oh right "sorted_string" is ambiguous. Here is another patch with the renaming to "multi_valued" for string and numerics.
        Thanks Michael McCandless

        Show
        jim.ferenczi Jim Ferenczi added a comment - Oh right "sorted_string" is ambiguous. Here is another patch with the renaming to "multi_valued" for string and numerics. Thanks Michael McCandless
        Hide
        mikemccand Michael McCandless added a comment -

        Jim Ferenczi thanks, this looks awesome ... I'll run tests and push soon.

        Show
        mikemccand Michael McCandless added a comment - Jim Ferenczi thanks, this looks awesome ... I'll run tests and push soon.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 6c3c6bc3797307efa13cae06778d41f24a26bccb in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c3c6bc ]

        LUCENE-7537: Index time sorting now supports multi-valued sorts using selectors (MIN, MAX, etc.)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 6c3c6bc3797307efa13cae06778d41f24a26bccb in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c3c6bc ] LUCENE-7537 : Index time sorting now supports multi-valued sorts using selectors (MIN, MAX, etc.)
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit e357f957f3059add5582b9695f838794c386dcad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e357f95 ]

        LUCENE-7537: Index time sorting now supports multi-valued sorts using selectors (MIN, MAX, etc.)

        Show
        jira-bot ASF subversion and git services added a comment - Commit e357f957f3059add5582b9695f838794c386dcad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e357f95 ] LUCENE-7537 : Index time sorting now supports multi-valued sorts using selectors (MIN, MAX, etc.)
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 64b9eefaa931b4fc8b2345e2307eff4a317e3450 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=64b9eef ]

        LUCENE-7537: fix some 6.x backport issues

        Show
        jira-bot ASF subversion and git services added a comment - Commit 64b9eefaa931b4fc8b2345e2307eff4a317e3450 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=64b9eef ] LUCENE-7537 : fix some 6.x backport issues
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jim Ferenczi!

        Show
        mikemccand Michael McCandless added a comment - Thanks Jim Ferenczi !

          People

          • Assignee:
            Unassigned
            Reporter:
            jim.ferenczi Jim Ferenczi
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development