Lucene - Core
  1. Lucene - Core
  2. LUCENE-5868

JoinUtil support for NUMERIC docValues fields

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      while polishing SOLR-6234 I found that JoinUtil can't join int dv fields at least.
      I plan to provide test/patch. It might be important, because Solr's join can do that. Please vote if you care!

      1. LUCENE-5868.patch
        41 kB
        Mikhail Khludnev
      2. LUCENE-5868.patch
        40 kB
        Mikhail Khludnev
      3. LUCENE-5868.patch
        38 kB
        Mikhail Khludnev
      4. LUCENE-5868.patch
        49 kB
        Mikhail Khludnev
      5. LUCENE-5868.patch
        39 kB
        Mikhail Khludnev
      6. LUCENE-5868.patch
        24 kB
        Alexey Zelin
      7. LUCENE-5868-5x.patch
        44 kB
        Mikhail Khludnev
      8. LUCENE-5868-5x.patch
        45 kB
        Mikhail Khludnev
      9. LUCENE-5868-lambdarefactoring.patch
        20 kB
        Mikhail Khludnev
      10. LUCENE-5868-lambdarefactoring.patch
        15 kB
        Mikhail Khludnev
      11. qtj.diff
        12 kB
        Alexey Zelin

        Issue Links

          Activity

          Hide
          Mikhail Khludnev added a comment -

          It seems nobody cares. Closing so far.

          Show
          Mikhail Khludnev added a comment - It seems nobody cares. Closing so far.
          Hide
          marc schipperheyn added a comment -

          I'd vote for it.

          Show
          marc schipperheyn added a comment - I'd vote for it.
          Hide
          Mikhail Khludnev added a comment -

          marc schipperheyn what's your usecase? why you can't use SORTED? do you need to join cross cores? Have you had a look at OrdinalMap join?

          Show
          Mikhail Khludnev added a comment - marc schipperheyn what's your usecase? why you can't use SORTED? do you need to join cross cores? Have you had a look at OrdinalMap join?
          Hide
          Alexey Zelin added a comment - - edited

          Numeric type processing added during join query build for single value case and int values. See attached qty.diff

          Show
          Alexey Zelin added a comment - - edited Numeric type processing added during join query build for single value case and int values. See attached qty.diff
          Hide
          Mikhail Khludnev added a comment -

          Alexey Zelin the patch make sense please pay attention to the following points:

          • http://wiki.apache.org/lucene-java/HowToContribute

            and save them into the LUCENE-NNNN.patch file.

            Read the patch file. Make sure it includes ONLY the modifications required to fix a single issue.

          • I suppose we need to cover all existing cases, ie. the scope of the issue should include:
            TermsCollector.MV, TermsCollector.SV, TermsWithScoreCollector.MV, TermsWithScoreCollector.MV.Avg, TermsWithScoreCollector.SV, TermsWithScoreCollector.SV.Avg,... yepp too many, I see.
          • as an idea to avoid copy-paste by bridging different DV types. NumericDocValues can be adapted to BinaryDocValues
          • such adapter can reuse BytesRefBuilder (giving that BytesRefHash copies bytes)
          • the same approach can be done with adapting SortedNumericDocValues to SortedSetDocValues
          • I suppose it's ok to keep it lenient: silently allow to shoot legs by having different DV types across segments.
          • As I understand, TestJoinUtilInt is just a first scratch. I suppose it's worth to accurately expand existing tests:
            • I suppose TestJoinUtil.testSimple() testSimpleWithScoring() you can add from_num to_num fields into sample docs, and randomly switch these fields for passing into createJoinQuery()
            • in TestJoinUtilInt you are trying to create numeric DV by setDocValuesType(DocValuesType.NUMERIC);, I don't belive it work, and it's handled by UnInvertingReader in run-time. So, I suggest to add NumericDocValuesField and SortedNumericDocValuesField (as mv case) explicitly. But let's randomly switch to existing approach (just add indexed field and rely on UnInvertingReader) just for smoke testing.

          I'll handle as separate issues:

          • extending TestScoreJoinQPScore.testSimpleWithScoring() for coverage
          • extending TestJoinUtil.test*ValueRandomJoin() for coverage

          Beside of the patch, for further consideration: if we could provide field types by something like Solr Schema/FieldTypes into JoinUtil. such issue would be autodone.

          Show
          Mikhail Khludnev added a comment - Alexey Zelin the patch make sense please pay attention to the following points: http://wiki.apache.org/lucene-java/HowToContribute and save them into the LUCENE-NNNN.patch file. Read the patch file. Make sure it includes ONLY the modifications required to fix a single issue. I suppose we need to cover all existing cases, ie. the scope of the issue should include: TermsCollector.MV, TermsCollector.SV, TermsWithScoreCollector.MV, TermsWithScoreCollector.MV.Avg, TermsWithScoreCollector.SV, TermsWithScoreCollector.SV.Avg ,... yepp too many, I see. as an idea to avoid copy-paste by bridging different DV types. NumericDocValues can be adapted to BinaryDocValues such adapter can reuse BytesRefBuilder (giving that BytesRefHash copies bytes) the same approach can be done with adapting SortedNumericDocValues to SortedSetDocValues I suppose it's ok to keep it lenient: silently allow to shoot legs by having different DV types across segments. As I understand, TestJoinUtilInt is just a first scratch. I suppose it's worth to accurately expand existing tests: I suppose TestJoinUtil.testSimple() testSimpleWithScoring() you can add from_num to_num fields into sample docs, and randomly switch these fields for passing into createJoinQuery() in TestJoinUtilInt you are trying to create numeric DV by setDocValuesType(DocValuesType.NUMERIC); , I don't belive it work, and it's handled by UnInvertingReader in run-time. So, I suggest to add NumericDocValuesField and SortedNumericDocValuesField (as mv case) explicitly. But let's randomly switch to existing approach (just add indexed field and rely on UnInvertingReader) just for smoke testing. I'll handle as separate issues: extending TestScoreJoinQPScore.testSimpleWithScoring() for coverage extending TestJoinUtil.test*ValueRandomJoin() for coverage Beside of the patch, for further consideration: if we could provide field types by something like Solr Schema/FieldTypes into JoinUtil. such issue would be autodone.
          Hide
          Alexey Zelin added a comment -

          Added long values support as well as multi value fields support.

          Show
          Alexey Zelin added a comment - Added long values support as well as multi value fields support.
          Hide
          Mikhail Khludnev added a comment -

          I decided to attach a wild lambda refactoring first. There is no change functionally. Martijn van Groningen and all lambda-fans, you are kindly invited to have a look at LUCENE-5868-lambdarefactoring.patch.

          Show
          Mikhail Khludnev added a comment - I decided to attach a wild lambda refactoring first. There is no change functionally. Martijn van Groningen and all lambda -fans, you are kindly invited to have a look at LUCENE-5868-lambdarefactoring.patch .
          Hide
          Mikhail Khludnev added a comment -

          attaching LUCENE-5868.patch. Glue all stuff together. The significant change is introducing a new signature in JoinUtil, you know why:

            public static Query createJoinQuery(String fromField,
                boolean multipleValuesPerDocument,
                String toField, 
          NumericType numericType,
                Query fromQuery,
                IndexSearcher fromSearcher,
                ScoreMode scoreMode) throws IOException 
          

          I added existing test in the patch. Test coverage needs to be improved.

          Opinions?

          Show
          Mikhail Khludnev added a comment - attaching LUCENE-5868.patch . Glue all stuff together. The significant change is introducing a new signature in JoinUtil , you know why: public static Query createJoinQuery( String fromField, boolean multipleValuesPerDocument, String toField, NumericType numericType, Query fromQuery, IndexSearcher fromSearcher, ScoreMode scoreMode) throws IOException I added existing test in the patch. Test coverage needs to be improved. Opinions?
          Hide
          David Smiley added a comment -

          Nice work Mikhail! I love the lambdas.

          Some random comments:

          • coder() could take the field name so that the IllegalArgumentException can report the field in error
          • please put spaces after if and around else. This is the dominant style in our codebase, and I prefer it too, FWIW.
          • createCollectorSV() could have one switch instead of an if and then a switch; no?
          Show
          David Smiley added a comment - Nice work Mikhail! I love the lambdas. Some random comments: coder() could take the field name so that the IllegalArgumentException can report the field in error please put spaces after if and around else . This is the dominant style in our codebase, and I prefer it too, FWIW. createCollectorSV() could have one switch instead of an if and then a switch; no?
          Hide
          Mikhail Khludnev added a comment - - edited

          Thanks, David Smiley! Do you really think this surgery makes sense?
          I addressed your points above. The curios fact is that joining by numbers takes more heap than strings.
          I tried to provide testRandom..() coverage. so far it fails on comparing score values.
          see recent LUCENE-5868.patch.

          Show
          Mikhail Khludnev added a comment - - edited Thanks, David Smiley ! Do you really think this surgery makes sense? I addressed your points above. The curios fact is that joining by numbers takes more heap than strings. I tried to provide testRandom..() coverage. so far it fails on comparing score values. see recent LUCENE-5868.patch .
          Hide
          David Smiley added a comment -

          (I edited your comment to correct your reference to me, not a similarly named person)

          I think the surgery makes sense. It adds a useful feature. The approach leverages the existing underlying BytesRefHash which may not be as optimal as some sort of LongHash or similar but whatever – progress not perfection. Should someone care, such improvements could be made later.

          I admit I didn't look at the details of your tests; that would take much more time. I was mostly curious about the implementation side and of the lambdas you made reference to.

          Show
          David Smiley added a comment - (I edited your comment to correct your reference to me, not a similarly named person) I think the surgery makes sense. It adds a useful feature. The approach leverages the existing underlying BytesRefHash which may not be as optimal as some sort of LongHash or similar but whatever – progress not perfection. Should someone care, such improvements could be made later. I admit I didn't look at the details of your tests; that would take much more time. I was mostly curious about the implementation side and of the lambdas you made reference to.
          Hide
          Martijn van Groningen added a comment -

          +1 this looks good. One small thing, maybe rename the parameter name in the protected `createJoinQuery(...)` method from `termsWithScoreCollector` to just `collector`?

          Show
          Martijn van Groningen added a comment - +1 this looks good. One small thing, maybe rename the parameter name in the protected `createJoinQuery(...)` method from `termsWithScoreCollector` to just `collector`?
          Hide
          Mikhail Khludnev added a comment -

          Martijn van Groningen, I renamed parameter, yet locally.

          It turns out that random test fails because SortedSetDV omits duplicate values, but SortedNumberDV doesn't that leads to discrepancy in the scores. I changed TestJoinUtil.createContext(int, boolean, boolean) to deduplicate link values. Score asserts still fail.

          Show
          Mikhail Khludnev added a comment - Martijn van Groningen , I renamed parameter, yet locally. It turns out that random test fails because SortedSetDV omits duplicate values, but SortedNumberDV doesn't that leads to discrepancy in the scores. I changed TestJoinUtil.createContext(int, boolean, boolean) to deduplicate link values. Score asserts still fail.
          Hide
          Mikhail Khludnev added a comment -

          LUCENE-5868.patch It's done, after all. I had to tweak TestJoinUtil for random tests. Now it generates values ordered similarly for strings, and numbers. It also dedupes values for from, to fields, because numeric docvalues store duplicates and it impact scoring in tests. Now, there is no "simple" test coverage for numeric join. I don't think it's necessary, perhaps I'll cover it in simple Solr case.
          I want to commit it next week into trunk and 5.x to let it out in 5.5. Please let me know if you wish to veto it. Reviews and ideas are welcome as usual!!

          Show
          Mikhail Khludnev added a comment - LUCENE-5868.patch It's done, after all. I had to tweak TestJoinUtil for random tests. Now it generates values ordered similarly for strings, and numbers. It also dedupes values for from , to fields, because numeric docvalues store duplicates and it impact scoring in tests. Now, there is no "simple" test coverage for numeric join. I don't think it's necessary, perhaps I'll cover it in simple Solr case. I want to commit it next week into trunk and 5.x to let it out in 5.5 . Please let me know if you wish to veto it. Reviews and ideas are welcome as usual!!
          Hide
          Mikhail Khludnev added a comment -

          even more efficient shuffling in test LUCENE-5868.patch

          Show
          Mikhail Khludnev added a comment - even more efficient shuffling in test LUCENE-5868.patch
          Hide
          Mikhail Khludnev added a comment -

          I'm going to commit LUCENE-5868.patch in trunk and 5.x then in couple of hours.

          • applied all suggested changes
          • amended CHANGES.txt
          • checked precommit and javadocs
          Show
          Mikhail Khludnev added a comment - I'm going to commit LUCENE-5868.patch in trunk and 5.x then in couple of hours. applied all suggested changes amended CHANGES.txt checked precommit and javadocs
          Hide
          ASF subversion and git services added a comment -

          Commit 1718443 from mkhl@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1718443 ]

          LUCENE-5868: query-time join for numerics

          Show
          ASF subversion and git services added a comment - Commit 1718443 from mkhl@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1718443 ] LUCENE-5868 : query-time join for numerics
          Hide
          Mikhail Khludnev added a comment -

          oh.my... there are no λ-s in 5.x. the patch is gonna missing its' most of beauty

          Show
          Mikhail Khludnev added a comment - oh.my... there are no λ-s in 5.x. the patch is gonna missing its' most of beauty
          Hide
          Mikhail Khludnev added a comment -

          moved to java7, got LUCENE-5868-5x.patch

          Show
          Mikhail Khludnev added a comment - moved to java7, got LUCENE-5868-5x.patch
          Hide
          Mikhail Khludnev added a comment -

          fixed some precommit issues in LUCENE-5868-5x.patch

          Show
          Mikhail Khludnev added a comment - fixed some precommit issues in LUCENE-5868-5x.patch
          Hide
          ASF subversion and git services added a comment -

          Commit 1718473 from mkhl@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1718473 ]

          LUCENE-5868: query-time join for numerics

          Show
          ASF subversion and git services added a comment - Commit 1718473 from mkhl@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1718473 ] LUCENE-5868 : query-time join for numerics
          Show
          Mikhail Khludnev added a comment - it passed https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/875/console
          Hide
          Steve Rowe added a comment -

          Compilation is failing on branch_5x: https://builds.apache.org/job/Lucene-Artifacts-5.x/1037/ (java8 constructs on a java7 build I imagine):

             [javac] Compiling 6 source files to /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/build/join/classes/test
             [javac] /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java:954: error: cannot find symbol
             [javac]         assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16);
             [javac]                                  ^
             [javac]   symbol:   method parseUnsignedInt(String,int)
             [javac]   location: class Integer
             [javac] /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java:1206: error: cannot find symbol
             [javac]     final int linkInt = Integer.parseUnsignedInt(linkValue,16);
             [javac]                                ^
             [javac]   symbol:   method parseUnsignedInt(String,int)
             [javac]   location: class Integer
             [javac] Note: /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java uses or overrides a deprecated API.
             [javac] Note: Recompile with -Xlint:deprecation for details.
             [javac] 2 errors
          
          Show
          Steve Rowe added a comment - Compilation is failing on branch_5x: https://builds.apache.org/job/Lucene-Artifacts-5.x/1037/ (java8 constructs on a java7 build I imagine): [javac] Compiling 6 source files to /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/build/join/classes/test [javac] /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java:954: error: cannot find symbol [javac] assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16); [javac] ^ [javac] symbol: method parseUnsignedInt(String,int) [javac] location: class Integer [javac] /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java:1206: error: cannot find symbol [javac] final int linkInt = Integer.parseUnsignedInt(linkValue,16); [javac] ^ [javac] symbol: method parseUnsignedInt(String,int) [javac] location: class Integer [javac] Note: /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] 2 errors
          Hide
          Mikhail Khludnev added a comment -

          I'm going to fix it

          Show
          Mikhail Khludnev added a comment - I'm going to fix it
          Hide
          ASF subversion and git services added a comment -

          Commit 1718517 from mkhl@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1718517 ]

          LUCENE-5868: removing Java8's parseUnsignedInt

          Show
          ASF subversion and git services added a comment - Commit 1718517 from mkhl@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1718517 ] LUCENE-5868 : removing Java8's parseUnsignedInt
          Hide
          Mikhail Khludnev added a comment - - edited
          Show
          Mikhail Khludnev added a comment - - edited Oh. Yeah!! 5.x is fixed https://builds.apache.org/job/Lucene-Solr-Tests-5.x-Java7/3822/changes
          Hide
          Mikhail Khludnev added a comment -

          Suggestions for javadoc are accepted

          Show
          Mikhail Khludnev added a comment - Suggestions for javadoc are accepted
          Hide
          Mikhail Khludnev added a comment -
          Index: lucene/common-build.xml
          ===================================================================
          --- lucene/common-build.xml	(revision 1718516)
          +++ lucene/common-build.xml	(working copy)
          @@ -321,6 +321,12 @@
               </condition>
             </fail>
           
          +  <fail message="Maximum supported Java version is 1.7.">
          +    <condition>
          +      <hasmethod classname="java.lang.Integer" method="parseUnsignedInt"/>
          +    </condition>
          +  </fail>
          +	
             <!-- temporary for cleanup of java.specification.version, to be in format "x.y" -->
             <loadresource property="-cleaned.specification.version">
               <propertyresource name="java.specification.version"/>
          

          This amendment will keep me from such mistakes in 5.x. Let me know if there is a more regular approach.

          Show
          Mikhail Khludnev added a comment - Index: lucene/common-build.xml =================================================================== --- lucene/common-build.xml (revision 1718516) +++ lucene/common-build.xml (working copy) @@ -321,6 +321,12 @@ </condition> </fail> + <fail message= "Maximum supported Java version is 1.7." > + <condition> + <hasmethod classname= "java.lang. Integer " method= "parseUnsignedInt" /> + </condition> + </fail> + <!-- temporary for cleanup of java.specification.version, to be in format "x.y" --> <loadresource property= "-cleaned.specification.version" > <propertyresource name= "java.specification.version" /> This amendment will keep me from such mistakes in 5.x. Let me know if there is a more regular approach.
          Hide
          Mikhail Khludnev added a comment -

          raised SOLR-8395 as Solr enablement

          Show
          Mikhail Khludnev added a comment - raised SOLR-8395 as Solr enablement

            People

            • Assignee:
              Mikhail Khludnev
              Reporter:
              Mikhail Khludnev
            • Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development