Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Global ordinal based query time join as an alternative to the current query time join. The implementation is faster for subsequent joins between reopens, but requires an OrdinalMap to be built.

      This join has certain restrictions and requirements:

      • A document can only refer to one other document (but it can be referred to by one or more documents).
      • A type field must exist on all documents and each document must be categorized to a type. This is needed to distinguish between the "from" and "to" side.
      • There must be a single sorted doc values field used by both the "from" and "to" documents. By encoding the join into a single doc values field it is trivial to build an OrdinalMap from it. A minimal indexing sketch is shown below.
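
      To make these requirements concrete, here is a minimal indexing sketch. The field names ("type", "join_field") and the key "p1" are made up for illustration; any names work as long as the join key lives in the same single sorted doc values field on both sides.

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.SortedDocValuesField;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.util.BytesRef;

        // "to" side, e.g. a product document
        Document product = new Document();
        product.add(new StringField("type", "product", Field.Store.NO));          // type field
        product.add(new SortedDocValuesField("join_field", new BytesRef("p1")));  // join key

        // "from" side, e.g. an offer that refers to exactly one product
        Document offer = new Document();
        offer.add(new StringField("type", "offer", Field.Store.NO));              // type field
        offer.add(new SortedDocValuesField("join_field", new BytesRef("p1")));    // same doc values field

        // At search time an OrdinalMap is built once for "join_field" on the top-level reader
        // and passed to the new JoinUtil method, together with a query selecting the "from"
        // type and a query selecting the "to" type (see the attached patch for the exact API).
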
      Attachments

      1. LUCENE-6352.patch
        63 kB
        Martijn van Groningen
      2. LUCENE-6352.patch
        63 kB
        Martijn van Groningen
      3. LUCENE-6352.patch
        63 kB
        Martijn van Groningen
      4. LUCENE-6352.patch
        63 kB
        Martijn van Groningen
      5. LUCENE-6352.patch
        62 kB
        Martijn van Groningen
      6. LUCENE-6352.patch
        52 kB
        Martijn van Groningen
      7. LUCENE-6352.patch
        20 kB
        Martijn van Groningen

        Activity

        Martijn van Groningen added a comment -

        Attached draft patch.

        Martijn van Groningen added a comment -

        Attached a new version of the global ordinal based query time join:

        • Added support for the max, total and avg score modes.
        • Added more tests.
        Martijn van Groningen added a comment -

        Attached a new version:

        • Added hashCode/equals/extractTerms methods to the query impls.
        • Added an optimization for the case where an index has only one segment.
        • Updated to the latest two-phase iterator changes.
        Adrien Grand added a comment -

        Thanks Martijn! I had a look at the patch; it looks very clean, I like it.

        Query rewrittenFromQuery = fromQuery.rewrite(indexReader); (JoinUtil.java)
        

        I think you should rather call searcher.rewrite(fromQuery) here, which will take care of rewriting until rewrite returns 'this'.
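
        A sketch of the loop that searcher.rewrite performs (not the literal Lucene source, just the idea, reusing the names from the quoted line):

        Query rewritten = fromQuery;
        // keep rewriting until the query returns itself, i.e. it is fully rewritten
        for (Query r = rewritten.rewrite(indexReader); r != rewritten; r = rewritten.rewrite(indexReader)) {
          rewritten = r;
        }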

        final float[][] blocks = new float[Integer.MAX_VALUE / arraySize][];
        

        Instead of allocating based on Integer.MAX_VALUE, maybe it should use the number of unique values? ie. '(int) (((long) valueCount + arraySize - 1) / arraySize)' ?

        return new ComplexExplanation(true, score, "Score based on join value " + joinValue.utf8ToString());
        

        I don't think it is safe to convert to a string as we have no idea whether the value represents a UTF-8 string?

        In BaseGlobalOrdinalScorer, you are caching the current doc ID, maybe we should not? When I worked on approximations, caching the current doc ID proved to be quite error-prone and it was often better to just call approximation.docID() when the current doc ID was needed.
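
        For example, the docID() of such a scorer could simply delegate (a hypothetical fragment, assuming the approximation is held as a DocIdSetIterator field):

        @Override
        public int docID() {
          // no cached field to keep in sync; the approximation is the single source of truth
          return approximation.docID();
        }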

        Another thing I'm wondering about is the equals/hashCode impl of this global ordinal query: since documents that match depend on what happens in other segments, this query cannot be cached per segment. So maybe it should include the current IndexReader in its equals/hashCode comparison in order to work correctly with query caches? In the read-only case, this would still allow this query to be cached since the current reader never changes while in the read/write case this query will unlikely be cached given that the query cache will notice that it does not get reused?
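
        A sketch of what that could look like (the class and field names here are assumptions, not necessarily those used in the patch):

        @Override
        public boolean equals(Object obj) {
          if (this == obj) return true;
          if (obj == null || getClass() != obj.getClass()) return false;
          GlobalOrdinalsQuery other = (GlobalOrdinalsQuery) obj;
          return fromQuery.equals(other.fromQuery)
              && joinField.equals(other.joinField)
              && indexReader == other.indexReader; // reader identity ties cached entries to one reader
        }

        @Override
        public int hashCode() {
          int h = fromQuery.hashCode();
          h = 31 * h + joinField.hashCode();
          h = 31 * h + System.identityHashCode(indexReader); // identity, not content
          return h;
        }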

        Martijn van Groningen added a comment -

        Adrien, thanks for taking a look at it!

        I updated the patch and applied all the comments except the one about explain. I wonder if we can check whether a BytesRef is valid UTF-8 and then convert it? Otherwise just use the BytesRef directly. I'd like the explain output to be somewhat useful and this is the best I can think of right now.
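
        One way such a check could be done with plain JDK classes (java.nio.charset) is a strict CharsetDecoder; this is only a sketch, not code from the patch, and whether it is worth the cost in explain() is debatable:

        // try to decode the join value strictly as UTF-8; fall back to BytesRef's
        // hex-style toString() if the bytes are not well-formed UTF-8
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        String display;
        try {
          display = decoder.decode(
              ByteBuffer.wrap(joinValue.bytes, joinValue.offset, joinValue.length)).toString();
        } catch (CharacterCodingException e) {
          display = joinValue.toString();
        }
        return new ComplexExplanation(true, score, "Score based on join value " + display);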

        Adrien Grand added a comment -

        Or maybe we could just document that this feature expects that the join field stores utf8 string values?

        Martijn van Groningen added a comment -

        Yes, that makes sense. I updated the JoinUtil javadocs to reflect this.

        Martijn van Groningen added a comment -

        Attached another patch; I forgot to change the size of the int[][] array in GlobalOrdinalsWithScoreCollector.Occurrences.

        Adrien Grand added a comment -

        Just had another look at the patch and found two issues:

        • Occurrences still allocates blocks using MAX_VALUE instead of the number of docs per segment
        • Scores allocates using '(valueCount + arraySize - 1) / arraySize' but I think we need to cast to a long before the addition and then back to an int after the division in order to avoid overflows if the doc count in the segment is greater than MAX_VALUE - arraySize. So this would be: '(int) (((long) valueCount + arraySize - 1) / arraySize)'

        Otherwise +1 to commit! This is an interesting usage of two-phase iteration.
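
        To make the overflow in the second point concrete (a standalone demo with made-up numbers):

        public class OverflowDemo {
          public static void main(String[] args) {
            final int arraySize = 4096;
            final int valueCount = Integer.MAX_VALUE - 100; // greater than MAX_VALUE - arraySize

            // int arithmetic: valueCount + arraySize - 1 wraps around to a negative number
            int broken = (valueCount + arraySize - 1) / arraySize;
            // promote to long before the addition and cast back after the division: no overflow
            int safe = (int) (((long) valueCount + arraySize - 1) / arraySize);

            System.out.println(broken); // negative, would throw NegativeArraySizeException as an array length
            System.out.println(safe);   // 524288, the expected number of blocks
          }
        }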

        Martijn van Groningen added a comment -

        Good point about the potential overflow issue! I changed the patch to avoid this. I'll commit this shortly to trunk and 5x.

        ASF subversion and git services added a comment -

        Commit 1670990 from Martijn van Groningen in branch 'dev/trunk'
        [ https://svn.apache.org/r1670990 ]

        LUCENE-6352: Added a new query time join to the join module that uses global ordinals, which is faster for subsequent joins between reopens.

        ASF subversion and git services added a comment -

        Commit 1670991 from Martijn van Groningen in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1670991 ]

        LUCENE-6352: Added a new query time join to the join module that uses global ordinals, which is faster for subsequent joins between reopens.

        ASF subversion and git services added a comment -

        Commit 1671774 from Martijn van Groningen in branch 'dev/trunk'
        [ https://svn.apache.org/r1671774 ]

        LUCENE-6352: Improved tests for global ordinal join

        ASF subversion and git services added a comment -

        Commit 1671777 from Martijn van Groningen in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1671777 ]

        LUCENE-6352: Improved tests for global ordinal join


          People

          • Assignee:
            Unassigned
          • Reporter:
            Martijn van Groningen
          • Votes:
            1
          • Watchers:
            5
