Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7736

Expose some IndexReader stats via DoubleValuesSources

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We have a number of ValueSource implementations that expose IndexReader stats (numDocs, termFreq, etc). We should re-implement these as DoubleValuesSources, allowing them to be used in FunctionScoreQuery, etc.

      1. LUCENE-7736.patch
        20 kB
        Alan Woodward
      2. LUCENE-7736.patch
        33 kB
        Alan Woodward
      3. LUCENE-7736.patch
        38 kB
        Alan Woodward
      4. LUCENE-7736.patch
        50 kB
        Alan Woodward
      5. LUCENE-7736.patch
        55 kB
        Alan Woodward
      6. LUCENE-7736.patch
        59 kB
        Alan Woodward

        Activity

        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 6b621ea659b5a81f98a432b4c5b4f061947b4c41 in lucene-solr's branch refs/heads/branch_7x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6b621ea ]

        LUCENE-7736: CoveringQuery needs to rewrite its ValuesSource.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 6b621ea659b5a81f98a432b4c5b4f061947b4c41 in lucene-solr's branch refs/heads/branch_7x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6b621ea ] LUCENE-7736 : CoveringQuery needs to rewrite its ValuesSource.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 3f0b240853dac1e9fad3b29042d9aaf6e9f0e634 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3f0b240 ]

        LUCENE-7736: CoveringQuery needs to rewrite its ValuesSource.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 3f0b240853dac1e9fad3b29042d9aaf6e9f0e634 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3f0b240 ] LUCENE-7736 : CoveringQuery needs to rewrite its ValuesSource.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit e53ce592baf3db5bd6290955a8175f51d432b532 in lucene-solr's branch refs/heads/master from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e53ce59 ]

        LUCENE-7736: CHANGES.txt

        Show
        jira-bot ASF subversion and git services added a comment - Commit e53ce592baf3db5bd6290955a8175f51d432b532 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e53ce59 ] LUCENE-7736 : CHANGES.txt
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 2a4dd499bbd2f5ed1f481e17eeaee4f32b927185 in lucene-solr's branch refs/heads/master from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2a4dd49 ]

        LUCENE-7736: IndexReaderValues

        Show
        jira-bot ASF subversion and git services added a comment - Commit 2a4dd499bbd2f5ed1f481e17eeaee4f32b927185 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2a4dd49 ] LUCENE-7736 : IndexReaderValues
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit fefc9899588de9979c5c0973fec575e6c44de126 in lucene-solr's branch refs/heads/branch_7x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fefc989 ]

        LUCENE-7736: CHANGES.txt

        Show
        jira-bot ASF subversion and git services added a comment - Commit fefc9899588de9979c5c0973fec575e6c44de126 in lucene-solr's branch refs/heads/branch_7x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fefc989 ] LUCENE-7736 : CHANGES.txt
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 03afeb7766a39996de3c85e8a6ab24d2a352dd1c in lucene-solr's branch refs/heads/branch_7x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03afeb7 ]

        LUCENE-7736: IndexReaderValues

        Show
        jira-bot ASF subversion and git services added a comment - Commit 03afeb7766a39996de3c85e8a6ab24d2a352dd1c in lucene-solr's branch refs/heads/branch_7x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03afeb7 ] LUCENE-7736 : IndexReaderValues
        Hide
        romseygeek Alan Woodward added a comment -

        Updated to master; all reader-specific functions here now return 'false' from isCacheable()

        Show
        romseygeek Alan Woodward added a comment - Updated to master; all reader-specific functions here now return 'false' from isCacheable()
        Hide
        romseygeek Alan Woodward added a comment -

        In fact, should all of these just always return false from equals() in order to prevent them being cached and reused incorrectly?

        Show
        romseygeek Alan Woodward added a comment - In fact, should all of these just always return false from equals() in order to prevent them being cached and reused incorrectly?
        Hide
        romseygeek Alan Woodward added a comment -

        I think it's important for equals to behave correctly and never return true for values sources that could return different values.

        Ah, I see what you mean - if we had a FunctionMatchQuery that was cached across multiple IndexSearcher reopens it would potentially return the wrong values, because the underlying DoubleValues defined by the top-level index readers. I'll change things back again.

        I think you forgot to update FunctionMatchQuery

        I did, thanks for picking it up...

        Show
        romseygeek Alan Woodward added a comment - I think it's important for equals to behave correctly and never return true for values sources that could return different values. Ah, I see what you mean - if we had a FunctionMatchQuery that was cached across multiple IndexSearcher reopens it would potentially return the wrong values, because the underlying DoubleValues defined by the top-level index readers. I'll change things back again. I think you forgot to update FunctionMatchQuery I did, thanks for picking it up...
        Hide
        jpountz Adrien Grand added a comment -

        Your point about using the description for equals is correct, but I still find it fragile. I think it's important for equals to behave correctly and never return true for values sources that could return different values.

        Looking again at the patch, I think you forgot to update FunctionMatchQuery to make it rewrite the values source?

        I really dislike the 'rewrite until things stay the same' API on Query...

        Let's open an issue about fixing it?

        Show
        jpountz Adrien Grand added a comment - Your point about using the description for equals is correct, but I still find it fragile. I think it's important for equals to behave correctly and never return true for values sources that could return different values. Looking again at the patch, I think you forgot to update FunctionMatchQuery to make it rewrite the values source? I really dislike the 'rewrite until things stay the same' API on Query... Let's open an issue about fixing it?
        Hide
        romseygeek Alan Woodward added a comment -

        Patch including Adrien's suggestions. This also changes rewrite() to take an IndexSearcher rather than an IndexReader, which is more useful for a follow-up issue I'll open in a moment.

        Show
        romseygeek Alan Woodward added a comment - Patch including Adrien's suggestions. This also changes rewrite() to take an IndexSearcher rather than an IndexReader, which is more useful for a follow-up issue I'll open in a moment.
        Hide
        romseygeek Alan Woodward added a comment -

        two instances that have been built on a different reader will be considered equals

        This is true, but there aren't really any circumstances under which it would happen. Comparison for equality is done when building queries or sorts, and these are indexreader independent. It's only after a rewrite that the values would actually be different, and the only callers of rewrite are internal lucene classes. It would be nice to enforce this somehow by making rewrite() protected, but unfortunately I think it's called from different packages/modules so this won't work.

        Should we use == rather than equals

        Will do.

        Did you remove the if (boost == 1f) return expl; from FunctionScoreQuery on purpose?

        Yes; just returning the straight DVS explanation breaks things when the overall query matches, but the value source has no value - we end up returning Explanation.noMatch() from a document that matches, but has a zero score.

        re TermFreqDoubleValuesSource that's a nice idea, I'll add a DoubleValues.EMPTY instance and use that.

        rewriting only once is enough

        Yes! Perhaps I should add some javadoc making that explicit. I really dislike the 'rewrite until things stay the same' API on Query...

        Show
        romseygeek Alan Woodward added a comment - two instances that have been built on a different reader will be considered equals This is true, but there aren't really any circumstances under which it would happen. Comparison for equality is done when building queries or sorts, and these are indexreader independent. It's only after a rewrite that the values would actually be different, and the only callers of rewrite are internal lucene classes. It would be nice to enforce this somehow by making rewrite() protected, but unfortunately I think it's called from different packages/modules so this won't work. Should we use == rather than equals Will do. Did you remove the if (boost == 1f) return expl; from FunctionScoreQuery on purpose? Yes; just returning the straight DVS explanation breaks things when the overall query matches, but the value source has no value - we end up returning Explanation.noMatch() from a document that matches, but has a zero score. re TermFreqDoubleValuesSource that's a nice idea, I'll add a DoubleValues.EMPTY instance and use that. rewriting only once is enough Yes! Perhaps I should add some javadoc making that explicit. I really dislike the 'rewrite until things stay the same' API on Query...
        Hide
        jpountz Adrien Grand added a comment -

        Thanks Alan. I think those new functions are going to be useful.

        One wrinkle here is that we can't compare closures, so IndexReaderDoubleValuesSource only uses its description field for comparisons. These are all private implementations, and there's a comment to that effect, so I think this is OK.

        I think this is incorrect as two instances that have been built on a different reader will be considered equals even though they produce different values? Let's pass things that are actually comparable to IndexReaderDoubleValuesSource or just live with the fact that two instances that do the same thing are considered unequal (which is ok imo as long as a.equals(b) returns true if a == b)?

        Should we use == rather than equals to know whether the source was rewritten? I don't like one more than the other but since queries already use == I'd rather like things to be consistent?

        Did you remove the if (boost == 1f) return expl; from FunctionScoreQuery on purpose? I know it is not necessary for correctness but I still like the fact that it makes the explanation easier to read in the default case that the boost is 1.

        In TermFreqDoubleValuesSource could we just return an empty instance if the term is not found instead of having to do a null check for every document?

        In lucene/queries/src/java/org/apache/lucene/queries/function/ValueSource.java you seem to be assuming that rewriting only once is enough? Is it true?

        Show
        jpountz Adrien Grand added a comment - Thanks Alan. I think those new functions are going to be useful. One wrinkle here is that we can't compare closures, so IndexReaderDoubleValuesSource only uses its description field for comparisons. These are all private implementations, and there's a comment to that effect, so I think this is OK. I think this is incorrect as two instances that have been built on a different reader will be considered equals even though they produce different values? Let's pass things that are actually comparable to IndexReaderDoubleValuesSource or just live with the fact that two instances that do the same thing are considered unequal (which is ok imo as long as a.equals(b) returns true if a == b)? Should we use == rather than equals to know whether the source was rewritten? I don't like one more than the other but since queries already use == I'd rather like things to be consistent? Did you remove the if (boost == 1f) return expl; from FunctionScoreQuery on purpose? I know it is not necessary for correctness but I still like the fact that it makes the explanation easier to read in the default case that the boost is 1. In TermFreqDoubleValuesSource could we just return an empty instance if the term is not found instead of having to do a null check for every document? In lucene/queries/src/java/org/apache/lucene/queries/function/ValueSource.java you seem to be assuming that rewriting only once is enough? Is it true?
        Hide
        romseygeek Alan Woodward added a comment -

        Final patch, all tests + precommit pass

        Show
        romseygeek Alan Woodward added a comment - Final patch, all tests + precommit pass
        Hide
        romseygeek Alan Woodward added a comment -

        Patch bringing this up-to-date. Relevant changes are:

        • IndexReaderFunctions is final, and I moved the constructor to the top for Uwe's sake
        • Everything implements hashCode and equals now, which means concrete implementations rather than anonymous classes.
        • One wrinkle here is that we can't compare closures, so IndexReaderDoubleValuesSource only uses its description field for comparisons. These are all private implementations, and there's a comment to that effect, so I think this is OK.

        I haven't added anything about checking for double-rewrite, as it doesn't really matter for any of the current implementations - everything either rewrites to a constant, or returns itself.

        Show
        romseygeek Alan Woodward added a comment - Patch bringing this up-to-date. Relevant changes are: IndexReaderFunctions is final, and I moved the constructor to the top for Uwe's sake Everything implements hashCode and equals now, which means concrete implementations rather than anonymous classes. One wrinkle here is that we can't compare closures, so IndexReaderDoubleValuesSource only uses its description field for comparisons. These are all private implementations, and there's a comment to that effect, so I think this is OK. I haven't added anything about checking for double-rewrite, as it doesn't really matter for any of the current implementations - everything either rewrites to a constant, or returns itself.
        Hide
        thetaphi Uwe Schindler added a comment -

        Also add @FunctionalInterface to ReaderFunction interface for consistency!

        Show
        thetaphi Uwe Schindler added a comment - Also add @FunctionalInterface to ReaderFunction interface for consistency!
        Hide
        thetaphi Uwe Schindler added a comment - - edited

        Sorry, those methods are global (on IndexReader, so also composites). So LeafReader is the wrong place. Nuke my comment.

        Nevertheless the IndexReaderFunctions class should be final and have private ctor. Did not find the ctor, as it was at the very end.

        Show
        thetaphi Uwe Schindler added a comment - - edited Sorry, those methods are global (on IndexReader, so also composites). So LeafReader is the wrong place. Nuke my comment. Nevertheless the IndexReaderFunctions class should be final and have private ctor. Did not find the ctor, as it was at the very end.
        Hide
        thetaphi Uwe Schindler added a comment -

        I was thinking about this: Why do we need a static-method-only helper class (that should have a private constructor, please!)? Would it not be possible to add these methods as default impls to LeafReader? I have not looked into the details, but this looks more easy to understand. I agree the current code can be separate from core, but once we move to core, I'd add that to the LeafReaders.

        Show
        thetaphi Uwe Schindler added a comment - I was thinking about this: Why do we need a static-method-only helper class (that should have a private constructor, please!)? Would it not be possible to add these methods as default impls to LeafReader? I have not looked into the details, but this looks more easy to understand. I agree the current code can be separate from core, but once we move to core, I'd add that to the LeafReaders.
        Hide
        jpountz Adrien Grand added a comment -

        So have an AssertingDoubleValuesSource? I like that idea, although I'm not sure where it would get injected into the tests.

        It is indeed harder than with eg. AssertingWeight. I guess one option would be to unwrap instances of FunctionScoreQuery and FunctionMatchQuery in AssertingIndexSearcher.rewrite to replace the values source with an asserting instance. However, I can't think of an easy way to do something similar when values sources are used for eg. sorting. Another option would be to put those assertions directly into existing implementations.

        Show
        jpountz Adrien Grand added a comment - So have an AssertingDoubleValuesSource? I like that idea, although I'm not sure where it would get injected into the tests. It is indeed harder than with eg. AssertingWeight. I guess one option would be to unwrap instances of FunctionScoreQuery and FunctionMatchQuery in AssertingIndexSearcher.rewrite to replace the values source with an asserting instance. However, I can't think of an easy way to do something similar when values sources are used for eg. sorting. Another option would be to put those assertions directly into existing implementations.
        Hide
        jpountz Adrien Grand added a comment -

        I think it makes sense for these values sources to return doubles natively as their main use-case is to be used for scoring, which needs to produce floats?

        Show
        jpountz Adrien Grand added a comment - I think it makes sense for these values sources to return doubles natively as their main use-case is to be used for scoring, which needs to produce floats?
        Hide
        romseygeek Alan Woodward added a comment -

        I'm wondering whether we should try to detect double rewrite

        So have an AssertingDoubleValuesSource? I like that idea, although I'm not sure where it would get injected into the tests.

        wouldn't LongValueSource be more fitting/appropriate?

        DoubleValuesSource is used in more places though (FunctionScoreQuery, Expressions module), so I think it's more useable this way. I guess I see LongValuesSource as a convenience wrapper, and DoubleValuesSource as the core API.

        Show
        romseygeek Alan Woodward added a comment - I'm wondering whether we should try to detect double rewrite So have an AssertingDoubleValuesSource? I like that idea, although I'm not sure where it would get injected into the tests. wouldn't LongValueSource be more fitting/appropriate? DoubleValuesSource is used in more places though (FunctionScoreQuery, Expressions module), so I think it's more useable this way. I guess I see LongValuesSource as a convenience wrapper, and DoubleValuesSource as the core API.
        Hide
        dsmiley David Smiley added a comment -

        all int-valued IndexReader methods now return a DoubleValuesSource, as an int can fit into a double with no loss of precision

        It's all well and good that there is no loss of precision but if the data is a whole number then wouldn't LongValueSource be more fitting/appropriate?

        Show
        dsmiley David Smiley added a comment - all int-valued IndexReader methods now return a DoubleValuesSource, as an int can fit into a double with no loss of precision It's all well and good that there is no loss of precision but if the data is a whole number then wouldn't LongValueSource be more fitting/appropriate?
        Hide
        jpountz Adrien Grand added a comment -

        Looking at how rewrite is handled again, I'm wondering whether we should try to detect double rewrite, which is most likely a usage bug?

        Show
        jpountz Adrien Grand added a comment - Looking at how rewrite is handled again, I'm wondering whether we should try to detect double rewrite, which is most likely a usage bug?
        Hide
        jpountz Adrien Grand added a comment -

        +1

        Show
        jpountz Adrien Grand added a comment - +1
        Hide
        jpountz Adrien Grand added a comment -

        +1

        Show
        jpountz Adrien Grand added a comment - +1
        Hide
        romseygeek Alan Woodward added a comment -

        I tried a few different ways of making these sources immutable, including holding a weak hashmap of reader ids on them, but in the end I think the cleanest API is to have a rewrite(IndexReader) method on Double/LongValuesSource. This requires a couple of changes elsewhere in the codebase - we already have a Sort.rewrite() method, but it wasn't being called by IndexSearcher or SortRescorer, for example - but I think ends up with things being neater.

        I removed norms() and joinDocFreq(); all int-valued IndexReader methods now return a DoubleValuesSource, as an int can fit into a double with no loss of precision. IndexReaderFunctions.sumTotalTermFreq() returns a LongValuesSource, as it delegates to a long-valued function. If a user wants to use this in an expression, or similar, then they'll need to explicitly cast using .toDoubleValuesSource().

        A nice extra feature here would be to wrap these up in a Bindings implementation for the expressions module, which would require moving back into core, but that's for a follow-up issue, I think.

        Show
        romseygeek Alan Woodward added a comment - I tried a few different ways of making these sources immutable, including holding a weak hashmap of reader ids on them, but in the end I think the cleanest API is to have a rewrite(IndexReader) method on Double/LongValuesSource. This requires a couple of changes elsewhere in the codebase - we already have a Sort.rewrite() method, but it wasn't being called by IndexSearcher or SortRescorer, for example - but I think ends up with things being neater. I removed norms() and joinDocFreq(); all int-valued IndexReader methods now return a DoubleValuesSource, as an int can fit into a double with no loss of precision. IndexReaderFunctions.sumTotalTermFreq() returns a LongValuesSource, as it delegates to a long-valued function. If a user wants to use this in an expression, or similar, then they'll need to explicitly cast using .toDoubleValuesSource(). A nice extra feature here would be to wrap these up in a Bindings implementation for the expressions module, which would require moving back into core, but that's for a follow-up issue, I think.
        Hide
        jpountz Adrien Grand added a comment -

        I like the termFreq and reader constant values sources, I can imagine them to be useful to users. Can you add equals/methods to these new impls (LUCENE-7723)? (I'm thinking that one easy way to make equals/hashcode work for them could be to simply make them singletons)

        DoubleValuesSource (or LongValuesSource, in the case of norms)

        Why are only norms a LongValuesSource? Should termFreq, docFreq, maxDoc and numDocs be longs as well?

        I'm curious whether you have a use-case for the joinDocFreq values sources, it looks very expensive and a bit esoteric to me. I'm also unsure how useful exposing raw norms is given that the value is an implementation detail of the similarity?

        The IndexReaderFunctions class is currently in the queries module, but it could easily be in core instead.

        +1 to keep in queries for now

        Show
        jpountz Adrien Grand added a comment - I like the termFreq and reader constant values sources, I can imagine them to be useful to users. Can you add equals/methods to these new impls ( LUCENE-7723 )? (I'm thinking that one easy way to make equals/hashcode work for them could be to simply make them singletons) DoubleValuesSource (or LongValuesSource, in the case of norms) Why are only norms a LongValuesSource? Should termFreq, docFreq, maxDoc and numDocs be longs as well? I'm curious whether you have a use-case for the joinDocFreq values sources, it looks very expensive and a bit esoteric to me. I'm also unsure how useful exposing raw norms is given that the value is an implementation detail of the similarity? The IndexReaderFunctions class is currently in the queries module, but it could easily be in core instead. +1 to keep in queries for now
        Hide
        romseygeek Alan Woodward added a comment -

        Here is a patch, implementing all the IndexReader-specific ValueSources as DoubleValuesSource (or LongValuesSource, in the case of norms), and adding a couple of extra ones that were missing (docCount, sumDocFreq, numDeletedDocs). The IndexReaderFunctions class is currently in the queries module, but it could easily be in core instead.

        I haven't added the Similarity-specific sources here (idf, tf) - they could maybe added directly to TFIDFSimilarity in another issue

        Show
        romseygeek Alan Woodward added a comment - Here is a patch, implementing all the IndexReader-specific ValueSources as DoubleValuesSource (or LongValuesSource, in the case of norms), and adding a couple of extra ones that were missing (docCount, sumDocFreq, numDeletedDocs). The IndexReaderFunctions class is currently in the queries module, but it could easily be in core instead. I haven't added the Similarity-specific sources here (idf, tf) - they could maybe added directly to TFIDFSimilarity in another issue

          People

          • Assignee:
            romseygeek Alan Woodward
            Reporter:
            romseygeek Alan Woodward
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development