Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: flexscoring branch
    • Fix Version/s: flexscoring branch
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      In order to support ranking methods besides TF-IDF, we need to make the statistics they need available. These statistics could be computed in computeWeight (soon to become computeStats) and stored in a separate object for easy access. Since this object will be used solely by subclasses of Similarity, it should be implented as a static inner class, i.e. Similarity.Stats.

      There are two ways this could be implemented:

      • as a single Similarity.Stats class, reused by all ranking algorithms. In this case, this class would have a member field for all statistics;
      • as a hierarchy of Stats classes, one for each ranking algorithm. Each subclass would define only the statistics needed for the ranking algorithm.

      In the second case, the Stats class in DefaultSimilarity would have a single field, idf, while the one in e.g. BM25Similarity would have idf and average field/document length.

      1. LUCENE-3174.patch
        14 kB
        David Mark Nemeskey
      2. LUCENE-3174.patch
        14 kB
        David Mark Nemeskey
      3. LUCENE-3174.patch
        42 kB
        David Mark Nemeskey
      4. LUCENE-3174.patch
        62 kB
        Robert Muir
      5. LUCENE-3174.patch
        63 kB
        Robert Muir
      6. LUCENE-3174.patch
        66 kB
        Robert Muir
      7. LUCENE-3174_normalize_boost.patch
        13 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        thanks David!

        Show
        Robert Muir added a comment - thanks David!
        Hide
        Robert Muir added a comment -

        i fixed a few problems: javadocs warnings and also the fact that i had left an assert commented out from hair-pulling with CustomScoreQuery.

        Show
        Robert Muir added a comment - i fixed a few problems: javadocs warnings and also the fact that i had left an assert commented out from hair-pulling with CustomScoreQuery.
        Hide
        Robert Muir added a comment -

        here's the patch with the unrelated bug fixed in CustomScoreQuery.

        now all tests pass.

        Show
        Robert Muir added a comment - here's the patch with the unrelated bug fixed in CustomScoreQuery. now all tests pass.
        Hide
        Robert Muir added a comment -

        here's an updated patch, i pushed query normalization into the Stats, and removed idf/etc from the weight impls.

        I think this is close, all tests pass except TestCustomScoreQuery (its some explanation problem). I'm this close to @Ignoring it, since the query nor the test make any sense.

        Show
        Robert Muir added a comment - here's an updated patch, i pushed query normalization into the Stats, and removed idf/etc from the weight impls. I think this is close, all tests pass except TestCustomScoreQuery (its some explanation problem). I'm this close to @Ignoring it, since the query nor the test make any sense.
        Hide
        Robert Muir added a comment -

        Almost completely removed idf from the Weights – it still lingers in explain().

        Right, explain() is a big TODO of a refactoring job, you did the right thing, its not easily solved until we refactor it big-time so that any arbitrary Similarity can explain its own scoring. Not to make any promises, but I think by doing such a thing (letting a Similarity control how the explaining works), we will make progress towards LUCENE-3118 too: if you customize the scoring system for your app, you should be able to explain the scores in a way that make sense to your app too.

        The DocScorer factory methods now need both the Weight and the Stats; that's the best I could do for now.

        This sounds like a good step to me! We want to just pass only the Stats to the DocScorer factory methods, but we have some more work to do before that... such as better handling of the whole boosting situation and pushing all responsibility for query normalization into stats.

        once we have done this, i think Weight/Stats will make sense (except for naming) as it will be be the parallel of Scorer/DocScorer, full responsibility for scoring is in the Similarity and Weight/Scorer only handle things like seeking to terms, creating docsenums, iterating postings lists, etc

        Show
        Robert Muir added a comment - Almost completely removed idf from the Weights – it still lingers in explain(). Right, explain() is a big TODO of a refactoring job, you did the right thing, its not easily solved until we refactor it big-time so that any arbitrary Similarity can explain its own scoring. Not to make any promises, but I think by doing such a thing (letting a Similarity control how the explaining works), we will make progress towards LUCENE-3118 too: if you customize the scoring system for your app, you should be able to explain the scores in a way that make sense to your app too. The DocScorer factory methods now need both the Weight and the Stats; that's the best I could do for now. This sounds like a good step to me! We want to just pass only the Stats to the DocScorer factory methods, but we have some more work to do before that... such as better handling of the whole boosting situation and pushing all responsibility for query normalization into stats. once we have done this, i think Weight/Stats will make sense (except for naming) as it will be be the parallel of Scorer/DocScorer, full responsibility for scoring is in the Similarity and Weight/Scorer only handle things like seeking to terms, creating docsenums, iterating postings lists, etc
        Hide
        David Mark Nemeskey added a comment -

        Almost completely removed idf from the Weights – it still lingers in explain(). The DocScorer factory methods now need both the Weight and the Stats; that's the best I could do for now.

        Robert: if you don't mind, I have included you patch in this one as well.

        Show
        David Mark Nemeskey added a comment - Almost completely removed idf from the Weights – it still lingers in explain(). The DocScorer factory methods now need both the Weight and the Stats; that's the best I could do for now. Robert: if you don't mind, I have included you patch in this one as well.
        Hide
        Robert Muir added a comment -

        make BooleanWeight & friends independent of the normalization method?

        I don't like how they rely upon the normalization to integrate their boost to the lower level queries... I think we should always keep this 'separate' so a sim can handle the boost how it wants... here's a patch (probably not the ultimate end-all but a step)

        Show
        Robert Muir added a comment - make BooleanWeight & friends independent of the normalization method? I don't like how they rely upon the normalization to integrate their boost to the lower level queries... I think we should always keep this 'separate' so a sim can handle the boost how it wants... here's a patch (probably not the ultimate end-all but a step)
        Hide
        David Mark Nemeskey added a comment -

        Ok, let's go with EasySimilarity then.

        TODOs (according to the irc chat):

        • move sumOfSquaredWeights() and normalize to Stats, so that Weight doesn't need to know about IDF
        • rename sumOfSquaredWeights()
        • make BooleanWeight & friends independent of the normalization method? (currently BooleanWeight expects the sqrt() in DefaultSimilarityProvider).
        Show
        David Mark Nemeskey added a comment - Ok, let's go with EasySimilarity then. TODOs (according to the irc chat): move sumOfSquaredWeights() and normalize to Stats, so that Weight doesn't need to know about IDF rename sumOfSquaredWeights() make BooleanWeight & friends independent of the normalization method? (currently BooleanWeight expects the sqrt() in DefaultSimilarityProvider).
        Hide
        Robert Muir added a comment -

        Right, but there are some serious disadvantages:

        • we don't know what custom stats someone might want to integrate... it could be computed based however they want.
        • we might add newer stats, but keeping it opaque to the sim relieves us of backwards compatibility hassles.

        for this reason i would like it completely opaque... its not like you are gonna have to cast everywhere, only a
        single time when creating your docscorer.

        I agree with you that for researchers, they want to see all the ordinary IR stats available like terrier or whatever
        We should just make an EasySimilarity extends Similarity for them, that returns AllStats with all these common ones.

        But I think other people are going to want to extend lucene to their domain so keeping it opaque is best, so they can do this.

        Show
        Robert Muir added a comment - Right, but there are some serious disadvantages: we don't know what custom stats someone might want to integrate... it could be computed based however they want. we might add newer stats, but keeping it opaque to the sim relieves us of backwards compatibility hassles. for this reason i would like it completely opaque... its not like you are gonna have to cast everywhere, only a single time when creating your docscorer. I agree with you that for researchers, they want to see all the ordinary IR stats available like terrier or whatever We should just make an EasySimilarity extends Similarity for them, that returns AllStats with all these common ones. But I think other people are going to want to extend lucene to their domain so keeping it opaque is best, so they can do this.
        Hide
        David Mark Nemeskey added a comment -

        I would disagree in this case, i think a composite similarity that has N sub-similarities would just return a MultiStats that keeps these in an array, as this composite doesnt care at all whats in them, it just needs to be able to delegate them back to the sub's docscorers later.

        I didn't think of that. I really like this idea.

        As for Stats, I see several advantages of a single class:

        • no need for casting. It may be just me, but having to cast everywhere doesn't feel right for me;
        • we show in one place what statistics the ranking algorithms use, the user of the library doesn't need to "hunt" for this information;
        • I think there will be Similarities that use the same Stats subclass, e.g. MockLMSimilarity uses TFIDFSimilarity.IDFStats. Or it could define its own Stats that looks exactly the same. Either solution seems a bit strange for me;
        • one less class to write if you want to add a new Similarity (provided you don't need a new statistic, in which case you have to write your own and cast it).
        Show
        David Mark Nemeskey added a comment - I would disagree in this case, i think a composite similarity that has N sub-similarities would just return a MultiStats that keeps these in an array, as this composite doesnt care at all whats in them, it just needs to be able to delegate them back to the sub's docscorers later. I didn't think of that. I really like this idea. As for Stats, I see several advantages of a single class: no need for casting. It may be just me, but having to cast everywhere doesn't feel right for me; we show in one place what statistics the ranking algorithms use, the user of the library doesn't need to "hunt" for this information; I think there will be Similarities that use the same Stats subclass, e.g. MockLMSimilarity uses TFIDFSimilarity.IDFStats. Or it could define its own Stats that looks exactly the same. Either solution seems a bit strange for me; one less class to write if you want to add a new Similarity (provided you don't need a new statistic, in which case you have to write your own and cast it).
        Hide
        Robert Muir added a comment -

        However, passing Stats to the methods you mentioned would only be possible if Stats already defined every possible statistic, either as public members or getter methods. I don't mind if it becomes like that;

        I don't think anything needs to be in Stats itself. If i write BM25Similarity, then i make my own BM25Similarity.BM25Stats and put what i need in it. its passed to my docscorer as Stats and I cast to BM25Stats...done.

        Also, I am thinking of leaving idf out of Stats in favor of df, and doing the computation in the DocScorers. This would make it possible to reuse the same Stats object e.g. for composite Similarities.

        I would disagree in this case, i think a composite similarity that has N sub-similarities would just return a MultiStats that keeps these in an array, as this composite doesnt care at all whats in them, it just needs to be able to delegate them back to the sub's docscorers later.

        Show
        Robert Muir added a comment - However, passing Stats to the methods you mentioned would only be possible if Stats already defined every possible statistic, either as public members or getter methods. I don't mind if it becomes like that; I don't think anything needs to be in Stats itself. If i write BM25Similarity, then i make my own BM25Similarity.BM25Stats and put what i need in it. its passed to my docscorer as Stats and I cast to BM25Stats...done. Also, I am thinking of leaving idf out of Stats in favor of df, and doing the computation in the DocScorers. This would make it possible to reuse the same Stats object e.g. for composite Similarities. I would disagree in this case, i think a composite similarity that has N sub-similarities would just return a MultiStats that keeps these in an array, as this composite doesnt care at all whats in them, it just needs to be able to delegate them back to the sub's docscorers later.
        Hide
        David Mark Nemeskey added a comment -

        Made Similarity.Stats static.

        However, passing Stats to the methods you mentioned would only be possible if Stats already defined every possible statistic, either as public members or getter methods. I don't mind if it becomes like that; Similarity subclasses would not be required to compute all statistics, only the once they need.

        Also, I am thinking of leaving idf out of Stats in favor of df, and doing the computation in the DocScorers. This would make it possible to reuse the same Stats object e.g. for composite Similarities.

        Show
        David Mark Nemeskey added a comment - Made Similarity.Stats static. However, passing Stats to the methods you mentioned would only be possible if Stats already defined every possible statistic, either as public members or getter methods. I don't mind if it becomes like that; Similarity subclasses would not be required to compute all statistics, only the once they need. Also, I am thinking of leaving idf out of Stats in favor of df, and doing the computation in the DocScorers. This would make it possible to reuse the same Stats object e.g. for composite Similarities.
        Hide
        Robert Muir added a comment -

        Hi David, after reviewing the patch, I think we should do this:

        • make Similarity.Stats static
        • pass this, instead of Weight, to exactDocScorer() and sloppyDocScorer(). this should fix the MockBM25Sim issue as it wont need to hold a stats since its passed here.
        Show
        Robert Muir added a comment - Hi David, after reviewing the patch, I think we should do this: make Similarity.Stats static pass this, instead of Weight, to exactDocScorer() and sloppyDocScorer(). this should fix the MockBM25Sim issue as it wont need to hold a stats since its passed here.
        Hide
        David Mark Nemeskey added a comment -

        Here's what the patch does:

        • it introduces the Similarity.Stats class and its subclasses
        • renames computeWeight() to computeStats()
        • fixes methods that call computeStats()

        What remains to be done:

        • rewrite the javadoc
        • Stats will be used inside other Similarity methods: its availability should be unsured somehow. The current solution in MockBM25Similarity is not satisfactory because there is only one Similarity object at a time.
        • MultiPhraseWeight, PhraseWeight, SpanWeight, TermWeight call computeStats and extract the IDFExplain object. This level of coupling is not desirable, and should be eliminated. All the more so, as not all Similarity subclasses will have an idf
        • It might not even make sense to expose computeStats()?

        To consider:

        • it might be better if Stats were static, because they could inherit fields from each other
        Show
        David Mark Nemeskey added a comment - Here's what the patch does: it introduces the Similarity.Stats class and its subclasses renames computeWeight() to computeStats() fixes methods that call computeStats() What remains to be done: rewrite the javadoc Stats will be used inside other Similarity methods: its availability should be unsured somehow. The current solution in MockBM25Similarity is not satisfactory because there is only one Similarity object at a time. MultiPhraseWeight, PhraseWeight, SpanWeight, TermWeight call computeStats and extract the IDFExplain object. This level of coupling is not desirable, and should be eliminated. All the more so, as not all Similarity subclasses will have an idf It might not even make sense to expose computeStats()? To consider: it might be better if Stats were static, because they could inherit fields from each other
        Hide
        David Mark Nemeskey added a comment -

        Patch v0.1

        Show
        David Mark Nemeskey added a comment - Patch v0.1

          People

          • Assignee:
            David Mark Nemeskey
            Reporter:
            David Mark Nemeskey
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development