Lucene - Core
  1. Lucene - Core
  2. LUCENE-2879

MultiPhraseQuery sums its own idf instead of Similarity.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      MultiPhraseQuery is a generalized version of PhraseQuery, and computes IDF the same way by default (by summing across the terms).

      The problem is it doesn't let the Similarity do this: PhraseQuery calls Similarity.idfExplain(Collection<Term> terms, IndexSearcher searcher),
      but MultiPhraseQuery just sums itself, calling Similarity.idf(int, int) for each term.

        Activity

        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1
        Hide
        Robert Muir added a comment -

        Committed revision 1062633 (trunk), 1062636 (branch_3x).

        If we do another 3.0.x/2.9.x we can consider this one then.

        Show
        Robert Muir added a comment - Committed revision 1062633 (trunk), 1062636 (branch_3x). If we do another 3.0.x/2.9.x we can consider this one then.
        Hide
        Robert Muir added a comment -

        Well its not a great solution really at all, and as you mention
        maybe its best to take care of it in the context of LUCENE-2878,
        but mainly I just didn't want us to forget about the problem.

        I'll move forward with this issue later today if there are no objections.

        Show
        Robert Muir added a comment - Well its not a great solution really at all, and as you mention maybe its best to take care of it in the context of LUCENE-2878 , but mainly I just didn't want us to forget about the problem. I'll move forward with this issue later today if there are no objections.
        Hide
        Doron Cohen added a comment -

        ok great I'll have a look there...

        Show
        Doron Cohen added a comment - ok great I'll have a look there...
        Hide
        Robert Muir added a comment -

        Doron, i hacked up a quick patch with some ideas for a fix for this one...
        from looking at the scores, I think its actually fairly serious for SpanQuery users.

        I'll open a separate issue for this.

        Show
        Robert Muir added a comment - Doron, i hacked up a quick patch with some ideas for a fix for this one... from looking at the scores, I think its actually fairly serious for SpanQuery users. I'll open a separate issue for this.
        Hide
        Doron Cohen added a comment -

        In my opinion this is an off-by-one

        Right, it would return 0.5 vs 1.0 in TermQuery.
        Perhaps similar issue with SpanNearQuery.
        (I guess this may be taken care of in LUCENE-2878.)

        Show
        Doron Cohen added a comment - In my opinion this is an off-by-one Right, it would return 0.5 vs 1.0 in TermQuery. Perhaps similar issue with SpanNearQuery. (I guess this may be taken care of in LUCENE-2878 .)
        Hide
        Robert Muir added a comment -

        A small thing that bothered me was that an explanation is created although the user did not call explain(), and in general explain() is considered slower, but it is called once per query, so it should not be a perf issue, and that's the case already for two other queries so anyhow this one (MFQ) should first be made consistent, which is done by this patch.

        Well, this IDFExplanation is confusing/tricky... so with a good implementation, its an abstract class so creating the "Explanation" does nothing really.

        Instead the explanation is calculated "lazily", only if you ask for it:

            /**
             * This should be calculated lazily if possible.
             * 
             * @return the explanation for the idf factor.
             */
            public abstract String explain();
        

        Not saying that the patch should change, just pointing out the difference in sum-of-square-weights computation between SpanWeight and MFQ.

        I saw this and it bothered me a bit as well too. But I suppose its ok, given that the whole thing is only an approximation anyway right?
        (In a lot of more "ordinary" short queries, the # of unique terms will be similar to # of terms).

        Additionally if this really bothered someone, they could work around it by putting all the terms into a HashSet in their IDF implementation to make
        PhraseQuery, MultiPhraseQuery work like SpanQuery.

        In general, when I look at the SpanQueries I am frustrated with other scoring problems.
        For example, I think that SpanScorer by default should be consistent with our other Queries.
        But imagine a Simple SpanTermQuery, its tf() calculation is done like this:

           while (matches) {
              int matchLength = spans.end() - spans.start();
              freq += similarity.sloppyFreq(matchLength);
           }
           ...
           similarity.tf(freq);
        

        In my opinion this is an off-by-one
        In the current implementation, this produces slop of 1 for an exact SpanTermQuery.
        if instead it were spans.end() - spans.start() - 1, it would produce a slop of 0,
        yielding a sloppyFreq of 1 for each match, and would equate exactly with TermQuery.

        Show
        Robert Muir added a comment - A small thing that bothered me was that an explanation is created although the user did not call explain(), and in general explain() is considered slower, but it is called once per query, so it should not be a perf issue, and that's the case already for two other queries so anyhow this one (MFQ) should first be made consistent, which is done by this patch. Well, this IDFExplanation is confusing/tricky... so with a good implementation, its an abstract class so creating the "Explanation" does nothing really. Instead the explanation is calculated "lazily", only if you ask for it: /** * This should be calculated lazily if possible. * * @return the explanation for the idf factor. */ public abstract String explain(); Not saying that the patch should change, just pointing out the difference in sum-of-square-weights computation between SpanWeight and MFQ. I saw this and it bothered me a bit as well too. But I suppose its ok, given that the whole thing is only an approximation anyway right? (In a lot of more "ordinary" short queries, the # of unique terms will be similar to # of terms). Additionally if this really bothered someone, they could work around it by putting all the terms into a HashSet in their IDF implementation to make PhraseQuery, MultiPhraseQuery work like SpanQuery. In general, when I look at the SpanQueries I am frustrated with other scoring problems. For example, I think that SpanScorer by default should be consistent with our other Queries. But imagine a Simple SpanTermQuery, its tf() calculation is done like this: while (matches) { int matchLength = spans.end() - spans.start(); freq += similarity.sloppyFreq(matchLength); } ... similarity.tf(freq); In my opinion this is an off-by-one In the current implementation, this produces slop of 1 for an exact SpanTermQuery. if instead it were spans.end() - spans.start() - 1, it would produce a slop of 0, yielding a sloppyFreq of 1 for each match, and would equate exactly with TermQuery.
        Hide
        Doron Cohen added a comment -

        +1 for fixing this inconsistent behavior.
        BTW also SpanWeight calls idfExplain() for same reason.
        Patch looks good, new test case passes with the fix and fails without it.

        A small thing that bothered me was that an explanation is created although the user did not call explain(), and in general explain() is considered slower, but it is called once per query, so it should not be a perf issue, and that's the case already for two other queries so anyhow this one (MFQ) should first be made consistent, which is done by this patch.

        It is interesting that the implementation of a similar logic in SpanWeight is more compact:

        SpanWeight: calls extractTerms()
        terms=new HashSet<Term>();
        query.extractTerms(terms);
        idfExp = similarity.idfExplain(terms, searcher);
        

        But doing the same in MFQ would change its logic, as it would consider each term only once.
        Not saying that the patch should change, just pointing out the difference in sum-of-square-weights computation between SpanWeight and MFQ.
        Boolean Query fore example, would iterate over its sub queries and sum theirs, and so, if it so happens that the same term appears in two descendant queries that term would contribute twice to the sum. In this sense, MFQ and BQ behave similarly, both differ from SpanQuery... well I guess this falls to the "black magic" area

        Show
        Doron Cohen added a comment - +1 for fixing this inconsistent behavior. BTW also SpanWeight calls idfExplain() for same reason. Patch looks good, new test case passes with the fix and fails without it. A small thing that bothered me was that an explanation is created although the user did not call explain(), and in general explain() is considered slower, but it is called once per query, so it should not be a perf issue, and that's the case already for two other queries so anyhow this one (MFQ) should first be made consistent, which is done by this patch. It is interesting that the implementation of a similar logic in SpanWeight is more compact: SpanWeight: calls extractTerms() terms= new HashSet<Term>(); query.extractTerms(terms); idfExp = similarity.idfExplain(terms, searcher); But doing the same in MFQ would change its logic, as it would consider each term only once. Not saying that the patch should change, just pointing out the difference in sum-of-square-weights computation between SpanWeight and MFQ. Boolean Query fore example, would iterate over its sub queries and sum theirs, and so, if it so happens that the same term appears in two descendant queries that term would contribute twice to the sum. In this sense, MFQ and BQ behave similarly, both differ from SpanQuery... well I guess this falls to the "black magic" area
        Hide
        Robert Muir added a comment -

        patch, i added a test for this.

        Show
        Robert Muir added a comment - patch, i added a test for this.

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development