Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of LUCENE-2879.

      You can see a full description there, but the gist is that SpanQuery sums up freqs with "sloppyFreq".
      However, this slop is simply spans.end() - spans.start().

      For a SpanTermQuery, for example, this means it's scoring 0.5 for TF versus TermQuery's 1.0.
      As you can imagine, in practical situations this makes it difficult for SpanQuery users to
      use SpanQueries for effective ranking, especially in combination with non-span queries (maybe via DisjunctionMaxQuery, etc.).

      The problem is more general than this simple example: for example SpanNearQuery should be consistent with PhraseQuery's slop.
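
The 0.5 versus 1.0 figures follow from the default sloppy-frequency formula, 1 / (distance + 1). The sketch below reproduces only that arithmetic; the class name SloppyFreqDemo is hypothetical and the code does not call Lucene, it simply copies the formula used by DefaultSimilarity.sloppyFreq.

```java
// Standalone sketch: mirrors the 1 / (distance + 1) formula, no Lucene dependency.
final class SloppyFreqDemo {
    /** Same arithmetic as the default sloppy frequency: closer matches count more. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // A single-term span covers one position, so end - start = 1.
        int spanTermDistance = 1;
        System.out.println("SpanTermQuery freq contribution: " + sloppyFreq(spanTermDistance));
        // An exact match (distance 0) is what a plain term occurrence corresponds to.
        System.out.println("Exact-match freq contribution: " + sloppyFreq(0));
    }
}
```

Because every single-term span has end - start equal to 1, each occurrence contributes half of what the same occurrence contributes to a TermQuery.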

      1. LUCENE-2880.patch
        10 kB
        Adrien Grand
      2. LUCENE-2880.patch
        13 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

          Here's a quickly hacked up patch (core tests pass, but I didn't fix contrib, etc. yet).

          It's just to get ideas.

          The approach I took was for SpanQuery to have a new method:

            /** 
             * Returns the length (number of positions) in the query.
             * <p>
             * For example, for a simple Term this is 1.
             * For a NEAR of "foo" and "bar" this is 2.
             * This is used by SpanScorer to compute the appropriate slop factor,
             * so that SpanQueries score consistently with other queries.
             */
            public abstract int getLength();
          

          This is called once by the Weight, and passed to SpanScorer.

          Then SpanScorer computes the slop as:

          int matchLength = (spans.end() - spans.start()) - queryLength;
          

          instead of:

          int matchLength = spans.end() - spans.start();
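
To make the two formulas concrete, here is a small sketch with hand-picked positions. It assumes that end is one past the last matched position (which is what makes a single term yield 1 in the original formula); MatchLengthDemo and its method names are hypothetical and not part of the patch.

```java
// Standalone sketch comparing the original and the proposed match length.
final class MatchLengthDemo {
    /** Original behaviour: the raw span length is used as the slop. */
    static int originalMatchLength(int start, int end) {
        return end - start;
    }

    /** Proposed behaviour: subtract the number of positions the query itself needs. */
    static int adjustedMatchLength(int start, int end, int queryLength) {
        return (end - start) - queryLength;
    }

    public static void main(String[] args) {
        // Single term at position 3: span [3, 4), query length 1.
        System.out.println(originalMatchLength(3, 4) + " -> " + adjustedMatchLength(3, 4, 1));
        // NEAR of "foo" and "bar" at adjacent positions 3 and 4: span [3, 5), query length 2.
        System.out.println(originalMatchLength(3, 5) + " -> " + adjustedMatchLength(3, 5, 2));
        // Same NEAR with one position between the terms: span [3, 6), query length 2.
        System.out.println(originalMatchLength(3, 6) + " -> " + adjustedMatchLength(3, 6, 2));
    }
}
```

With the adjustment, exact matches get a slop of 0 and only the real gap between terms is counted.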
          
          Robert Muir added a comment -

          Thinking about this one: for this to really work correctly with the current setup (e.g. with SpanOrQuery),
          this length might have to be in the Spans class...

          But with LUCENE-2878 we nuke this class, so we can keep the issue open to think about how
          the slop should be computed for these queries. I think just using end - start is not the best.

          Paul Elschot added a comment -

          A related problem is that Spans does not have a weight (or whatever factor) of its own.
          Currently Spans can only be scored at the top level (by SpanScorer) and not when they are nested.
          In the nested case the only way to affect the score value is via the length.

          Paul Elschot added a comment -

          The getLength() method may not be straightforward.

          Does the getLength() method in SpanQuery also work in the nested case when there is a spanOr over two spanQueries of different length?

          It may be necessary to add this length to Spans because of this.

          Some reasons for a negative match length:

          • multiple terms indexed at the same position,
          • span distance queries with the same subqueries.

          I wish I had a good solution for this, but I did not find one yet.
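
The negative case can be reproduced with plain arithmetic. The sketch below assumes the (end - start) - queryLength formula from the earlier patch and the default 1 / (distance + 1) sloppy frequency; NegativeMatchLengthDemo and the positions used are hypothetical.

```java
// Standalone sketch of how a query-level length can produce a negative match length.
final class NegativeMatchLengthDemo {
    static int adjustedMatchLength(int start, int end, int queryLength) {
        return (end - start) - queryLength;
    }

    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // "foo" and "bar" indexed at the same position 5 (for example, synonyms):
        // the NEAR match spans [5, 6), but the query length is 2.
        // A distance query whose two clauses match the same occurrence gives the same numbers.
        int stacked = adjustedMatchLength(5, 6, 2);
        System.out.println("match length: " + stacked);
        // A distance of -1 makes the denominator zero, so the float division overflows.
        System.out.println("sloppyFreq: " + sloppyFreq(stacked));
    }
}
```

A match length of -1 is enough to break the default formula, so any query-level length would need a guard or a different definition of the distance.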

          Robert Muir added a comment -

          Paul, I agree. I think the only way it would work is to be in Spans itself,
          which is the real 'Scorer' for span queries. Because it's wrong for SpanOrQuery
          to have a getLength() really... just like it would be wrong for BooleanQuery
          to know anything about phrase slops of its subqueries!

          We can just leave this issue open and see what happens with
          LUCENE-2878, and maybe a good solution will then be more obvious.

          Adam Ringel added a comment -

          I subclassed DefaultSimilarity to work around this.
          Seemed simple enough.

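          // Assumes org.apache.log4j.Logger (log4j 1.x) and
          // org.apache.lucene.search.DefaultSimilarity (Lucene 3.x package layout).
          // Each public class goes in its own file.

          // Subtracts 2 from the span distance, clamping the result at 0.
          public class LUCENE2880_SloppyFreqDistanceAdjuster {
          	private static final Logger logger = Logger.getLogger(LUCENE2880_SloppyFreqDistanceAdjuster.class);
          
          	public int distance(int distance) {
          		if (distance < 2) {
          			// Smaller than expected: the underlying issue may already be fixed.
          			logger.warn("distance - distance is < 2, has LUCENE-2880 been resolved?");
          			return 0;
          		}
          
          		return distance - 2;
          	}
          
          }
          
          // Similarity that adjusts the distance before computing the sloppy frequency.
          public class LUCENE2880_DefaultSimilarity extends DefaultSimilarity {
          	private static final long serialVersionUID = 1L;
          	private static final LUCENE2880_SloppyFreqDistanceAdjuster ADJUSTER = new LUCENE2880_SloppyFreqDistanceAdjuster();
          
          	@Override
          	public float sloppyFreq(int distance) {
          		return super.sloppyFreq(ADJUSTER.distance(distance));
          	}
          
          }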
          
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Adrien Grand added a comment -

          Here is an updated patch that moves the method to the Spans class as suggested.

          SpanTermQuery now scores like TermQuery and an ordered SpanNearQuery scores like a PhraseQuery where all terms are at consecutive positions (the common case).
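
One way to picture the idea of keeping the value on the match itself is sketched below. This is a simplified, hypothetical model and not the Lucene Spans API or the patch: MatchSketch, TermMatch, OrderedNearMatch and WidthDemo are made-up names, and the ordered near width is assumed to be the sum of the gaps between consecutive sub-matches.

```java
/** Hypothetical stand-in for a positioned match; end is one past the last position. */
interface MatchSketch {
    int start();
    int end();
    int width(); // the value fed to the sloppy-frequency formula
}

/** A single term occurrence: always an exact match, so its width is 0. */
final class TermMatch implements MatchSketch {
    private final int position;
    TermMatch(int position) { this.position = position; }
    public int start() { return position; }
    public int end() { return position + 1; }
    public int width() { return 0; }
}

/** Ordered near over non-overlapping sub-matches given in document order (at least one). */
final class OrderedNearMatch implements MatchSketch {
    private final MatchSketch[] subs;
    OrderedNearMatch(MatchSketch... subs) { this.subs = subs; }
    public int start() { return subs[0].start(); }
    public int end() { return subs[subs.length - 1].end(); }
    public int width() {
        int gaps = 0;
        // Only the positions between consecutive sub-matches count as slop.
        for (int i = 1; i < subs.length; i++) {
            gaps += subs[i].start() - subs[i - 1].end();
        }
        return gaps;
    }
}

final class WidthDemo {
    static float sloppyFreq(int distance) { return 1.0f / (distance + 1); }

    public static void main(String[] args) {
        MatchSketch term = new TermMatch(3);
        MatchSketch adjacent = new OrderedNearMatch(new TermMatch(3), new TermMatch(4));
        MatchSketch gapped = new OrderedNearMatch(new TermMatch(3), new TermMatch(6));
        System.out.println("term: " + sloppyFreq(term.width()));
        System.out.println("adjacent near: " + sloppyFreq(adjacent.width()));
        System.out.println("gapped near: " + sloppyFreq(gapped.width()));
    }
}
```

Because each match reports its own value, a disjunction over sub-queries of different lengths can simply report the value of whichever sub-match is current, which sidesteps the per-query length problem raised earlier in the thread.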

          David Smiley added a comment -

          +1
          Wow, this is simpler than I thought it would be, based on the title & description anyway.

          Alan Woodward added a comment - edited

          +1

          Maybe width rather than distance as the method name?

          Adrien Grand added a comment -

          OK for width, I'll commit with distance renamed as width if there are no objections.

          ASF subversion and git services added a comment -

          Commit 1686301 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1686301 ]

          LUCENE-2880: Make span queries score more consistently with regular queries.

          ASF subversion and git services added a comment -

          Commit 1686308 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1686308 ]

          LUCENE-2880: Make span queries score more consistently with regular queries.

          ASF subversion and git services added a comment -

          Commit 1686337 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1686337 ]

          LUCENE-2880: Relax assertion: span near and phrase queries don't have the same scores if they wrap twice the same term.

          ASF subversion and git services added a comment -

          Commit 1686339 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1686339 ]

          LUCENE-2880: Relax assertion: span near and phrase queries don't have the same scores if they wrap twice the same term.

          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              1
              Watchers:
              7
