Lucene - Core
  1. Lucene - Core
  2. LUCENE-3215

SloppyPhraseScorer sometimes computes Infinite freq

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New
    1. LUCENE-3215_test.patch
      2 kB
      Robert Muir
    2. LUCENE-3215_test.patch
      3 kB
      Robert Muir
    3. LUCENE-3215.patch
      33 kB
      Doron Cohen
    4. LUCENE-3215.patch
      33 kB
      Doron Cohen
    5. LUCENE-3215.patch
      34 kB
      Doron Cohen
    6. LUCENE-3215.patch
      6 kB
      Robert Muir

      Activity

      Hide
      Robert Muir added a comment -

      test case

      Show
      Robert Muir added a comment - test case
      Hide
      Robert Muir added a comment -

      the problem in this case, is it computes a 'matchLength' of -1.
      then the default impl of sloppyFreq divides by zero, because its defined as:

      return 1.0f / (distance + 1);
      
      Show
      Robert Muir added a comment - the problem in this case, is it computes a 'matchLength' of -1. then the default impl of sloppyFreq divides by zero, because its defined as: return 1.0f / (distance + 1);
      Hide
      Robert Muir added a comment -

      those lyrics are entertaining, but here is a more concise test.

      Show
      Robert Muir added a comment - those lyrics are entertaining, but here is a more concise test.
      Hide
      Robert Muir added a comment -

      It seems to me the root cause of the bug is having position 'holes' (e.g. stopwords), especially across duplicated terms...

      For example, if you disable position increments in queryparser it won't trigger the bug.

      Show
      Robert Muir added a comment - It seems to me the root cause of the bug is having position 'holes' (e.g. stopwords), especially across duplicated terms... For example, if you disable position increments in queryparser it won't trigger the bug.
      Hide
      Robert Muir added a comment -

      Here's a hack patch...

      I tried writing some correctness tests here but its really hard for me to visualize what the freq should even be for these "sloppy repeating phrase queries with holes".

      Show
      Robert Muir added a comment - Here's a hack patch... I tried writing some correctness tests here but its really hard for me to visualize what the freq should even be for these "sloppy repeating phrase queries with holes".
      Hide
      Doron Cohen added a comment -

      An update on this...

      This is not related to LUCENE-3142 - the latter was fixed but this one still fails.

      The patch fix which 'abs' the distance indeed avoids the infinite score problem, but I was not 100% comfortable with it - how can the distance be none positive?

      Digging into it shows a wrong assumption in SloppyPhraseScorer:

          private int initPhrasePositions() throws IOException {
              int end = 0;
      

      The initial value of end assumes that all positions will be nonnegative.
      But this is wrong, as PP position is computed as

        position = postings.nextPosition() - offset
      

      So, whenever the query term appears in the doc in a position smaller than its offset in the query, the computed position is negative. The correct initialization for end is therefore:

          private int initPhrasePositions() throws IOException {
              int end = Integer.MIN_VALUE;
      

      You would expect this bug to surfaced sooner...

      Anyhow, for the 3 tests that Robert added, this only resolve testInfiniteFreq1() but the other two tests still fail, investigating...

      Show
      Doron Cohen added a comment - An update on this... This is not related to LUCENE-3142 - the latter was fixed but this one still fails. The patch fix which 'abs' the distance indeed avoids the infinite score problem, but I was not 100% comfortable with it - how can the distance be none positive? Digging into it shows a wrong assumption in SloppyPhraseScorer: private int initPhrasePositions() throws IOException { int end = 0; The initial value of end assumes that all positions will be nonnegative. But this is wrong, as PP position is computed as position = postings.nextPosition() - offset So, whenever the query term appears in the doc in a position smaller than its offset in the query, the computed position is negative. The correct initialization for end is therefore: private int initPhrasePositions() throws IOException { int end = Integer .MIN_VALUE; You would expect this bug to surfaced sooner... Anyhow, for the 3 tests that Robert added, this only resolve testInfiniteFreq1() but the other two tests still fail, investigating...
      Hide
      Doron Cohen added a comment - - edited

      OK I think I have a fix for this.

      While looking at it, I realized that PhraseScorer (the one that used to base both Exact&Sloppy phrase scorers but now is the base of only sloppy phrase scorer) is way too complicated and inefficient. All those sort calls after each matching doc can be avoided.

      So I am modifying PhraseScorer to not have a phrase-queue at all - just the sorted linked list, which is always kept sorted by advancing last beyond first. Last is renamed to 'min' and first is renamed to 'max'. Making the list cyclic allows more efficient manipulation of it.

      With this, SloppyPhraseScorer is modified to maintain its own phrase queue. The queue size is set at the first candidate document. In order to handle repetitions (Same term in different query offsets) it will contain only some of the pps: those that either have no repetitions, or are the first (lower query offset) in a repeating group. A linked list of repeating pps was added: so PhrasePositions has a new member: nextRepeating.

      Detection of repeating pps and creation of that list is done once per scorer: at the first candidate doc.

      For solving the bugs reported here, in addition to the initiation of 'end' as explained in previous comment, advanceRepeatingPPs now also update two values:

      • end, in case one of the repeating pps is far ahead (larger)
      • position of the first pp in a repeating list (the one that is in the queue - in case the repeating pp is far behind (smaller). This can happen when there are holes in the query, as position = tpPOs - offset. It fixes the problem of false negative distances which caused this bug. It is tricky: relies on that PhrasePositions.nextPosition() ignores pp.position and just call positions.nextPosition(). But it is correct, as the modified position is used to replace pp in the queue.

      Last, I think that the test added with holes had one wrong assert: It added four docs:

      • drug drug
      • drug druggy drug
      • drug druggy druggy drug
      • drug druggy drug druggy drug

      defined this query (number is the offset):

      • drug(1) drug(3)

      and expected that with slop=1 the first doc would not be found.
      I think it should be found, as the slop operates in both directions.
      So modified the query to: drug(1) drug(3)

      Patch to follow.

      Show
      Doron Cohen added a comment - - edited OK I think I have a fix for this. While looking at it, I realized that PhraseScorer (the one that used to base both Exact&Sloppy phrase scorers but now is the base of only sloppy phrase scorer) is way too complicated and inefficient. All those sort calls after each matching doc can be avoided. So I am modifying PhraseScorer to not have a phrase-queue at all - just the sorted linked list, which is always kept sorted by advancing last beyond first. Last is renamed to 'min' and first is renamed to 'max'. Making the list cyclic allows more efficient manipulation of it. With this, SloppyPhraseScorer is modified to maintain its own phrase queue. The queue size is set at the first candidate document. In order to handle repetitions (Same term in different query offsets) it will contain only some of the pps: those that either have no repetitions, or are the first (lower query offset) in a repeating group. A linked list of repeating pps was added: so PhrasePositions has a new member: nextRepeating. Detection of repeating pps and creation of that list is done once per scorer: at the first candidate doc. For solving the bugs reported here, in addition to the initiation of 'end' as explained in previous comment, advanceRepeatingPPs now also update two values: end, in case one of the repeating pps is far ahead (larger) position of the first pp in a repeating list (the one that is in the queue - in case the repeating pp is far behind (smaller). This can happen when there are holes in the query, as position = tpPOs - offset. It fixes the problem of false negative distances which caused this bug. It is tricky: relies on that PhrasePositions.nextPosition() ignores pp.position and just call positions.nextPosition(). But it is correct, as the modified position is used to replace pp in the queue. Last, I think that the test added with holes had one wrong assert: It added four docs: drug drug drug druggy drug drug druggy druggy drug drug druggy drug druggy drug defined this query (number is the offset): drug(1) drug(3) and expected that with slop=1 the first doc would not be found. I think it should be found, as the slop operates in both directions. So modified the query to: drug(1) drug(3) Patch to follow.
      Hide
      Doron Cohen added a comment -

      Attached patch is based on r1166541 - before recent changes to scorers. Will merge with recent changes tomorrow or so. All tests pass.
      I believe that sloppy scoring performance should improve with this change but did not check this.

      Show
      Doron Cohen added a comment - Attached patch is based on r1166541 - before recent changes to scorers. Will merge with recent changes tomorrow or so. All tests pass. I believe that sloppy scoring performance should improve with this change but did not check this.
      Hide
      Doron Cohen added a comment -

      Updated patch for current trunk r1172055.

      Show
      Doron Cohen added a comment - Updated patch for current trunk r1172055.
      Hide
      Doron Cohen added a comment -

      Previous patch still produces NANs and infinite scores with holes.
      Updated patch is fixing this, by updating END (before computing the new match-length) also for pp (not only for its repeats).

      I plan to commit this soon.

      Show
      Doron Cohen added a comment - Previous patch still produces NANs and infinite scores with holes. Updated patch is fixing this, by updating END (before computing the new match-length) also for pp (not only for its repeats). I plan to commit this soon.
      Hide
      Doron Cohen added a comment -

      Fixed

      • r1173961 - trunk
      • r1174002 - 3x

      Prior to committing I compared the performance of sloppy phrase queries with/out repeats for large documents with many candidate matches and did not see the anticipated speedup, though, at least, no degradations as well.

      Show
      Doron Cohen added a comment - Fixed r1173961 - trunk r1174002 - 3x Prior to committing I compared the performance of sloppy phrase queries with/out repeats for large documents with many candidate matches and did not see the anticipated speedup, though, at least, no degradations as well.
      Hide
      Uwe Schindler added a comment -

      Bulk close after release of 3.5

      Show
      Uwe Schindler added a comment - Bulk close after release of 3.5

        People

        • Assignee:
          Doron Cohen
          Reporter:
          Robert Muir
        • Votes:
          1 Vote for this issue
          Watchers:
          2 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development