Lucene - Core
  1. Lucene - Core
  2. LUCENE-2668

offset gap should be added regardless of existence of tokens in DocInverterPerField

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9.3, 3.0.2, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Problem: If a multiValued field which contains a stop word (e.g. "will" in the following sample) only value is analyzed by StopAnalyzer when indexing, the offsets of the subsequent tokens are not correct.

      indexing a multiValued field
      doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      doc.add( new Field( F, "will", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      doc.add( new Field( F, "use", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      

      In this program (soon to be attached), if you use WhitespaceAnalyzer, you'll get the offset(start,end) for "use" and "Lucene" will be use(10,13) and Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and lucene(13,19). When searching, since searcher cannot know what analyzer was used at indexing time, this problem causes out of alignment of FVH.

      Cause of the problem: StopAnalyzer filters out "will", anyToken flag set to false then offset gap is not added in DocInverterPerField:

      DocInverterPerField.java
      if (anyToken)
        fieldState.offset += docState.analyzer.getOffsetGap(field);
      

      I don't understand why the condition is there... If always the gap is added, I think things are simple.

      1. LUCENE-2668.patch
        13 kB
        Koji Sekiguchi
      2. LUCENE-2668.patch
        11 kB
        Koji Sekiguchi
      3. LUCENE-2668.patch
        1 kB
        Koji Sekiguchi
      4. Test.java
        3 kB
        Koji Sekiguchi

        Issue Links

          Activity

          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1
          Hide
          David Smiley added a comment -

          Apologies; I meant to post to LUCENE-2529

          Show
          David Smiley added a comment - Apologies; I meant to post to LUCENE-2529
          Hide
          Michael McCandless added a comment -

          do you know why it conditionally decrements the position and then increments it later unconditionally?

          This was done in LUCENE-1542, to prevent position increment from becoming negative, which messes up positions if the field has a payload. Basically we cannot store a -1 position if the field has a payload.

          But I agree it's awful that we are inconsistent and it requires hacks for apps that have 0 posIncr on the first token. Not sure how to best fix this though...

          Show
          Michael McCandless added a comment - do you know why it conditionally decrements the position and then increments it later unconditionally? This was done in LUCENE-1542 , to prevent position increment from becoming negative, which messes up positions if the field has a payload. Basically we cannot store a -1 position if the field has a payload. But I agree it's awful that we are inconsistent and it requires hacks for apps that have 0 posIncr on the first token. Not sure how to best fix this though...
          Hide
          David Smiley added a comment -

          So to anyone who's commented on this issue that has done work on this class... do you know why it conditionally decrements the position and then increments it later unconditionally? Reference lines 156 & 188. The very fact that it happens sometimes but not others is thwarting my efforts to have the term positions between multiple fields coincide (special purpose use case I have). I'm using a position filter that ensures that all terms for a value have 0 position increment gap, even the first one. But sometimes I have no value or I have a value that is a stop word. My hacky work-around is to set the first value to each of these multi-valued fields be some dummy value that gets indexed. This is ugly and wasteful on disk.

          Show
          David Smiley added a comment - So to anyone who's commented on this issue that has done work on this class... do you know why it conditionally decrements the position and then increments it later unconditionally? Reference lines 156 & 188. The very fact that it happens sometimes but not others is thwarting my efforts to have the term positions between multiple fields coincide (special purpose use case I have). I'm using a position filter that ensures that all terms for a value have 0 position increment gap, even the first one. But sometimes I have no value or I have a value that is a stop word. My hacky work-around is to set the first value to each of these multi-valued fields be some dummy value that gets indexed. This is ugly and wasteful on disk.
          Hide
          Koji Sekiguchi added a comment -

          trunk: Committed revision 1001796 and 1001957.
          branch_3x: Committed revision 1001991.

          Show
          Koji Sekiguchi added a comment - trunk: Committed revision 1001796 and 1001957. branch_3x: Committed revision 1001991.
          Hide
          Koji Sekiguchi added a comment -

          New patch attached. It includes LUCENE-2529 and its test case. I added setter/getter for positionIncrementGap to MockAnalyzer.

          Show
          Koji Sekiguchi added a comment - New patch attached. It includes LUCENE-2529 and its test case. I added setter/getter for positionIncrementGap to MockAnalyzer.
          Hide
          Koji Sekiguchi added a comment -

          Ok, I'll see this issue and LUCENE-2529 all together.

          Show
          Koji Sekiguchi added a comment - Ok, I'll see this issue and LUCENE-2529 all together.
          Hide
          Yonik Seeley added a comment -

          Seems like we should fix LUCENE-2529 for the positions too?

          Yep, really seems like almost the same issue.

          Show
          Yonik Seeley added a comment - Seems like we should fix LUCENE-2529 for the positions too? Yep, really seems like almost the same issue.
          Hide
          Robert Muir added a comment -

          Seems like we should fix LUCENE-2529 for the positions too?

          Show
          Robert Muir added a comment - Seems like we should fix LUCENE-2529 for the positions too?
          Hide
          Michael McCandless added a comment -

          I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value.

          OK so let's just break it and advertise this in the back compat breaks...

          Show
          Michael McCandless added a comment - I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value. OK so let's just break it and advertise this in the back compat breaks...
          Hide
          Yonik Seeley added a comment -

          Or do we think apps are not relying on this quirky behavior?

          I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value.

          Show
          Yonik Seeley added a comment - Or do we think apps are not relying on this quirky behavior? I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value.
          Hide
          Michael McCandless added a comment -

          +1

          But, what about index back compat? Should we switch this under Version? Or do we think apps are not relying on this quirky behavior?

          In the future, eg w/ write-once attr bindings in the analysis chain (LUCENE-2450), which lets us fully decouple analysis and indexing, how pos/offset gaps are added for multi-valued fields will be fully under the analyzer's control...

          Show
          Michael McCandless added a comment - +1 But, what about index back compat? Should we switch this under Version? Or do we think apps are not relying on this quirky behavior? In the future, eg w/ write-once attr bindings in the analysis chain ( LUCENE-2450 ), which lets us fully decouple analysis and indexing, how pos/offset gaps are added for multi-valued fields will be fully under the analyzer's control...
          Hide
          Koji Sekiguchi added a comment -

          Thanks Yonik for the comment.

          Here is the final patch. I'd like to commit in a few days if no one objects.

          Show
          Koji Sekiguchi added a comment - Thanks Yonik for the comment. Here is the final patch. I'd like to commit in a few days if no one objects.
          Hide
          Yonik Seeley added a comment -

          always adds offset gap

          +1
          I don't know the background on this, but this seems more consistent, intuitive, and potentially useful (i.e. being able to depend that different field values are anways >n*gap positions apart)

          Show
          Yonik Seeley added a comment - always adds offset gap +1 I don't know the background on this, but this seems more consistent, intuitive, and potentially useful (i.e. being able to depend that different field values are anways >n*gap positions apart)
          Hide
          Koji Sekiguchi added a comment -

          Here is my idea. It is very simple - always adds offset gap. This causes test failure in offset test of TestIndexWriter.
          Can anyone explain why the condition (if(anyToken)) is there?

          Show
          Koji Sekiguchi added a comment - Here is my idea. It is very simple - always adds offset gap. This causes test failure in offset test of TestIndexWriter. Can anyone explain why the condition (if(anyToken)) is there?
          Hide
          Koji Sekiguchi added a comment -

          The test program which can be used to see the problem.

          Show
          Koji Sekiguchi added a comment - The test program which can be used to see the problem.

            People

            • Assignee:
              Koji Sekiguchi
              Reporter:
              Koji Sekiguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development