Lucene - Core
  1. Lucene - Core
  2. LUCENE-2668

offset gap should be added regardless of existence of tokens in DocInverterPerField

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9.3, 3.0.2, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Problem: If a multiValued field which contains a stop word (e.g. "will" in the following sample) only value is analyzed by StopAnalyzer when indexing, the offsets of the subsequent tokens are not correct.

      indexing a multiValued field
      doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      doc.add( new Field( F, "will", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      doc.add( new Field( F, "use", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
      

      In this program (soon to be attached), if you use WhitespaceAnalyzer, you'll get the offset(start,end) for "use" and "Lucene" will be use(10,13) and Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and lucene(13,19). When searching, since searcher cannot know what analyzer was used at indexing time, this problem causes out of alignment of FVH.

      Cause of the problem: StopAnalyzer filters out "will", anyToken flag set to false then offset gap is not added in DocInverterPerField:

      DocInverterPerField.java
      if (anyToken)
        fieldState.offset += docState.analyzer.getOffsetGap(field);
      

      I don't understand why the condition is there... If always the gap is added, I think things are simple.

      1. Test.java
        3 kB
        Koji Sekiguchi
      2. LUCENE-2668.patch
        1 kB
        Koji Sekiguchi
      3. LUCENE-2668.patch
        11 kB
        Koji Sekiguchi
      4. LUCENE-2668.patch
        13 kB
        Koji Sekiguchi

        Issue Links

          Activity

          Koji Sekiguchi created issue -
          Hide
          Koji Sekiguchi added a comment -

          The test program which can be used to see the problem.

          Show
          Koji Sekiguchi added a comment - The test program which can be used to see the problem.
          Koji Sekiguchi made changes -
          Field Original Value New Value
          Attachment Test.java [ 12455571 ]
          Hide
          Koji Sekiguchi added a comment -

          Here is my idea. It is very simple - always adds offset gap. This causes test failure in offset test of TestIndexWriter.
          Can anyone explain why the condition (if(anyToken)) is there?

          Show
          Koji Sekiguchi added a comment - Here is my idea. It is very simple - always adds offset gap. This causes test failure in offset test of TestIndexWriter. Can anyone explain why the condition (if(anyToken)) is there?
          Koji Sekiguchi made changes -
          Attachment LUCENE-2668.patch [ 12455584 ]
          Hide
          Yonik Seeley added a comment -

          always adds offset gap

          +1
          I don't know the background on this, but this seems more consistent, intuitive, and potentially useful (i.e. being able to depend that different field values are anways >n*gap positions apart)

          Show
          Yonik Seeley added a comment - always adds offset gap +1 I don't know the background on this, but this seems more consistent, intuitive, and potentially useful (i.e. being able to depend that different field values are anways >n*gap positions apart)
          Hide
          Koji Sekiguchi added a comment -

          Thanks Yonik for the comment.

          Here is the final patch. I'd like to commit in a few days if no one objects.

          Show
          Koji Sekiguchi added a comment - Thanks Yonik for the comment. Here is the final patch. I'd like to commit in a few days if no one objects.
          Koji Sekiguchi made changes -
          Attachment LUCENE-2668.patch [ 12455586 ]
          Hide
          Michael McCandless added a comment -

          +1

          But, what about index back compat? Should we switch this under Version? Or do we think apps are not relying on this quirky behavior?

          In the future, eg w/ write-once attr bindings in the analysis chain (LUCENE-2450), which lets us fully decouple analysis and indexing, how pos/offset gaps are added for multi-valued fields will be fully under the analyzer's control...

          Show
          Michael McCandless added a comment - +1 But, what about index back compat? Should we switch this under Version? Or do we think apps are not relying on this quirky behavior? In the future, eg w/ write-once attr bindings in the analysis chain ( LUCENE-2450 ), which lets us fully decouple analysis and indexing, how pos/offset gaps are added for multi-valued fields will be fully under the analyzer's control...
          Hide
          Yonik Seeley added a comment -

          Or do we think apps are not relying on this quirky behavior?

          I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value.

          Show
          Yonik Seeley added a comment - Or do we think apps are not relying on this quirky behavior? I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value.
          Hide
          Michael McCandless added a comment -

          I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value.

          OK so let's just break it and advertise this in the back compat breaks...

          Show
          Michael McCandless added a comment - I'd doubt there's a single one relying on the current behavior - in fact I think it's more likely that there's an app out there relying on the proposed behavior and they just haven't hit the case when no tokens were indexed for a field value. OK so let's just break it and advertise this in the back compat breaks...
          Hide
          Robert Muir added a comment -

          Seems like we should fix LUCENE-2529 for the positions too?

          Show
          Robert Muir added a comment - Seems like we should fix LUCENE-2529 for the positions too?
          Robert Muir made changes -
          Link This issue is related to LUCENE-2529 [ LUCENE-2529 ]
          Hide
          Yonik Seeley added a comment -

          Seems like we should fix LUCENE-2529 for the positions too?

          Yep, really seems like almost the same issue.

          Show
          Yonik Seeley added a comment - Seems like we should fix LUCENE-2529 for the positions too? Yep, really seems like almost the same issue.
          Hide
          Koji Sekiguchi added a comment -

          Ok, I'll see this issue and LUCENE-2529 all together.

          Show
          Koji Sekiguchi added a comment - Ok, I'll see this issue and LUCENE-2529 all together.
          Koji Sekiguchi made changes -
          Assignee Koji Sekiguchi [ koji ]
          Hide
          Koji Sekiguchi added a comment -

          New patch attached. It includes LUCENE-2529 and its test case. I added setter/getter for positionIncrementGap to MockAnalyzer.

          Show
          Koji Sekiguchi added a comment - New patch attached. It includes LUCENE-2529 and its test case. I added setter/getter for positionIncrementGap to MockAnalyzer.
          Koji Sekiguchi made changes -
          Attachment LUCENE-2668.patch [ 12455632 ]
          Hide
          Koji Sekiguchi added a comment -

          trunk: Committed revision 1001796 and 1001957.
          branch_3x: Committed revision 1001991.

          Show
          Koji Sekiguchi added a comment - trunk: Committed revision 1001796 and 1001957. branch_3x: Committed revision 1001991.
          Koji Sekiguchi made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.1 [ 12314822 ]
          Fix Version/s 4.0 [ 12314025 ]
          Resolution Fixed [ 1 ]
          Hide
          David Smiley added a comment -

          So to anyone who's commented on this issue that has done work on this class... do you know why it conditionally decrements the position and then increments it later unconditionally? Reference lines 156 & 188. The very fact that it happens sometimes but not others is thwarting my efforts to have the term positions between multiple fields coincide (special purpose use case I have). I'm using a position filter that ensures that all terms for a value have 0 position increment gap, even the first one. But sometimes I have no value or I have a value that is a stop word. My hacky work-around is to set the first value to each of these multi-valued fields be some dummy value that gets indexed. This is ugly and wasteful on disk.

          Show
          David Smiley added a comment - So to anyone who's commented on this issue that has done work on this class... do you know why it conditionally decrements the position and then increments it later unconditionally? Reference lines 156 & 188. The very fact that it happens sometimes but not others is thwarting my efforts to have the term positions between multiple fields coincide (special purpose use case I have). I'm using a position filter that ensures that all terms for a value have 0 position increment gap, even the first one. But sometimes I have no value or I have a value that is a stop word. My hacky work-around is to set the first value to each of these multi-valued fields be some dummy value that gets indexed. This is ugly and wasteful on disk.
          Hide
          Michael McCandless added a comment -

          do you know why it conditionally decrements the position and then increments it later unconditionally?

          This was done in LUCENE-1542, to prevent position increment from becoming negative, which messes up positions if the field has a payload. Basically we cannot store a -1 position if the field has a payload.

          But I agree it's awful that we are inconsistent and it requires hacks for apps that have 0 posIncr on the first token. Not sure how to best fix this though...

          Show
          Michael McCandless added a comment - do you know why it conditionally decrements the position and then increments it later unconditionally? This was done in LUCENE-1542 , to prevent position increment from becoming negative, which messes up positions if the field has a payload. Basically we cannot store a -1 position if the field has a payload. But I agree it's awful that we are inconsistent and it requires hacks for apps that have 0 posIncr on the first token. Not sure how to best fix this though...
          Hide
          David Smiley added a comment -

          Apologies; I meant to post to LUCENE-2529

          Show
          David Smiley added a comment - Apologies; I meant to post to LUCENE-2529
          Mark Thomas made changes -
          Workflow jira [ 12521519 ] Default workflow, editable Closed status [ 12564317 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564317 ] jira [ 12584847 ]
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1
          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Koji Sekiguchi
              Reporter:
              Koji Sekiguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development