Lucene - Core
  1. Lucene - Core
  2. LUCENE-1255

CheckIndex should allow term position = -1

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.3.2, 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from this discussion:

      http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/%3CPine.LNX.4.62.0803292323350.16762@radix.cryptio.net%3E

      Right now CheckIndex claims the index is corrupt if you index a Token with -1 position, which happens if your first token has positionIncrementGap set to 0.

      But, as far as I can tell, Lucene doesn't "mind" when this happens.

      So I plan to fix CheckIndex to allow this case. I'll backport to 2.3.2 as well.

      LUCENE-1253 is one example where Lucene's core analyzers could do this.

      1. LUCENE-1255.patch
        4 kB
        Michael McCandless
      2. LUCENE-1255.take2.patch
        5 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Attached patch. I plan to commit in a day or two.

          I also noticed that CheckIndex would incorrectly mark an index as corrupt if two sequential positions are the same, which of course is perfectly fine (eg if you are injecting synonyms with positionIncrementGap=0). So I fixed that case too.

          Show
          Michael McCandless added a comment - Attached patch. I plan to commit in a day or two. I also noticed that CheckIndex would incorrectly mark an index as corrupt if two sequential positions are the same, which of course is perfectly fine (eg if you are injecting synonyms with positionIncrementGap=0). So I fixed that case too.
          Hide
          Hoss Man added a comment -

          FWIW: Apparently Luke gets confused by this as well.

          if i'm understanding correctly, the only reason position "-1" seems to come about is because of some internal counter that starts at "-1" since it assumes the first token will have an increment of "1" and thus: the first position used is "0"

          Changing IndexWriter/DocumentsWriter/whatever to ignore the increment of the first Token, or make it realitve to "0" might be a better solution in general.

          cases to consider: what should it mean for position info if:

          • first token has increment of 0
          • first token has increment of 1
          • first token has increment of 27

          in my opinion, the answers should probably be: 0, 1, 27 ... but i think that changes the common case of "1" (which would currently gets a position of "0" right?) ... so maybe the right behavior is 0,0,26

          Show
          Hoss Man added a comment - FWIW: Apparently Luke gets confused by this as well. if i'm understanding correctly, the only reason position "-1" seems to come about is because of some internal counter that starts at "-1" since it assumes the first token will have an increment of "1" and thus: the first position used is "0" Changing IndexWriter/DocumentsWriter/whatever to ignore the increment of the first Token, or make it realitve to "0" might be a better solution in general. cases to consider: what should it mean for position info if: first token has increment of 0 first token has increment of 1 first token has increment of 27 in my opinion, the answers should probably be: 0, 1, 27 ... but i think that changes the common case of "1" (which would currently gets a position of "0" right?) ... so maybe the right behavior is 0,0,26
          Hide
          Doug Cutting added a comment -

          Since the increment is relative to the prior token, it should make no difference when there is no prior token. So shouldn't the first token be at position=0 no matter what its increment?

          Show
          Doug Cutting added a comment - Since the increment is relative to the prior token, it should make no difference when there is no prior token. So shouldn't the first token be at position=0 no matter what its increment?
          Hide
          Hoss Man added a comment -

          Since the increment is relative to the prior token, it should make no difference when there is no prior token. So shouldn't the first token be at position=0 no matter what its increment?

          I'm ok with that ... but the other way to look at it is that there is a "virtual token" signifing the start of the Field, and the first physical tokens increment is relative that ... with things like SpanFirstQueries not allowing the "first" token to specify an arbitrary increment from the 'virtual" start of the field has real consequences.

          (consider an Analyzer that only indexes proper names and wants to record real positions so that a SpanFirstQuery can find docs where Doug Cutting appears in the first 100 Tokens)

          Show
          Hoss Man added a comment - Since the increment is relative to the prior token, it should make no difference when there is no prior token. So shouldn't the first token be at position=0 no matter what its increment? I'm ok with that ... but the other way to look at it is that there is a "virtual token" signifing the start of the Field, and the first physical tokens increment is relative that ... with things like SpanFirstQueries not allowing the "first" token to specify an arbitrary increment from the 'virtual" start of the field has real consequences. (consider an Analyzer that only indexes proper names and wants to record real positions so that a SpanFirstQuery can find docs where Doug Cutting appears in the first 100 Tokens)
          Hide
          Doug Cutting added a comment -

          Hoss: you're right, the change I proposed would be incompatible and might make some applications more difficult to write, with little benefit except aesthetic.

          Show
          Doug Cutting added a comment - Hoss: you're right, the change I proposed would be incompatible and might make some applications more difficult to write, with little benefit except aesthetic.
          Hide
          Michael McCandless added a comment -

          OK so I'll change the patch so that the first Token's position is max(0, increment-1).

          This only changes the "increment=0" case to now set position to 0 not -1.

          And I'll change CheckIndex again to reject position == -1.

          Show
          Michael McCandless added a comment - OK so I'll change the patch so that the first Token's position is max(0, increment-1). This only changes the "increment=0" case to now set position to 0 not -1. And I'll change CheckIndex again to reject position == -1.
          Hide
          Michael McCandless added a comment -

          Attached new patch.

          Show
          Michael McCandless added a comment - Attached new patch.
          Hide
          Michael Busch added a comment -

          I don't think this was a backwards-compatible change.

          With this change you can't tell anymore if the first token had a positionIncrement
          set to 0 or 1. TermPositions.nextPosition() returns 0 in both cases. Applications
          today might rely on getting -1 from nextPosition() in the former case.

          Show
          Michael Busch added a comment - I don't think this was a backwards-compatible change. With this change you can't tell anymore if the first token had a positionIncrement set to 0 or 1. TermPositions.nextPosition() returns 0 in both cases. Applications today might rely on getting -1 from nextPosition() in the former case.
          Hide
          Michael McCandless added a comment -

          Hmm, you're right.

          OK so I guess we should revert it entirely, on 2.4 & 2.3.2, and continue to accept -1 position in the index? It doesn't seem to cause any real harm, except CheckIndex was [incorrectly] flagging it as corruption, and, Luke was also unhappy with it.

          Show
          Michael McCandless added a comment - Hmm, you're right. OK so I guess we should revert it entirely, on 2.4 & 2.3.2, and continue to accept -1 position in the index? It doesn't seem to cause any real harm, except CheckIndex was [incorrectly] flagging it as corruption, and, Luke was also unhappy with it.
          Hide
          Michael Busch added a comment -

          OK so I guess we should revert it entirely, on 2.4 & 2.3.2, and continue to accept -1 position in the index?

          Yes, I think we should do that.

          Show
          Michael Busch added a comment - OK so I guess we should revert it entirely, on 2.4 & 2.3.2, and continue to accept -1 position in the index? Yes, I think we should do that.
          Hide
          Mark Miller added a comment -

          why should the first token have an increment of 1? I think the first token should always be 0. Increments are between tokens.

          and why do you want to know if the first token had an inc of 0 or 1? I think the -1 is a bug and it shows up elsewhere - eg spans,payloads, and most likely highlighting with the SpanHighlighter.

          Show
          Mark Miller added a comment - why should the first token have an increment of 1? I think the first token should always be 0. Increments are between tokens. and why do you want to know if the first token had an inc of 0 or 1? I think the -1 is a bug and it shows up elsewhere - eg spans,payloads, and most likely highlighting with the SpanHighlighter.
          Hide
          Uwe Schindler added a comment -

          Fix reverted and corrected to disallow negative posIncr in LUCENE-1542.

          Show
          Uwe Schindler added a comment - Fix reverted and corrected to disallow negative posIncr in LUCENE-1542 .

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development