Lucene - Core
  1. Lucene - Core
  2. LUCENE-1542

Lucene can incorrectly set the position of tokens that start a field with positonInc 0.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      More info in LUCENE-1465

      1. LUCENE-1542.patch
        0.7 kB
        Mark Miller
      2. LUCENE-1542.patch
        3 kB
        Mark Miller
      3. LUCENE-1542.patch
        13 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          OK I plan to commit the current patch shortly. I'll add an entry under "Changes in runtime behavior" explaining the change...

          Show
          Michael McCandless added a comment - OK I plan to commit the current patch shortly. I'll add an entry under "Changes in runtime behavior" explaining the change...
          Hide
          Shai Erera added a comment -

          I don't think there's a user-visible impact. This happens only if you set posIncr=0 for the very first token. I guess it's not a common thing (if at all), which is why we haven't heard of it?
          I was worried though that if I have an index with -1s already encoded, and I don't call setAllowMinus1Position, then some of the positions will be -1 and some 0. And since it is deprecated, and will be removed in 3.0, I'll need to reindex in 3.0.

          But I agree that a bug fix should not be carried into the internal Lucene processes. You should reindex.

          Show
          Shai Erera added a comment - I don't think there's a user-visible impact. This happens only if you set posIncr=0 for the very first token. I guess it's not a common thing (if at all), which is why we haven't heard of it? I was worried though that if I have an index with -1s already encoded, and I don't call setAllowMinus1Position, then some of the positions will be -1 and some 0. And since it is deprecated, and will be removed in 3.0, I'll need to reindex in 3.0. But I agree that a bug fix should not be carried into the internal Lucene processes. You should reindex.
          Hide
          Yonik Seeley added a comment -

          Most bug fixes aren't back compatible
          What's the real user-visible impact of this fix to someone who hasn't re-indexed?

          Show
          Yonik Seeley added a comment - Most bug fixes aren't back compatible What's the real user-visible impact of this fix to someone who hasn't re-indexed?
          Hide
          Shai Erera added a comment -

          I absolutely agree this is a bug that should be fixed. I was just worried with keeping the bug there, and forcing the app to reindex. I thought that can be avoided if we fix it under the cover, whenever merges occur. But I may be wrong.

          Maybe when this new "segment-level metadata" comes in, we could have written some code which reads a Segment and based on its version fixes the positions.

          Oh well .. it's just one more case where the app would need to reindex due to a bug fix (the other case I'm aware of is the invalid acronyms). I suppose that's acceptable, since it's a bug fix.

          Show
          Shai Erera added a comment - I absolutely agree this is a bug that should be fixed. I was just worried with keeping the bug there, and forcing the app to reindex. I thought that can be avoided if we fix it under the cover, whenever merges occur. But I may be wrong. Maybe when this new "segment-level metadata" comes in, we could have written some code which reads a Segment and based on its version fixes the positions. Oh well .. it's just one more case where the app would need to reindex due to a bug fix (the other case I'm aware of is the invalid acronyms). I suppose that's acceptable, since it's a bug fix.
          Hide
          Michael McCandless added a comment -

          My question is - when will those -1 positions be fixed?

          I think the app must decide that? I don't think we should correct it
          during merging, since that'd sneakily change your index whenever
          merges complete?

          We could leave this deprecated "keep the bug" method around until 4.0?
          This way you'd have until 4.0 to reindex.

          I think this breaks back-compat

          Right, my patch breaks back compat, but I think this bug warrants an
          exception.

          This is a bad bad bug. Not only does it corrupt your positions
          (storing Int.MAX_VALUE instead of -1, and then storing the next
          position as Int.MIN_VALUE), it also can allow that corruption to
          spread as segments are merged (if those other segments didn't have
          docs w/ payloads). And, it causes Span*Query to return the wrong
          results in some cases.

          I think new users shouldn't have to wait until 4.0 to see this bug
          fixed?

          I suppose an alternate approach would be to leave the -1 bug in place,
          and only fix the case when there are payloads. It'd be messy. I
          think we'd have to fix SegmentTermPositions to add an "if (firstTime
          && pos==Integer.MAX_VALUE)" to rewire it back to -1. If we did this
          we'd be back to Lucene's "oddity". It's not great because it's a perf
          cost on the search side...

          Show
          Michael McCandless added a comment - My question is - when will those -1 positions be fixed? I think the app must decide that? I don't think we should correct it during merging, since that'd sneakily change your index whenever merges complete? We could leave this deprecated "keep the bug" method around until 4.0? This way you'd have until 4.0 to reindex. I think this breaks back-compat Right, my patch breaks back compat, but I think this bug warrants an exception. This is a bad bad bug. Not only does it corrupt your positions (storing Int.MAX_VALUE instead of -1, and then storing the next position as Int.MIN_VALUE), it also can allow that corruption to spread as segments are merged (if those other segments didn't have docs w/ payloads). And, it causes Span*Query to return the wrong results in some cases. I think new users shouldn't have to wait until 4.0 to see this bug fixed? I suppose an alternate approach would be to leave the -1 bug in place, and only fix the case when there are payloads. It'd be messy. I think we'd have to fix SegmentTermPositions to add an "if (firstTime && pos==Integer.MAX_VALUE)" to rewire it back to -1. If we did this we'd be back to Lucene's "oddity". It's not great because it's a perf cost on the search side...
          Hide
          Shai Erera added a comment -

          So Mike, just for clarification - let's say I have an index with -1 encoded for positions, already. I then upgrade to 2.9 and realize I should set this on IndexWriter, so in fact more -1 positions will be encoded.

          My question is - when will those -1 positions be fixed? I think this breaks back-compat since it changes the indexed data, and should be handled just like any other indexed data/format changes - i.e., last until 4.0. In the meantime, we can make sure that when segments are merged, or the index is optimized, or whatever else we do to support those back-compat issues, we fix those encodings, so that hopefully by 4.0 my indexes don't contain the -1s anymore (if they do, then I'm screwed and can choose between not upgrading to 4.0, or rebuild them).

          If I'm right, then you don't need this deprecated method, and make the changes under the covers?

          If I'm wrong, and our back-compat policy only covers index format changes, then I will already need to rebuild my indexes, so why wait until 3.0? Basically this is one of the cases that were discussed recently on the back-compat policy thread - a change to indexed data. One that we did not agree on (I vaguely remember we said it should be handled like index format changes, but I may be wrong).

          Show
          Shai Erera added a comment - So Mike, just for clarification - let's say I have an index with -1 encoded for positions, already. I then upgrade to 2.9 and realize I should set this on IndexWriter, so in fact more -1 positions will be encoded. My question is - when will those -1 positions be fixed? I think this breaks back-compat since it changes the indexed data, and should be handled just like any other indexed data/format changes - i.e., last until 4.0. In the meantime, we can make sure that when segments are merged, or the index is optimized, or whatever else we do to support those back-compat issues, we fix those encodings, so that hopefully by 4.0 my indexes don't contain the -1s anymore (if they do, then I'm screwed and can choose between not upgrading to 4.0, or rebuild them). If I'm right, then you don't need this deprecated method, and make the changes under the covers? If I'm wrong, and our back-compat policy only covers index format changes, then I will already need to rebuild my indexes, so why wait until 3.0? Basically this is one of the cases that were discussed recently on the back-compat policy thread - a change to indexed data. One that we did not agree on (I vaguely remember we said it should be handled like index format changes, but I may be wrong).
          Hide
          Michael McCandless added a comment -

          New patch attached, that merges in Mark's test & fix, and the original
          spans test (converted to a unit test) from LUCENE-1465, and adds
          a method to IndexWriter to emulate the buggy behavior.

          Previously, in LUCENE-1255, this problem was just an "oddity" that
          Lucene would record position -1 for the first token(s) if those tokens
          all have position incrment 0. We started to fix it, realized it
          breaks back-compat, and reverted it (accepting the "oddity").

          Now, for this issue we are realizing the problem is much worse if a
          payload happens to be attached to such tokens: instead of -1, the
          position now comes back as Integer.MAX_VALUE (a side effect of how
          payloads are stored in the index, which require that position delta be
          non-zero), which then messes up *SpanQuery and I'm sure other things.
          Subsequent tokens (once posIncr is > 0) then overflow int, and switch
          to MIN_VALUE.

          I think this is a real and nasty bug, and we should fix it, despite
          back-compat.

          So in the patch, I've added deprecated
          IndexWriter.setAllowMinus1Postion() to get back to the buggy
          behaviour, if for some reason an application needs this, and then
          fixed the bug by default.

          Show
          Michael McCandless added a comment - New patch attached, that merges in Mark's test & fix, and the original spans test (converted to a unit test) from LUCENE-1465 , and adds a method to IndexWriter to emulate the buggy behavior. Previously, in LUCENE-1255 , this problem was just an "oddity" that Lucene would record position -1 for the first token(s) if those tokens all have position incrment 0. We started to fix it, realized it breaks back-compat, and reverted it (accepting the "oddity"). Now, for this issue we are realizing the problem is much worse if a payload happens to be attached to such tokens: instead of -1, the position now comes back as Integer.MAX_VALUE (a side effect of how payloads are stored in the index, which require that position delta be non-zero), which then messes up *SpanQuery and I'm sure other things. Subsequent tokens (once posIncr is > 0) then overflow int, and switch to MIN_VALUE. I think this is a real and nasty bug, and we should fix it, despite back-compat. So in the patch, I've added deprecated IndexWriter.setAllowMinus1Postion() to get back to the buggy behaviour, if for some reason an application needs this, and then fixed the bug by default.
          Hide
          Shai Erera added a comment -

          Just wanted to say we've had an internal discussion at work about it, when we wanted to utilize the positions to encode integers, and found out that we always need to increment the position returned by 1 (i.e., if you set posIncr to 5, the position you get when iterating on the positions is 4), and we specifically did not understand why when you set the posIncr to 0 for the first position, Lucene writes a -1. (Well, we understood why it happens, but didn't understand the reason).

          So whatever you do here, I'm glad this issue was opened.

          We figured that the right solution on our side, w/o changing the Lucene code, is to not set the posIncr for the first position, but do so from the 2nd forward. Maybe that's what we need to do in Lucene? I.e. if posIncr is 0 for the first position, we don't decrement by 1?

          Show
          Shai Erera added a comment - Just wanted to say we've had an internal discussion at work about it, when we wanted to utilize the positions to encode integers, and found out that we always need to increment the position returned by 1 (i.e., if you set posIncr to 5, the position you get when iterating on the positions is 4), and we specifically did not understand why when you set the posIncr to 0 for the first position, Lucene writes a -1. (Well, we understood why it happens, but didn't understand the reason). So whatever you do here, I'm glad this issue was opened. We figured that the right solution on our side, w/o changing the Lucene code, is to not set the posIncr for the first position, but do so from the 2nd forward. Maybe that's what we need to do in Lucene? I.e. if posIncr is 0 for the first position, we don't decrement by 1?
          Hide
          Mark Miller added a comment -

          I don't think the fix here needs to disallow -1, but I think it must put the tokens at the right positions, and that is not -1.

          Show
          Mark Miller added a comment - I don't think the fix here needs to disallow -1, but I think it must put the tokens at the right positions, and that is not -1.
          Hide
          Mark Miller added a comment -

          and Spans. if its its included in a span, it will think the span starts ends at 1/+1 without payload it looks and with payloads +/ 2147483647 - or something to that effect.

          Really, anything that counts on the position of the term is going to be screwed I think.

          Show
          Mark Miller added a comment - and Spans. if its its included in a span, it will think the span starts ends at 1/+1 without payload it looks and with payloads +/ 2147483647 - or something to that effect. Really, anything that counts on the position of the term is going to be screwed I think.
          Hide
          Mark Miller added a comment -

          with unit test

          Show
          Mark Miller added a comment - with unit test
          Hide
          Michael McCandless added a comment -

          Alas, this looks like a dup of LUCENE-1255, where we at first did a fix (like this one) but then decided it was not back-compatible and so reverted it.

          However, if that first token (with posIncr=0) also has a payload, it appears to be particularly disastrous, since the way we encode a payload (by left-shifting the position delta by 1 bit) does not preserve the -1, right?

          Show
          Michael McCandless added a comment - Alas, this looks like a dup of LUCENE-1255 , where we at first did a fix (like this one) but then decided it was not back-compatible and so reverted it. However, if that first token (with posIncr=0) also has a payload, it appears to be particularly disastrous, since the way we encode a payload (by left-shifting the position delta by 1 bit) does not preserve the -1, right?
          Hide
          Mark Miller added a comment -

          so that appears to fix it - but i'm not sure thats the right fix. have to look closer at why we do the -1, and then sometimes do 0 - 1 for a position. odd.

          Show
          Mark Miller added a comment - so that appears to fix it - but i'm not sure thats the right fix. have to look closer at why we do the -1, and then sometimes do 0 - 1 for a position. odd.
          Hide
          Mark Miller added a comment -

          something like this to fix

          Show
          Mark Miller added a comment - something like this to fix
          Hide
          Jonathan Mamou added a comment -

          I think that the bug is not related to payload and to the fact that terms at located at the same position.
          It seems to occur only for the first term of the document, if its positionIncrement is equal to 0. In this case, the position of the first term will be wrong: -1 if there is no payload, and 2147483647 if there is a payload.

          Show
          Jonathan Mamou added a comment - I think that the bug is not related to payload and to the fact that terms at located at the same position. It seems to occur only for the first term of the document, if its positionIncrement is equal to 0. In this case, the position of the first term will be wrong: -1 if there is no payload, and 2147483647 if there is a payload.

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Mark Miller
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development