Lucene - Core
LUCENE-3849

position increments should be implemented by TokenStream.end()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.5, Trunk
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      If you have the pages of a book as a multivalued field, with the default position increment gap
      of Analyzer.java (0), phrase queries won't work across pages if one page ends with stopword(s).

      This is because the 'trailing holes' are not taken into account in end(). So I think in
      TokenStream.end(), subclasses of FilteringTokenFilter (e.g. StopFilter) should do:

      super.end();
      // carry any trailing hole into the end-of-stream state
      posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
      

      One problem is that these filters need to 'add' to the posInc, but currently nothing clears
      the attributes for end() [they are dirty, except the offset, which is set by the tokenizer].

      Also the indexer should be changed to pull posIncAtt from end().
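
      As a rough sketch of the filter-side change (not the actual StopFilter/FilteringTokenFilter
      code; the class name and accept() hook below are hypothetical placeholders), a token-removing
      filter could track the positions it skips and report any trailing hole from end():

      import java.io.IOException;
      import org.apache.lucene.analysis.TokenFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

      // Hypothetical token-removing filter: skipped positions are folded into the
      // next emitted token, and any trailing hole is reported from end().
      abstract class SkippingTokenFilter extends TokenFilter {
        private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);
        private int skippedPositions; // positions removed since the last emitted token

        protected SkippingTokenFilter(TokenStream input) {
          super(input);
        }

        /** Return true to keep the current token. */
        protected abstract boolean accept() throws IOException;

        @Override
        public boolean incrementToken() throws IOException {
          skippedPositions = 0;
          while (input.incrementToken()) {
            if (accept()) {
              // fold the hole into the next emitted token
              posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
              return true;
            }
            skippedPositions += posIncAtt.getPositionIncrement();
          }
          return false; // whatever is left in skippedPositions is the trailing hole
        }

        @Override
        public void end() throws IOException {
          super.end();
          // report the trailing hole so the indexer can account for it
          posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
        }

        @Override
        public void reset() throws IOException {
          super.reset();
          skippedPositions = 0;
        }
      }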

      1. LUCENE-3849.patch
        7 kB
        Robert Muir
      2. LUCENE-3849.patch
        27 kB
        Robert Muir
      3. LUCENE-3849.patch
        31 kB
        Michael McCandless
      4. LUCENE-3849.patch
        34 kB
        Michael McCandless

          Activity

          Adrien Grand added a comment -

          4.5 release -> bulk close

          ASF subversion and git services added a comment -

          Commit 1516001 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1516001 ]

          LUCENE-3849: fix some more test only TokenStreams

          ASF subversion and git services added a comment -

          Commit 1515994 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1515994 ]

          LUCENE-3849: fix some more test only TokenStreams

          ASF subversion and git services added a comment -

          Commit 1515909 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1515909 ]

          LUCENE-3849: end() now sets position increment, so any trailing holes are counted

          ASF subversion and git services added a comment -

          Commit 1515887 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1515887 ]

          LUCENE-3849: end() now sets position increment, so any trailing holes are counted

          Michael McCandless added a comment -

          So I think when this issue is committed we should resolve LUCENE-4318.

          OK, will do!

          Robert Muir added a comment -

          That's a good idea on end(); I'll do that and check all impls.

          I did this: I don't think there are any issues (or you fixed them in the patch).

          So I think when this issue is committed we should resolve LUCENE-4318. Thank you!

          Michael McCandless added a comment -

          I think because the facet module cut over from payloads to DV, many of the problematic TokenStreams disappeared? But there was still one, inside DirectoryTaxonomyWriter, that I fixed in the patch ... it now calls clearAttributes and sets each att on incrementToken (a sketch of that pattern follows below).

          That's a good idea on end(); I'll do that and check all impls.

          I don't see a better way than setting posInc to 0 in end ... and I agree this bug is bad. It can also affect suggesters, e.g. if one uses ShingleFilter after StopFilter and the user's last word was a stop word.
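
          A minimal sketch of that clearAttributes pattern (the class name is hypothetical; this is not the actual DirectoryTaxonomyWriter stream):

          import java.io.IOException;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

          // Hypothetical single-token stream: clearAttributes() first, then set every
          // attribute this stream produces, so no stale state leaks between tokens.
          final class SingleTokenStream extends TokenStream {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);
            private final String value;
            private boolean done;

            SingleTokenStream(String value) {
              this.value = value;
            }

            @Override
            public boolean incrementToken() {
              if (done) {
                return false;
              }
              done = true;
              clearAttributes();                // wipe whatever the previous token left behind
              termAtt.setEmpty().append(value);
              posIncAtt.setPositionIncrement(1);
              return true;
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              done = false;
            }
          }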

          Robert Muir added a comment -

          Thanks for bringing this back to life Mike.

          How did you deal with facets? Is this stuff out of date now that it no longer encodes with payloads? It was the big barrier to my previous patch: lots of TokenStreams never calling clearAttributes at all (LUCENE-4318).

          If it's no longer a problem, let's mark that issue resolved.

          Patch looks good to me, though it would be good to temporarily make end() final or something in TokenStream.java and review all TokenStreams that have an impl to make sure it's OK.

          I still have my reservations about the posInc=0 stuff (I wish this were cleaner), but I don't see a better way, and this is really a bug we should fix.

          Michael McCandless added a comment -

          New patch, resolving the nocommit and adding a test.

          I think it's ready.

          Michael McCandless added a comment -

          New patch, syncing up to trunk, and fixing various tokenizers that had bugs ... tests pass.

          Robert Muir added a comment -

          Here is an updated patch, fixing all the analyzers and also removing the -1.

          Additionally I fixed BaseTokenStreamTestCase to apply the checkClearAttributes etc. checks before end() too, and fixed things like StandardTokenizer that sometimes act like stop filters (removing too-long tokens) to apply these in end(), in case such tokens happen to appear at the end of a multivalued field.

          There are unrelated problems in facets/: it has lots of TokenStreams that never call clearAttributes. I will open an issue to clean that up; it's not possible to proceed until those TokenStreams are fixed.

          Robert Muir added a comment -

          where does the "positionIncrementGap" come in? because my naive impression is that the end() method is precisely where something like posIncAtt.setPositionIncrement(getPositionIncrementGap()) should be called.

          The positionIncrementGap is not related to this bug, because it's something done separately by IndexWriter. It's always factored in correctly; it's not buggy.

          The "bug" is if a field instance ends with a stopword, that accumulated 'hole' is lost.

          doc.add(new TextField("body", "just a", Field.Store.NO));
          doc.add(new TextField("body", "test of gaps", Field.Store.NO));
          iw.addDocument(doc);
          ...
          PhraseQuery pq = new PhraseQuery();
          pq.add(new Term("body", "just"), 0);
          pq.add(new Term("body", "test"), 2);
          // body:"just ? test"
          assertEquals(1, is.search(pq, 5).totalHits); // FAIL!
          

          So the problem is the first instance of the field loses its hole for "a" (a stopword that was removed),
          only because it ended on a stopword, so incrementToken() returned false and there was no subsequent token
          to apply the 'hole' to.

          This is the same problem as an instance of a field ending with, say, a space character and all the offsets being wrong;
          this is why end() was added, so the indexer can pull 'end of stream' state.
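
          To make the consumer side concrete, here is a rough sketch (fragment only, mirroring the style of the test snippet above; 'analyzer' is a placeholder for any chain that removes trailing tokens) of how a consumer such as the indexer would pull that end-of-stream state once end() also carries the trailing position increment:

          TokenStream ts = analyzer.tokenStream("body", "just a");
          PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
          OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
          ts.reset();
          int position = -1;
          while (ts.incrementToken()) {
            position += posIncAtt.getPositionIncrement();
            // ... index the current token at 'position' ...
          }
          ts.end();
          // with the fix, the trailing hole shows up here instead of being lost:
          int lastPosition = position + posIncAtt.getPositionIncrement();
          int finalOffset = offsetAtt.endOffset();
          // ... the indexer would start the next value of this multivalued field
          //     after lastPosition/finalOffset (plus the position increment gap) ...
          ts.close();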

          Hoss Man added a comment -

          should we do something like clearAttributes() + posIncAtt.setPositionIncrement(0) ?

          where does the "positionIncrementGap" come in? because my naive impression is that the end() method is precisely where something like posIncAtt.setPositionIncrement(getPositionIncrementGap()) should be called.

          Robert Muir added a comment -

          In addition we should also check all Tokenizers to set the correct endOffset (end of stream).

          This is checked by BaseTokenStreamTestCase already. It's just that currently offsetAtt is the only thing that we consume from end(), and all tokenizers effectively "overwrite" it with the correct values. So the analyzers tests already pass; the only buggy one was the built-in keyword tokenizer in StringField.
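
          For reference, the "overwrite" usually looks something like this inside a Tokenizer subclass (a fragment only; 'charsRead' is a hypothetical counter for however the tokenizer tracks consumed input):

          @Override
          public void end() throws IOException {
            super.end();
            // set the final offset to the end of the input, so e.g. a trailing
            // space after the last token is still accounted for
            int finalOffset = correctOffset(charsRead);
            offsetAtt.setOffset(finalOffset, finalOffset);
          }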

          Robert Muir added a comment -

          What about the "off-by-one" though? The problem is that a full clearAttributes() (which seems correct) sets the posIncAtt to a default of 1.

          This makes consuming the TokenStream awkward because you have to subtract 1 in this case. Should we do something like clearAttributes() + posIncAtt.setPositionIncrement(0)?

          Uwe Schindler added a comment -

          +1 to call a full clearAttributes(). After calling end(), we have a new "state" of the TokenStream, and having e.g. the CharTermAttribute or other atts left over from the last token makes no sense. In addition we should also check all Tokenizers to set the correct endOffset (end of stream).

          Robert Muir added a comment -

          Here's a patch. There are some things I don't like about it though.

          Again to explain the situation:

          • Buggy today: we call end() and then read a dirty offsetAttribute; nothing clears atts before end().
          • We need to support multiple token-removing filters in the chain, each applying their posInc just like they do in incrementToken.
          • Because of that, the atts cannot be dirty.

          So in this patch, I just call clearAttributes() in TokenStream.end() by default instead of doing nothing. This works, except it means that when IW consumes this 'final posInc' there is an off-by-one, because posIncAtt's default value is 1. I don't like that.

          Alternatively, we could have TokenStream explicitly set posIncrAtt to 0 in end() instead of clearAttributes()? I'm just wondering if that's any better, really.

          Otherwise the patch is straightforward, with the exception of IW's built-in keyword tokenizer (StringField.java). That one is not actually setting end(); we were just relying upon dirty atts, so that's why I changed it.
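
          For comparison, a sketch (not the committed code) of what the "set posInc to 0 in end()" variant could look like as a default:

          import java.io.IOException;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

          // Sketch only: clear all atts, then pin the position increment to 0 so the
          // consumer can add it directly, without the off-by-one from the default of 1.
          abstract class ZeroPosIncEndTokenStream extends TokenStream {
            @Override
            public void end() throws IOException {
              clearAttributes(); // no dirty atts left over from the last token
              if (hasAttribute(PositionIncrementAttribute.class)) {
                getAttribute(PositionIncrementAttribute.class).setPositionIncrement(0);
              }
            }
          }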


            People

            • Assignee: Michael McCandless
            • Reporter: Robert Muir
            • Votes: 1
            • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved:

                Development