Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.

      This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.

      But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used and the text being analyzed ended in 3 whitespace characters, that information is lost and the next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain and the last N tokens were stop words, the base will be 1 + the endOffset of the last non-stopword token.

      To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fall back to the faulty logic we have today.

      This has come up several times on the users' list.
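      For illustration, here is a minimal sketch of the scenario (the field name, text, and the IndexWriter variable "writer" are made up for this example; the API shown is the 2.x Document/Field API):

        // Two Fieldable instances for the same field, indexed with term vectors + offsets.
        // The first value ends in 3 whitespace chars; WhitespaceTokenizer's last token ends
        // at offset 4, so the old base (endOffset + 1 = 5) misses those trailing chars and
        // the second instance's offsets come out 3 too small.
        Document doc = new Document();
        doc.add(new Field("body", "abcd   ", Field.Store.NO,
                          Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("body", "efgh", Field.Store.NO,
                          Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);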

      Attachments

      1. lucene-1448.patch
        23 kB
        Michael Busch
      2. lucene-1448.patch
        23 kB
        Michael Busch
      3. LUCENE-1448.patch
        22 kB
        Michael McCandless
      4. LUCENE-1448.patch
        15 kB
        Michael McCandless
      5. LUCENE-1448.patch
        8 kB
        Mark Miller
      6. LUCENE-1448.patch
        7 kB
        Michael McCandless


          Activity

          Michael McCandless added a comment -

          First cut patch. It's not ready, though.

           First: this patch only addresses the final offset, but shouldn't we also address the final position? E.g. if StopFilter removes the last few tokens, shouldn't we make it possible to report those skipped positionIncrements? Note that Analyzer does provide a getPositionIncrementGap(String fieldName), which could be used as a workaround if we don't do this. This does seem less important, so I'd be OK with doing nothing about positions for now, but I'd like to hear others' takes.

          Second: I can't figure out how to ask StandardTokenizerImpl for the number of chars it pulled from the Reader. Can someone (who understands JFlex well) help out here?

          Third: I haven't done all Tokenizers; eg, SinkTokenizer will need a new ctor so that it's told the final offset. And I'd like to fix all contrib tokenizers too.

          Fourth: I'm nervous about an off-by-one error here. Here's the diff for DocInverterPerField:

          @@ -161,7 +161,13 @@
                           break;
                         }
                       }
          -            fieldState.offset = offsetEnd+1;
          +
          +            final int finalOffset = stream.getFinalOffset();
          +            if (finalOffset == -1)
          +              fieldState.offset = offsetEnd+1;
          +            else
          +              fieldState.offset += finalOffset;
          +
                     } finally {
                       stream.close();
                     }
          

           What makes me nervous is: our current logic is to set the base for future instances of the same field to the last endOffset, plus 1. Why do we have that plus 1? Logically, I think it should be as if you had simply concatenated the strings together, which would not add an extra +1? So I'm not adding +1 in the getFinalOffset() implementations I've added, but I'm hoping someone can explain why we are adding +1 currently.

           Note that endOffset is normally "exclusive" while startOffset is "inclusive". I.e. if a String starts with "abcd " then WhitespaceTokenizer would report a startOffset of 0 and an endOffset of 4 for that first token.

           Mark Miller added a comment - edited

          You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

           What I can't figure out is how exactly these offsets are supposed to match up... abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars, or that the end is one greater than the end index (like with spans). In either case, it seems that even if you put back the +1, the end offsets are off somehow, because some will have an end of the end index +1, while secondary multi-fields will have an end equal to the end index.

           It would be cool to have this fixed, as it also stymies highlighting with multi-fields.

          edit

           I see. You need that +1 you took out, and you need fields after the first to have +1 more for an end offset. Looks like they are supposed to be end index +1.

          edit2

           Never mind, I am a bad counter. I think you only need the +1 back.

          Show
          Mark Miller added a comment - - edited You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'. What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index. Would be cool to have fixed as this also stymies highlighting with multi-fields. edit I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks they are supposed to be end index +1. edit2 Nm I am a bad counter. I think you only need the +1 back.
          Hide
          Mark Miller added a comment -

          Second: I can't figure out how to ask StandardTokenizerImpl for the number of chars it pulled from the Reader. Can someone (who understands JFlex well) help out here?

           What's wrong with:

           public int getFinalOffset() {
             return scanner.yychar() + scanner.yylength();
           }

          Show
          Mark Miller added a comment - Second: I can't figure out how to ask StandardTokenizerImpl for the number of chars it pulled from the Reader. Can someone (who understands JFlex well) help out here? Whats wrong with? <code> public int getFinalOffset() { return scanner.yychar() + scanner.yylength(); } </code>
          Hide
           Mark Miller added a comment - edited

           Your tests had an off-by-one error.

           I corrected them and added the StandardAnalyzer piece so that both tests pass. (I didn't correctly put the SA piece in the jflex file.)

          Michael McCandless added a comment -

          Attached new patch (changes described below):

          You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

          So logically it's like we silently & forcefully insert a space between
          the Fieldable instances?

          Maybe we should add Analyzer.getOffsetGap(String fieldName), which by
          default would return 1, and we then add that into the offset for
          subsequent field instances?

          But then here's another challenge: for NOT_ANALYZED fields we don't
          add this extra +1. We just add the string length. Hmm.

          OK I added Analyzer.getOffsetGap(Fieldable), and defaulted it to
          return 1 for analyzed fields and 0 for unanalyzed fields.
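
           (For reference, a sketch of what that default could look like; the exact code in the
           patch may differ:)

           public int getOffsetGap(Fieldable field) {
             // analyzed fields: behave as if a single separator char joins the instances;
             // unanalyzed (NOT_ANALYZED) fields: just concatenate, no extra gap
             return field.isTokenized() ? 1 : 0;
           }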

          What's wrong with public int getFinalOffset() { return scanner.yychar() + scanner.yylength(); }

          Does that handle spaces at the end of the text? (Oh it seems like it
          does...I added a test case...hmm).

           I didn't correctly put the SA piece in the jflex file

          I think this change (adding getFinalOffset to StandardTokenizer)
          doesn't need a change to jflex? (It's only if you edit
          StandardTokenizerImpl.java).

           Hmm, another complexity is handling a field instance that produced no
           tokens. Currently, we do not increment the cumulative offset by +1 in
           such cases. But for the position increment gap, we always add the gap
           between fields if any previous field has produced a token. I
           added a couple of test cases for this.

          Also, I fixed a bug in how CharTokenizer was computing its final
          offset.

          Still todo:

          • add test cases to cover NOT_ANALYZED fields
          • fix contrib tokenizers to implement getFinalOffset
          Michael McCandless added a comment -

           OK I added test coverage for NOT_ANALYZED fields, and fixed contrib TokenStreams to implement getFinalOffset.

          All tests, and back-compat tests, pass.

          I think it's ready to commit... I'll wait a day or two.

          (Michael: sorry, I think this will cause some conflicts with the new TokenStream API changes...).

          Michael Busch added a comment -

           First: this patch only addresses the final offset, but shouldn't we also address the final position? E.g. if StopFilter removes the last few tokens, shouldn't we make it possible to report those skipped positionIncrements?

          Hmm now that we have getPositionIncrementGap() and getOffsetGap(), I think it would make sense to also add getFinalPositionIncrement()?

          To fix this, I'd like to add a new getFinalOffset() to TokenStream.

          Could we add this as Attributes using the new API? FinalOffsetAttribute and FinalPositionIncrementAttribute?

          Michael McCandless added a comment -

          Hmm now that we have getPositionIncrementGap() and getOffsetGap(), I think it would make sense to also add getFinalPositionIncrement()?

          We could do that. But how would you implement it? EG StopFilter skips tokens, and (if enabled) already tracks the skippedPositions, so it could return that PLUS whatever its input reports as its getFinalPositionIncrement, I guess?
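
           (Purely as a sketch of that idea; getFinalPositionIncrement() does not exist, and the
           field and method names below are hypothetical:)

           // in a StopFilter-like filter:
           private int skippedPositions;  // stop words skipped since the last emitted token

           public int getFinalPositionIncrement() {
             // trailing stop words we skipped, plus whatever the wrapped stream reports
             return skippedPositions + input.getFinalPositionIncrement();
           }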

          Could we add this as Attributes using the new API? FinalOffsetAttribute and FinalPositionIncrementAttribute?

          Hmm we could do that... but it seems awkward to add new attributes that apply only to ending state of the tokenizer.

          I wonder if instead, w/ the new API, we could simply allow querying of certain attributes (offset, posincr) after incrementToken returns "false"?

          Why don't you commit the new TokenStream API first, and we can iterate on this issue & commit 2nd?

          Michael Busch added a comment -

          Hmm we could do that... but it seems awkward to add new attributes that apply only to ending state of the tokenizer.

           Yeah. Also, you wouldn't want TokenFilters that can buffer tokens to pay the overhead of serializing or cloning those attributes for every token.

          I wonder if instead, w/ the new API, we could simply allow querying of certain attributes (offset, posincr) after incrementToken returns "false"?

          Yeah, maybe we can make the AttributeSource more sophisticated, so that it can distinguish between per-field (instance) and per-token attributes. But as a separate patch, not as part of LUCENE-1422.

          Why don't you commit the new TokenStream API first, and we can iterate on this issue & commit 2nd?

          OK, will do. I think 1422 is ready now.

          Michael McCandless added a comment -

          I'm torn on how to add getFinalOffset()/getFinalPositionIncrement().

          One option is to add a set/getFinalOffset to OffsetAttribute. The
          downside here is it's another int added to that class, that gets
          copied for caching streams yet is only used at the very end.

          Another option is to "define" the API such that when incrementToken()
          returns false, then it has actually advanced to an "end-of-stream
          token". OffsetAttribute.getEndOffset() should return the final
          offset. Since we have not released the new API, we could simply make
          this change (and fix all instances in the core/contrib that use the
          new API accordingly). I think I like this option best.

           Yet another option is to open up "per stream" attrs rather than "per
           token" attrs. This seems like a lot of added complexity. Are there
           other things, besides these two, that would be examples of a "per
           stream" attr?

          Mark Miller added a comment -

           Another option is to "define" the API such that when incrementToken()
           returns false, then it has actually advanced to an "end-of-stream
           token". OffsetAttribute.getEndOffset() should return the final
           offset. Since we have not released the new API, we could simply make
           this change (and fix all instances in the core/contrib that use the
           new API accordingly). I think I like this option best.

          +1. I like this.

          Michael McCandless added a comment -

          OK, me too. I'll move forward with that approach.

          Michael Busch added a comment -

          Another option is to "define" the API such that when incrementToken()
          returns false, then it has actually advanced to an "end-of-stream
          token". OffsetAttribute.getEndOffset() should return the final
          offset. Since we have not released the new API, we could simply make
          this change (and fix all instances in the core/contrib that use the
          new API accordingly). I think I like this option best.

           This adds some "cleaning up" responsibilities to all existing
           TokenFilters out there. So far it is very straightforward to change an
           existing TokenFilter to use the new API. You simply have to (see the
           sketch after this list):

           • add the attributes the filter needs in its constructor
           • change next() to incrementToken(), and change return statements that
             return null to false, and the others to true (or whatever input returns)
           • don't access a Token, but set the data via the appropriate attributes
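
           As a rough sketch of those three steps on a made-up filter (assuming the new
           attribute API from LUCENE-1693; this is not code from any patch here):

           import java.io.IOException;
           import org.apache.lucene.analysis.TokenFilter;
           import org.apache.lucene.analysis.TokenStream;
           import org.apache.lucene.analysis.tokenattributes.TermAttribute;

           public final class MyLowerCaseFilter extends TokenFilter {
             // step 1: get the attributes the filter needs in the constructor
             private final TermAttribute termAtt;

             public MyLowerCaseFilter(TokenStream input) {
               super(input);
               termAtt = (TermAttribute) addAttribute(TermAttribute.class);
             }

             // step 2: next()/next(Token) becomes incrementToken(), returning false instead of null
             public boolean incrementToken() throws IOException {
               if (!input.incrementToken()) {
                 return false;
               }
               // step 3: work on the attributes instead of a Token instance
               final char[] buffer = termAtt.termBuffer();
               final int length = termAtt.termLength();
               for (int i = 0; i < length; i++) {
                 buffer[i] = Character.toLowerCase(buffer[i]);
               }
               return true;
             }
           }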

           But maybe there's a custom filter at the end of the chain that returns
           more tokens even after its input has returned the last one. For example, a
           SynonymExpansionFilter might return a synonym for the last word it
           received from its input before it returns false. In this case it might
           overwrite an endOffset that another filter/stream already set to the
           final endOffset. It needs to cache that value and set it when it
           returns false.

           Also, all filters that currently use an offset now need to know to
           clean up before returning false.

          I'm not saying this is necessarily bad. I also find this approach
          tempting, because it's simple. But it might be a common pitfall for
          bugs?

           What I'd like to work on soon is an efficient way to buffer attributes
           (maybe add methods to Attribute that write into a byte buffer). Then
           attributes can decide which variables need to be serialized and
           which ones don't. In that case we could add a finalOffset to
           OffsetAttribute that does not get serialized/deserialized.

           And possibly it might be worthwhile to have explicit states defined in
           a TokenStream that we can enforce with three methods: start(),
           increment(), end(). Then people would know that if they have to do
           something at the end of a stream, they have to do it in end().

          Michael McCandless added a comment -

           What I'd like to work on soon is an efficient way to buffer attributes
           (maybe add methods to Attribute that write into a byte buffer). Then
           attributes can decide which variables need to be serialized and
           which ones don't. In that case we could add a finalOffset to
           OffsetAttribute that does not get serialized/deserialized.

          I like that (it'd make streams like CachingTokenFilter much more
          efficient). It'd also presumably lead to more efficiently serialized
          token streams.

          But: you'd still need a way in this model to serialize finalOffset, once,
          at the end?

           And possibly it might be worthwhile to have explicit states defined in
           a TokenStream that we can enforce with three methods: start(),
           increment(), end(). Then people would know that if they have to do
           something at the end of a stream, they have to do it in end().

          This also seems good. So end() would be the obvious place to set
          the OffsetAttribute.finalOffset, PositionIncrementAttribute.positionIncrementGap, etc.

          OK I'm gonna assign this one to you, Michael

          Michael Busch added a comment -

          But: you'd still need a way in this model to serialize finalOffset, once,
          at the end?

          Maybe we can introduce an abstract EndOfStreamAttribute and
          FinalOffsetAttribute and FinalPosIncrAttribute that extend EOSA.

           Then in a stream like CachingTokenFilter an AttributeAcceptor can
           be used that doesn't accept attributes of type EOSA in increment().

          In end() it would use an AttributeAcceptor that accepts EOSA atts
          and cache those.

          OK I'm gonna assign this one to you, Michael

          Bummer! Why did I say anything? j/k

          Michael McCandless added a comment -

          See also LUCENE-579 which looks like a dup of this one.

          Michael what's the game plan on this issue? I think your EOSA approach makes sense...

          Michael Busch added a comment -

          I'm currently on vacation visiting my family in Germany till the 11th. I'm planning to work on this as soon as I'm back to get all the TokenStream changes (also LUCENE-1460) ready before 2.9.

          Michael McCandless added a comment -

          Michael are you going to get to this soonish? Else let's push until after 3.0?

          Mark Miller added a comment -

          Will you have time for this Michael?

           It would be great to have this bug fixed for 2.9, but if we have to push it to 3.0 it's not the end of the world. Wasn't it done until that darn new Token API came along?

          Michael Busch added a comment -

           Oh man, what did I do suggesting you as the RM? Now there's another guy chasing me.

          Currently I have to sacrifice some of my already very limited sleep for everything I do on Lucene. After next week I'll have more time. When everything else for 2.9 is done, then I don't think this should block it. Otherwise, I'll try to do it for 2.9.

          Michael Busch added a comment -

           OK I think I have this basically working with the old and new API (including the LUCENE-1693 changes).

           The approach I took is fairly simple; it doesn't require adding a new Attribute. I added the following method to TokenStream:

            /**
             * This method is called by the consumer after the last token has been consumed, 
             * i.e. after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
             * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
             * <p/>
             * This method can be used to perform any end-of-stream operations, such as setting the final
             * offset of a stream. The final offset of a stream might differ from the offset of the last token
             * e.g. in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
             * was used.
             * <p/>
             * 
             * @throws IOException
             */
            public void end() throws IOException {
              // do nothing by default
            }
          

           Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset().
           E.g. in CharTokenizer the implementation looks like this:

            public void end() {
              // set final offset
              int finalOffset = input.correctOffset(offset);
              offsetAtt.setOffset(finalOffset, finalOffset);
            }
          

           I changed DocInverterPerField to call end() after the stream is fully consumed, and to use what
           offsetAttribute.endOffset() returns as the final offset.
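
           (In consumer terms, the change described above amounts to roughly the following; this is a
           sketch, not the literal patch, with offsetAttribute being the OffsetAttribute obtained from
           the stream:)

           while (stream.incrementToken()) {
             // ... invert the token as before ...
           }
           stream.end();                                    // let the stream set its end-of-stream state
           int finalOffset = offsetAttribute.endOffset();   // used as the final offset for this field instance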

          I also added all new tests from Mike's latest patch.
           All unit tests, including the new ones, pass, and so does test-tag.

          I'm not posting a patch yet, because this depends on 1693.

          Mike, Uwe, others: could you please review if this approach makes sense?

          Michael Busch added a comment -

          Hmm one thing I haven't done yet is changing Tee/Sink and CachingTokenFilter.

          But it should be simple: CachingTokenFilter.end() should call input.end() when
          it is called for the first time and store the captured state locally as finalState.
          Then whenever CachingTokenFilter.end() is called again, it just restores the
          finalState.

          For Tee/Sink it should work similarly: The tee just puts a finalState into the
          sink(s) the first time end() is called. And when end() of a sink is called it
          restores the finalState.

          This should work?
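
           (A sketch of that idea for CachingTokenFilter, assuming the captureState()/restoreState()
           methods of the new AttributeSource API; the actual patch may differ:)

           private AttributeSource.State finalState;  // cached end-of-stream state

           public void end() throws IOException {
             if (finalState == null) {
               input.end();                  // first call: let the wrapped stream set its end state
               finalState = captureState();  // remember it
             } else {
               restoreState(finalState);     // subsequent calls just restore the cached end state
             }
           }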

          Michael Busch added a comment -

          Hmm another reason why I don't like two Tees feeding one Sink:

          What is the finalOffset and finalState then?

           Uwe Schindler added a comment - edited

           This is not the only problem with multiple Tees: the offsets are also completely mixed together, especially if the two Tees feed into the Sink at the same time (not after each other). In my opinion, the last call to end() should be cached by the Sink as the end state (so if two Tees add an end state to the Sink, the second one overwrites the first one).

          Michael McCandless added a comment -

          This approach (adding end()) sounds good!

          Michael Busch added a comment -

          Cool, I will take this approach and submit a patch as soon as LUCENE-1693 is committed.

          Michael Busch added a comment -

          I changed Mike's last patch so that it uses the new end() approach that I explained above.

          It also applies cleanly on current trunk (now that LUCENE-1693 is committed).

          I also made the changes to CachingTokenFilter and TeeSinkTokenFilter that I mentioned above and added two more testcases to TestIndexWriter.

          All tests pass.

          Michael Busch added a comment -

          Note that my latest patch only contains fixes for the core TokenStreams.

          I'll open a separate issue to implement end() for the contrib TokenStreams, which we can commit after LUCENE-1460 is resolved.

          Michael Busch added a comment -
          • updated to trunk
          • made end() final in all implementing TokenStreams

          I'm planning to commit this soon.

          Michael Busch added a comment -

          Committed revision 797715.

          Thanks Mike & Mark for the original patch!


            People

             • Assignee:
               Michael Busch
             • Reporter:
               Michael McCandless
             • Votes:
               2
             • Watchers:
               1
