Lucene - Core
  LUCENE-1801

Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a followup for LUCENE-1796:

      Token.clear() used to be called by the consumer... but then it was switched to the producer here: LUCENE-1101
      I don't know if all of the Tokenizers in Lucene were ever changed, but in any case it looks like at least some of these bugs were introduced with the switch to the attribute API. For example, StandardTokenizer used to clear its reusableToken, and now it doesn't.

      As an alternative to changing all core/contrib Tokenizers to call clearAttributes() first, we could do this in the indexer, but that would add overhead for old token streams that already clear their reusable token themselves. This issue should also update the Javadocs to state clearly in Tokenizer.java that the source TokenStream (normally the Tokenizer) should clear all attributes. If it does not, and e.g. a TokenFilter changes the positionIncrement to 0 but does not change it back to 1, the TokenStream would stay at 0. If the TokenFilter instead called PositionIncrementAttribute.clear() (because it is responsible for the change), it could also break the TokenStream, because clear() is a general method for the whole attribute instance. If e.g. Token is used as the AttributeImpl, a call to clear() would also clear offsets and termLength, which is not wanted. So the source of the tokenization should reset the attributes to their default values.

      LUCENE-1796 removed the iterator creation cost, so clearAttributes() should run fast. It is still an additional cost during tokenization, because it was not done consistently before, so it causes a small speed degradation, but this has nothing to do with the new TokenStream API.
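
The stale-positionIncrement scenario described above can be sketched in a few lines. This is a minimal, self-contained model and not Lucene code: `SketchAttributes`, `SketchTokenizer`, `SketchStackingFilter` and `ClearAttributesSketch` are hypothetical stand-ins, and the only behavior taken from the issue is that the token source resets all attributes to their defaults before producing each token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a shared attribute source; NOT Lucene's AttributeSource.
class SketchAttributes {
    String term = "";
    int positionIncrement = 1; // assumed default value
    int startOffset = 0;
    int endOffset = 0;

    // Resets every attribute to its default, playing the role of clearAttributes().
    void clearAttributes() {
        term = "";
        positionIncrement = 1;
        startOffset = 0;
        endOffset = 0;
    }
}

// Hypothetical whitespace tokenizer: the source of tokens.
class SketchTokenizer {
    final SketchAttributes atts = new SketchAttributes();
    private final String text;
    private final boolean clearFirst;
    private int pos = 0;

    SketchTokenizer(String text, boolean clearFirst) {
        this.text = text;
        this.clearFirst = clearFirst;
    }

    boolean incrementToken() {
        if (clearFirst) {
            atts.clearAttributes(); // the fix discussed in this issue
        }
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
        if (pos >= text.length()) return false;
        int start = pos;
        while (pos < text.length() && text.charAt(pos) != ' ') pos++;
        atts.term = text.substring(start, pos);
        atts.startOffset = start;
        atts.endOffset = pos;
        // positionIncrement is deliberately left alone: the tokenizer relies on the default.
        return true;
    }
}

// Hypothetical filter that sets positionIncrement to 0 for one term and never sets it back.
class SketchStackingFilter {
    private final SketchTokenizer input;
    private final String stackedTerm;

    SketchStackingFilter(SketchTokenizer input, String stackedTerm) {
        this.input = input;
        this.stackedTerm = stackedTerm;
    }

    boolean incrementToken() {
        if (!input.incrementToken()) return false;
        if (input.atts.term.equals(stackedTerm)) {
            input.atts.positionIncrement = 0;
        }
        return true;
    }
}

class ClearAttributesSketch {
    // Collects the positionIncrement seen by the consumer for each token.
    static List<Integer> increments(String text, String stackedTerm, boolean clearFirst) {
        SketchTokenizer tokenizer = new SketchTokenizer(text, clearFirst);
        SketchStackingFilter filter = new SketchStackingFilter(tokenizer, stackedTerm);
        List<Integer> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(tokenizer.atts.positionIncrement);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(increments("a b c", "b", false)); // [1, 0, 0]: the 0 leaks into "c"
        System.out.println(increments("a b c", "b", true));  // [1, 0, 1]: the source reset it
    }
}
```

Without the reset, the filter's change to "b" leaks into "c"; with it, each token starts from defaults and the filter does not need to undo its own change.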

      1. LUCENE-1801.patch
        13 kB
        Uwe Schindler
      2. LUCENE-1801.patch
        10 kB
        Robert Muir
      3. LUCENE-1801.patch
        9 kB
        Uwe Schindler

          Activity

          Uwe Schindler added a comment -

          There is an additional problem (mentioned above):
          clear() is currently defined in the Attribute interface, but it is really more the task of AttributeImpl. The problem is that, e.g., a TokenFilter that adds a new type of attribute must be able to clear it. On the other hand, Token, which implements all attributes, could also provide clear(). This creates an inconsistency: clear() should be removed from the general Attribute interface and moved down to each separate interface under a separate name. E.g. somebody who calls TermAttribute.clear() thinking that it only clears the term attribute may be wrong if the actual implementation is Token, which clears everything.

          The biggest problem is backwards compatibility. Lucene 2.4.1 states in JavaDocs of Token: "public void clear(): Resets the term text, payload, flags, and positionIncrement to default. Other fields such as startOffset, endOffset and the token type are not reset since they are normally overwritten by the tokenizer."

          I would propose to change the whole thing:

          • Remove clear() from the superinterface Attribute.
          • Let Token do what it used to do (as of 2.4.1).
          • Define clearTerm(), clearPositionIncrement(), clearFlags() methods separately for each attribute type (Token/TokenWrapper must implement them).
          • clear() is only defined in AttributeImpl and clears the whole implementation. AttributeSource.clearAttributes calls this method. Current code calling clear() on the attribute interface will fail to compile, but these are the places that must be fixed.

          The problem of backwards compatibility can be solved the following way:

          • TokenWrapper clears the complete delegate Token to be consistent with AttributeSource.clearAttributes(): a complete reset to default values, including offsets. A problem only occurs if somebody registers Token itself (not the wrapper around it) as an Attribute; then clearAttributes() would not be consistent with the rest, as it would fail to clear the offset.

          How will we handle the clearAttributes() call in Tokenizers then? Should we only clear those attributes we work on in a Tokenizer/TokenFilter?
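
The concern about clear() on the general interface can be illustrated with a small self-contained sketch. None of these types are Lucene's: `NarrowAttribute`, `NarrowTermAttribute`, `NarrowOffsetAttribute` and `AllInOneToken` are hypothetical, and the sketch assumes, as the comment above does, that the all-in-one implementation's clear() resets everything.

```java
// Hypothetical general interface that exposes clear() to every attribute view.
interface NarrowAttribute {
    void clear();
}

// Hypothetical term-only view.
interface NarrowTermAttribute extends NarrowAttribute {
    String term();
    void setTerm(String term);
}

// Hypothetical offset-only view.
interface NarrowOffsetAttribute extends NarrowAttribute {
    int startOffset();
    int endOffset();
    void setOffset(int start, int end);
}

// Token-like class: one instance backs all attribute views.
class AllInOneToken implements NarrowTermAttribute, NarrowOffsetAttribute {
    private String term = "";
    private int start = 0;
    private int end = 0;

    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
    public int startOffset() { return start; }
    public int endOffset() { return end; }
    public void setOffset(int start, int end) { this.start = start; this.end = end; }

    // Assumption for this sketch: clear() resets the whole instance.
    public void clear() {
        term = "";
        start = 0;
        end = 0;
    }
}

class InterfaceClearSketch {
    // A caller holding only the term view clears "the term" and wipes the offsets too.
    static int[] offsetsAfterTermClear() {
        AllInOneToken token = new AllInOneToken();
        token.setTerm("lucene");
        token.setOffset(4, 10);
        NarrowTermAttribute termAtt = token;
        termAtt.clear(); // looks like a term-only operation at the call site
        return new int[] { token.startOffset(), token.endOffset() };
    }

    public static void main(String[] args) {
        int[] offsets = offsetsAfterTermClear();
        System.out.println(offsets[0] + "," + offsets[1]); // prints 0,0
    }
}
```

Moving clear() off the shared interface, as proposed, would turn the misleading call into a compile error at exactly the places that need review.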

          Uwe Schindler added a comment -

          Any comments here? I will be unavailable until the weekend, so please make some suggestions, especially about the clear() problem. The first part, with clearAttributes(), is easy: it's just adding some code. The second one is refactoring.
          Michael B., what do you think?

          Uwe Schindler added a comment -

          Attached is a patch that implements clearAttributes() in all Tokenizers and other sources of Tokens. It also removes clear() from the basic Attribute interface.

          I will commit in a day or two if nobody objects.

          Robert Muir added a comment -

          Uwe, there is also a tokenizer in contrib/memory inside PatternAnalyzer.

          I only mention this because I am trying to hunt down all analyzers/tokenstreams to check reuse/reset at the moment.

          Michael Busch added a comment -

          Sorry, Uwe. I'm in meetings.

          I'll look into this tonight!

          Uwe Schindler added a comment -

          Thanks Robert!

          Can you look into this special "Tokenizer" for correct "initialization" according to Yonik's comments?

          Robert Muir added a comment -

          Uwe, I can supply an updated version of your patch if you want, since I am already staring at it!

          In order to support reuse, it will need to be changed a little bit, but for now we can simply resolve the clearAttributes issue.

          Robert Muir added a comment -

          Updated patch, with clearAttributes() for the secret and super-secret tokenizers inside memory/PatternAnalyzer.

          Robert Muir added a comment -

          Uwe, sorry, I see there is an encoding problem with my patch file... I will supply another.

          Robert Muir added a comment -

          Sorry for the bad encoding issue!

          Uwe Schindler added a comment -

          No problem, thanks for the patch. I was not aware that there were two inner classes. They are not Tokenizers, though, but TokenStreams. As they are the source of tokens, they must call clearAttributes(); you are right, thanks!

          If you find another one, please post a patch again, maybe I forgot more of them. I will go to bed now.

          Uwe

          Robert Muir added a comment -

          Uwe, get some rest.

          I will double-check later and see. Personally I do not like things that behave as Tokenizer but are TokenStream, not Tokenizer... this is another issue for another day!

          Yonik Seeley added a comment -

          "As they are the source of tokens, they must call clearAttributes()"

          Note: I had assumed that restoreState() would be enough if there was saved state being restored... but after checking the docs, it's not.

          Makes me wonder if there could be a more efficient clearAndRestoreState(State) that clears attributes not in the state?
          Patch for another day I suppose...
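
The clearAndRestoreState(State) idea floated here could look roughly like the following. This is only a sketch of the suggestion under stated assumptions: `StateSketchSource` is a hypothetical map-based model rather than Lucene's AttributeSource, a saved state is modeled as a plain map that contains only registered attribute names, and the `writes` counter exists solely to compare the two approaches.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: attributes are named values with defaults.
class StateSketchSource {
    private final Map<String, Object> defaults = new LinkedHashMap<>();
    private final Map<String, Object> values = new LinkedHashMap<>();
    int writes = 0; // counts attribute writes, for comparison only

    void addAttribute(String name, Object defaultValue) {
        defaults.put(name, defaultValue);
        values.put(name, defaultValue);
    }

    void set(String name, Object value) {
        values.put(name, value);
    }

    Map<String, Object> snapshot() {
        return new LinkedHashMap<>(values);
    }

    // Resets every attribute to its default.
    void clearAttributes() {
        for (Map.Entry<String, Object> e : defaults.entrySet()) {
            values.put(e.getKey(), e.getValue());
            writes++;
        }
    }

    // Overwrites only the attributes present in the saved state.
    void restoreState(Map<String, Object> state) {
        for (Map.Entry<String, Object> e : state.entrySet()) {
            values.put(e.getKey(), e.getValue());
            writes++;
        }
    }

    // Hypothetical combined operation: one pass, one write per attribute.
    void clearAndRestoreState(Map<String, Object> state) {
        for (Map.Entry<String, Object> e : defaults.entrySet()) {
            String name = e.getKey();
            values.put(name, state.containsKey(name) ? state.get(name) : e.getValue());
            writes++;
        }
    }
}

class ClearAndRestoreSketch {
    // Builds a source whose attributes all hold stale, non-default values.
    static StateSketchSource dirtySource() {
        StateSketchSource s = new StateSketchSource();
        s.addAttribute("term", "");
        s.addAttribute("positionIncrement", 1);
        s.addAttribute("type", "word");
        s.set("term", "stale");
        s.set("positionIncrement", 0);
        s.set("type", "synonym");
        return s;
    }

    public static void main(String[] args) {
        Map<String, Object> state = Collections.<String, Object>singletonMap("term", "saved");

        StateSketchSource twoStep = dirtySource();
        twoStep.clearAttributes();
        twoStep.restoreState(state);

        StateSketchSource combined = dirtySource();
        combined.clearAndRestoreState(state);

        System.out.println(twoStep.snapshot().equals(combined.snapshot())); // true
        System.out.println(twoStep.writes + " vs " + combined.writes);      // 4 vs 3
    }
}
```

In this model both paths end in the same state, and the combined call skips the redundant reset of attributes that the saved state overwrites anyway. Whether that saving matters in practice would have to be measured.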

          Michael Busch added a comment -

          Patch looks good, Uwe!

          When I change Token.clear() to also set the offsets to 0 and the type to DEFAULT_TYPE, 'test-core', 'test-contrib' and 'test-tag' all still pass.
          I think we could make that change and add a comment to the backwards-compatibility section of CHANGES.txt? I think it is the right behavior to reset everything in Tokenizer.

          Also the comment in Token.clear() suggests that the only reason offset and type are not cleared is that tokenizers usually overwrite them anyways; so we're not changing the suggested behavior, and I doubt that people are really relying on the fact that offsets and type are currently not cleared?

          So in summary, if we:

          • change all tokenizers to call clearAttributes() first in incrementToken(),
          • remove clear() from Attribute and leave it in AttributeImpl,
          • change Token.clear() to reset all members and add a comment about that in CHANGES.txt,

          then everything seems good, or is there still a problem that I'm missing?

          Uwe Schindler added a comment -

          Final patch with Michael's suggestion.

          Please note: I also changed Token.clearNoTermBuffer() to work identically (please tell me if I should revert this). This method is not used anywhere inside Lucene, but I want to make it work consistently, too.

          I will commit at the end of the day.

          Michael Busch added a comment -

          Patch looks good, Uwe!

          "I will commit at the end of the day."

          +1.

          Uwe Schindler added a comment -

          Committed revision 804392. Thanks also to Michael and Robert!


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler