Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: Patch Available

      Description

      This is a very early version of the new TokenStream API that
      we started to discuss here:

      http://www.gossamer-threads.com/lists/lucene/java-dev/66227

      This implementation is a bit different from what I initially
      proposed in the thread above. I introduced a new class called
      AttributedToken, which contains the same termBuffer logic
      from Token. In addition it has a lazily-initialized map of
      Class<? extends Attribute> -> Attribute. Attribute is also a
      new class in a new package, plus several implementations like
      PositionIncrementAttribute, PayloadAttribute, etc.

      Similar to my initial proposal is the prototypeToken() method
      which the consumer (e. g. DocumentsWriter) needs to call.
      The token is created by the tokenizer at the end of the chain
      and pushed through all filters to the end consumer. The
      tokenizer and all filters can add Attributes to the
      token and can keep references to the actual types of the
      attributes that they need to read or modify. This way, when
      boolean nextToken() is called, no casting is necessary.
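
      To make this concrete, here is a minimal, self-contained sketch of the
      idea (this is not the actual patch code; everything beyond the class and
      method names mentioned above is an assumption made for illustration):

      import java.util.HashMap;
      import java.util.Map;

      // Sketch only. AttributedToken holds a lazily-initialized map of
      // Class<? extends Attribute> -> Attribute, and a filter keeps a typed
      // reference so that nextToken() needs no casting.
      interface Attribute {
      }

      class PositionIncrementAttribute implements Attribute {
        private int positionIncrement = 1;
        public void setPositionIncrement(int posIncr) { this.positionIncrement = posIncr; }
        public int getPositionIncrement() { return positionIncrement; }
      }

      class AttributedToken {
        private Map<Class<? extends Attribute>, Attribute> attributes;

        public void addAttribute(Attribute att) {
          if (attributes == null) {
            attributes = new HashMap<Class<? extends Attribute>, Attribute>();
          }
          attributes.put(att.getClass(), att);
        }

        @SuppressWarnings("unchecked")
        public <T extends Attribute> T getAttribute(Class<T> attClass) {
          return attributes == null ? null : (T) attributes.get(attClass);
        }
      }

      class HypotheticalFilter {
        // typed reference, kept when the prototype token is pushed through
        private PositionIncrementAttribute posIncrAtt;

        public AttributedToken prototypeToken(AttributedToken token) {
          posIncrAtt = new PositionIncrementAttribute();
          token.addAttribute(posIncrAtt);
          return token;
        }

        public boolean nextToken() {
          // read or modify the attribute directly, no casting necessary
          posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + 1);
          return true;
        }
      }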

      I added a class called TestNewTokenStreamAPI which is not
      really a test case yet, but has a static demo() method, which
      demonstrates how to use the new API.

      The reason to not merge Token and TokenStream into one class
      is that we might have caching (or tee/sink) filters in the
      chain that might want to store cloned copies of the tokens
      in a cache. I added a new class NewCachingTokenStream that
      shows how such a class could work. I also implemented a deep
      clone method in AttributedToken and a
      copyFrom(AttributedToken) method, which is needed for the
      caching. Both methods have to iterate over the list of
      attributes. The Attribute subclasses themselves also have a
      copyFrom(Attribute) method, which unfortunately has to down-
      cast to the actual type. I first thought that might be very
      inefficient, but it's not so bad. If you add all of the
      Attributes to the AttributedToken that our old Token class
      had (like offsets, payload, posIncr), then the performance
      of the caching is somewhat slower (~40%). However, if you
      add fewer attributes, because not all might be needed, then
      the performance is even slightly faster than with the old API.
      Also the new API is flexible enough so that someone could
      implement a custom caching filter that knows all attributes
      the token can have, then the caching should be just as
      fast as with the old API.
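
      As a rough sketch of this copying (again using the same kind of map as
      in the sketch above; everything beyond the names mentioned in this
      description is an assumption):

      import java.util.HashMap;
      import java.util.Map;

      // Sketch only: the per-attribute copyFrom(Attribute) has to down-cast
      // to its concrete type, and the token-level copyFrom(AttributedToken)
      // has to iterate over the attribute map.
      interface Attribute {
        void copyFrom(Attribute other);
      }

      class PayloadAttribute implements Attribute {
        byte[] payload;

        public void copyFrom(Attribute other) {
          // unfortunately has to down-cast to the actual type
          this.payload = ((PayloadAttribute) other).payload;
        }
      }

      class AttributedToken {
        final Map<Class<? extends Attribute>, Attribute> attributes =
            new HashMap<Class<? extends Attribute>, Attribute>();

        public void copyFrom(AttributedToken other) {
          // both clone() and copyFrom() have to iterate over the attributes
          for (Map.Entry<Class<? extends Attribute>, Attribute> e
              : other.attributes.entrySet()) {
            Attribute target = attributes.get(e.getKey());
            if (target != null) {
              target.copyFrom(e.getValue());
            }
          }
        }
      }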

      This patch is not nearly ready; there are lots of things
      missing:

      • unit tests
      • change DocumentsWriter to use new API
        (in backwards-compatible fashion)
      • patch is currently Java 1.5; need to change before
        committing to 2.9
      • all TokenStreams and -Filters should be changed to use
        new API
      • javadocs incorrect or missing
      • hashCode() and equals() methods missing in Attributes and
        AttributedToken

      I wanted to submit it already for brave people to give me
      early feedback before I spend more time working on this.

      1. lucene-1422.patch
        33 kB
        Michael Busch
      2. lucene-1422.take2.patch
        45 kB
        Michael Busch
      3. lucene-1422.take3.patch
        79 kB
        Michael Busch
      4. lucene-1422.take3.patch
        62 kB
        Michael Busch
      5. lucene-1422-take4.patch
        155 kB
        Michael Busch
      6. lucene-1422-take5.patch
        207 kB
        Michael Busch
      7. lucene-1422-take6.patch
        225 kB
        Michael Busch


          Activity

          Doug Cutting added a comment -

          prototypeToken() and nextToken() seem not the most descriptive names here. Perhaps getToken() and incrementToken() would be better? Also, would it be better to make the prototypeToken/getToken implementations idempotent? This might then look like, e.g.:

          public AttributedToken getToken() throws IOException {
            if (token == null) {
              token = input.getToken();
              positionIncrement = new PositionIncrementAttribute();
              token.addAttribute(positionIncrement);
            }
            return token;
          }
          

          I'm also not fond of the name AttributedToken, but don't yet have a better suggestion...

          Michael Busch added a comment -

          Thanks for your suggestions, Doug. It makes perfect sense to make getToken() idempotent.
          Also, addAttribute() should be idempotent, because a Token can have only one instance
          of an attribute.

          I changed prototypeToken() to getToken(), nextToken() to incrementToken() and actually
          added the attribute logic to Token itself. That has some advantages, but also the
          following disadvantage. If people want to use the new API before 3.0, i. e. before the
          deprecated members of Token have been removed, and they want to use something like the
          CachingTokenFilter or Tee/Sink-TokenFilter, then caching is more expensive. The reason
          is that Token itself has members for positionIncrement, offsets, etc. but you then also
          have to add the appropriate attributes to Token to use the new API. But I think this
          drawback would be acceptable?

          I also changed the way to add attributes to a Token:

          protected final void addTokenAttributes() {
            posIncrAtt = reusableToken.addAttribute(PositionIncrementAttribute.class);
          }
          

          The addTokenAttributes() method belongs to TokenStream and is called from getToken()
          only when a new Token instance was created, i. e. in its first call.

          Note the signature of Token.addAttribute:

          public <T extends Attribute> T addAttribute(Class<T> attClass);
          

          Now you don't pass in an actual instance of *Attribute, but its class. The method will
          then create a new instance via reflection. This approach makes the addAttribute()
          method itself idempotent.
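
          For illustration, a rough sketch of how such an idempotent, reflection-based
          addAttribute(Class) could look (only the signature above comes from the patch;
          the nested Attribute interface, the backing map and the error handling are
          assumptions):

          import java.util.HashMap;
          import java.util.Map;

          // Sketch only, not the actual patch code.
          class AttributeHolderSketch {
            interface Attribute {}

            private final Map<Class<? extends Attribute>, Attribute> attributes =
                new HashMap<Class<? extends Attribute>, Attribute>();

            public <T extends Attribute> T addAttribute(Class<T> attClass) {
              T att = attClass.cast(attributes.get(attClass));
              if (att == null) {
                try {
                  // first call for this class: create the instance via reflection
                  // (requires a public no-arg constructor)
                  att = attClass.newInstance();
                } catch (Exception e) {
                  throw new IllegalArgumentException(
                      "Could not instantiate " + attClass.getName(), e);
                }
                attributes.put(attClass, att);
              }
              // later calls return the existing instance, making the method idempotent
              return att;
            }
          }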

          I changed all core tokenizers and filters to have an implementation of the new API.

          For backwards-compatibility I added a (deprecated) static setUseNewAPI() method to
          TokenStream. I also changed the DocumentsWriter to use the new API when
          useNewAPI==true.

          I still have to do several things, including javadocs, testcases, hashcode(), etc.

          Michael Busch added a comment -

          I added several things in this new patch:

          • hashCode() and equals() now incorporate the attributes
          • patch compiles against Java 1.4
          • all core tests pass with and without the new API turned
            on (via TokenStream.setUseNewAPI(true))
          • Added setToken() method to InvertedDocConsumerPerField
            and TermsHashConsumerPerField and updated the
            implementing classes. I actually have a question here,
            because I don't know these classes very well yet. Would
            it be better to add the Token to the DocInverter.FieldInvertState?
            I think I also have to review LUCENE-1426 to see whether these
            changes conflict (I think LUCENE-1426 should be committed
            first?)

          Outstanding:

          • dedicated junits for new APIs, even though the existing tests
            already cover a lot when setUseNewAPI(true)
          • javadocs
          • contrib streams and filters
          Michael McCandless added a comment -

          Michael I think you left out oal.analysis.tokenattributes.* from your patch? (I'm trying to compile things).

          Michael McCandless added a comment -

          Would it be better to add the Token to the DocInverter.FieldInvertState?

          I think it would be slightly better, since it'd save the recursive method calls to setToken mirroring what the "add(Token)" is about to do anyway, but the problem is the token can be different for each Field being added. E.g., I could have a doc with 10 instances of field "foo", whereby each instance is doing a different analysis and so the reused token may change. But the InvertedDocConsumer API doesn't have a start() method for each instance of a field, and so there's no clean place for each consumer to look up the attribute(s) it needs to use. Maybe you could add such a start method, and then switch to putting the Token into FieldInvertState?

          I think I also have to review LUCENE-1426 to see if these
          changes are not in conflict ( I think 1426 should be committed
          first?)

          OK I will commit LUCENE-1426 first. I think mostly there won't be conflicts, except in FreqProxTermsWriterPerThread which should be minor.

          Michael Busch added a comment -

          Oops, sorry, this should work now.

          Another question: Is it correct that the DocInverterPerField
          only takes into account the endOffset(), not the startOffset()?

          Michael McCandless added a comment -

          Another question: Is it correct that the DocInverterPerField
          only takes into account the endOffset(), not the startOffset()?

          I think it's correct, if rather confusing: DocInverterPerField tracks endOffset only for shifting the offsets of Tokens when a document has more than one Fieldable instance per textual field. (Other consumers, like FreqProxTermsWriter do need to pay attention to both the start & end offset).

          Michael Busch added a comment -

          I think it would be slightly better, since it'd save the recursive method calls to setToken mirroring what the "add(Token)" is about to do anyway, but the problem is the token can be different for each Field being added. E.g., I could have a doc with 10 instances of field "foo", whereby each instance is doing a different analysis and so the reused token may change. But the InvertedDocConsumer API doesn't have a start() method for each instance of a field, and so there's no clean place for each consumer to look up the attribute(s) it needs to use. Maybe you could add such a start method, and then switch to putting the Token into FieldInvertState?

          Yeah I agree, that's better. I should probably add a start(Fieldable) method? I currently don't need to access the Fieldable instance in the start method, but it can't hurt to pass it along?

          Michael McCandless added a comment -

          I should probably add a start(Fieldable) method? I currently don't need to access the Fieldable instance in the start method, but it can't hurt to pass it along?

          I think that's the right approach!

          Michael Busch added a comment -

          I think that's the right approach!

          OK done. Works great...

          Another question. Currently the members in Token, like positionIncrement are deprecated with a comment that they will be made private. We should be able to do so in the next release, i. e. 2.9, right? It's not a public or protected API, so there shouldn't be the need to wait for 3.0 based on our backwards-compatibility policy. Or did I miss a discussion behind this, maybe because Token is a class that many people extend in the o.a.l.a package and we agreed to make an exception here?

          The reason why I'm asking is Grant's suggestion of putting all default Attributes into Token and having methods like getPositionIncrement() return the value from the corresponding PositionIncrementAttribute. It won't be possible to do that if I can't remove those members, unless I subclass Token or accept a performance hit by doing something like this, for example:

          public int getPositionIncrement() {
            if (this.positionIncrementAttribute != null) {
              return this.positionIncrementAttribute.getPositionIncrement();
            } else {
              return this.positionIncrement;
            }
          }
          

          Seems inefficient and messy.

          So my question is if it's ok if I remove the deprecated members with this patch in the 2.9 release? Thoughts?

          DM Smith added a comment -

          Another question. Currently the members in Token, like positionIncrement are deprecated with a comment that they will be made private.

          These members were being directly accessed in various contribs. To me that makes them part of the published API, whether intentional or not. All of the code in Lucene core and contrib now uses accessors.

          I marked them as deprecated so that people would have time to clean up their code. My assumption was that even though it is package protected, someone might create their own "contrib" with their Token subclass in o.a.l.analysis. I didn't think the backward-compatibility policy discussed package-protected members, and I wasn't eager to open that discussion.

          Strictly speaking, it does break backward compatibility.

          But, it should have never had package protected members. For precisely this reason.

          Michael Busch added a comment -

          Strictly speaking, it does break backward compatibility.

          Yes I agree. But my take here is that the package-private methods are expert methods for which we don't have to guarantee backwards-compatibility the same way we do for public and protected APIs (i. e. only break compatibility in a version change X.Y->(X+1).0). Of course we have to update all contribs.

          Grant Ingersoll added a comment -

          Yes I agree. But my take here is that the package-private methods are expert methods for which we don't have to guarantee backwards-compatibility the same way we do for public and protected APIs (i. e. only break compatibility in a version change X.Y->(X+1).0). Of course we have to update all contribs.

          I don't know about that. I don't think it has been decided one way or the other. At one extreme, the policy says: "That's to say, any code developed against X.0 should continue to run without alteration against all X.N releases. " Since, it is perfectly legal to access package methods by declaring oneself to be part of that package, and, since we don't sign our jars to prevent such a move, I can see that others may feel free to use them when they deem it beneficial. On the other hand, convention/good programming style suggests it's not good to write programs that rely on package level APIs in libraries, so caveat emptor.

          I do note that we broke Luke once before when we changed something that was package scoped. Maybe a shame on us, maybe a shame on Luke, I don't know.

          That being said, I doubt a lot of people are doing this, so...
          I'd suggest a vote on it on dev if breaking it is decided on, like we did w/ Fieldable and that we explicitly state what is happening in CHANGES, etc.

          Said vote could either be specific to this problem, or it could attempt to address the question of whether package scoped things are subject to the back-compat. policy.

          Michael Busch added a comment -

          Said vote could either be specific to this problem, or it could attempt to address the question of whether package scoped things are subject to the back-compat. policy.

          Agreed. I'd prefer the latter. I will call a vote on java-dev later today...

          Michael Busch added a comment -

          Because multiple people mentioned it would be better to merge
          TokenStream and Token into one class, I thought more about it
          and now I think I prefer that approach too. I implemented it
          and added a class TokenStreamState to capture a state of a stream
          which can be used for buffering of Tokens. The performance of the
          CachingTokenFilter is lower than before if all attributes that the
          Token had are used, but slightly better if fewer are used (which
          was previously not possible). I also had some ideas of making
          buffering of Tokens perform better, but this patch is now
          already pretty long, so I decided to add better buffering with
          a separate issue at a later point. Here is a summary of my
          changes:

          Changes in the analysis package:

          • Added the tokenattributes subpackage, known from the previous
            patches
          • Added the abstract class AttributeSource that owns the
            attribute map and appropriate methods to get and add the
            attributes. TokenStream now extends AttributeSource.
          • Deprecated the Token class.
          • Added TokenStream#start(), TokenStream#initialize() and
            TokenStream#incrementToken(). start() must be called by a
            consumer before incrementToken() is called the first time.
            start() calls initialize(), which can be used by TokenStreams
            and TokenFilters to add or get attributes. I separated
            start() and initialize() to enforce in TokenFilters that
            input.start() is called. Otherwise it would be a pitfall for
            bugs (it happened to me before I added initialize()).
          • Added another subclass of AttributeSource called
            TokenStreamState which can be used to capture a current state
            of a TokenStream, e. g. for buffering purposes. I changed the
            CachingTokenFilter and Sink/Tee-TokenFilter to make use of
            this new class (see the sketch at the end of this comment).
          • Changed all core TokenStreams and TokenFilters to implement
            the new methods and deprecated the next(Token) methods, but
            left them for backwards compatibility.

          Changes in the indexer package:

          • Changed DocInverterPerField.processFields to use the new API
            if TokenStream.useNewAPI() is set to true. I also added an
            inner class to DocInverterPerThread called
            BackwardsCompatibilityStream that allows me to set a Token
            and all Attributes just return the values from the token.
            That allows me to change all consumers in the indexer
            package to not use Token anymore at all, but only
            TokenStream, without a performance hit.
          • Added start(Fieldable) method to InvertedDocConsumerPerField
            and TermsHashConsumerPerField that is called to notify the
            consumers that one field instance is now going to be
            processed, so that they can get the attribute references
            from DocInverter.FieldInvertState.attributeSource. Also
            changed the signature of the add() method of the above
            mentioned classes to not take a Token anymore.

          Changes in queryparser package:

          • Changed QueryParser so that it uses a CachingTokenFilter
            instead of a List to buffer tokens.

          Testcases:

          • Added class TokenStreamTestUtils to the analysis test
            package which contains two inner helper classes:
            BackwardsCompatibleStream and BackwardsCompatibleFilter.
            Both override TokenStream#next(Token) and call
            incrementToken() and then copy all attribute values to the
            Token to be returned to the caller of next(Token). That
            simplifies it to make existing tests run in old and new API
            mode.

          All test cases pass with useNewAPI=true and false. I think
          this patch is mostly done now (I just have to update
          analysis/package.html and cleanup imports), unless we're not
          happy with the APIs.
          Please give me some feedback about this approach.
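
          To make the buffering use case concrete, here is a rough sketch of how a
          caching filter could use such a captured state. TokenStreamState comes from
          this patch; the capture()/restore() method names, the raw List (for Java 1.4)
          and the rest of the skeleton are assumptions for illustration only:

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.Iterator;
          import java.util.List;

          // Sketch only; assumes the TokenStream/TokenFilter/TokenStreamState
          // classes from this patch and that a filter shares its attributes
          // with its input stream.
          class CachingFilterSketch extends TokenFilter {
            private List cache;
            private Iterator iterator;

            CachingFilterSketch(TokenStream input) {
              super(input);
            }

            public boolean incrementToken() throws IOException {
              if (cache == null) {
                // first pass: consume the input and capture one state per token
                cache = new ArrayList();
                while (input.incrementToken()) {
                  cache.add(TokenStreamState.capture(input));
                }
                iterator = cache.iterator();
              }
              if (!iterator.hasNext()) {
                return false;
              }
              // replay: copy the buffered attribute values back into the stream
              ((TokenStreamState) iterator.next()).restore(this);
              return true;
            }
          }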

          Grant Ingersoll added a comment -

          Because mulitple people mentioned it would be better to merge
          TokenStream and Token into one class,

          They did? I only see Doug's comment, is there another one? Would be good to see the reasoning. Not that I'm against what you've done, just curious.

          Some questions (I've only glanced at the patch, so I still need to apply the patch):

          Should useNewAPI be static? Seems like I could want to use the new one in some cases, but not in others?????

          Why the static "constructors" on TokenStreamState? Why not just have constructors for that?

          Michael Busch added a comment -

          They did? I only see Doug's comment, is there another one?

          Mike did in LUCENE-1426. I also don't see a reason anymore to have Token as a separate class. I think the TokenStreamState for the buffering usecases is more elegant...

          Should useNewAPI be static? Seems like I could want to use the new one in some cases, but not in others?????

          wow, a lot of question marks
          Good point. I made it static so that I could set it to true or false in LuceneTestCase.java to run all tests in both modes, but you're right, it should be per TokenStream instance. So maybe a static method for the default setting, and a non-static that can be used to override it?

          Why the static "constructors" on TokenStreamState? Why not just have constructors for that?

          There is not really a reason. I thought TokenStreamState.capture(TokenStream) would look more intuitive. (the word capture indicates what happens)

          Grant Ingersoll added a comment -

          Good point. I made it static so that I could set it to true or false in LuceneTestCase.java to run all tests in both modes, but you're right, it should be per TokenStream instance. So maybe a static method for the default setting, and a non-static that can be used to override it?

          I don't think you need the static, just fold it into a constructor and have one that is a default, too.

          There is not really a reason. I thought TokenStreamState.capture(TokenStream) would look more intuitive. (the word capture indicates what happens)

          Constructors are about as intuitive as you can get for constructing a new object.

          Mike did in LUCENE-1426. I also don't see a reason anymore to have Token as a separate class. I think the TokenStreamState for the buffering usecases is more elegant...

          Thanks. I see it now. I think one thing that might help is to rename Attribute to be TokenAttribute. For me, I'm just trying to think how to explain this stuff to new users and it doesn't feel as cut-and-dried as saying "A TokenStream produces Tokens. Tokens have attributes like offset, position, etc.". Now we are saying "A TokenStream produces Attributes. Attributes describe a token with offset, position, etc.". Not a big deal, though. I like where this is heading.

          I want to check it out for Solr, too, to see if there are any effects.

          Have you done any performance testing? I would think it would be faster. Out of curiosity, do you think this would allow us to implement something like described in the paper here: http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/

          Michael McCandless added a comment -

          I like this new "idempotent" approach!

          I reviewed the patch. It looks good! A few comments:

          • It seems like AttributeSource is fairly generic – maybe it should
            go into util? It seems like we could eventually use the same
            approach for extensibility to index writing/reading APIs?
          • Performance – have you compared before/after simple tokenization
            speeds? Eg, tokenize all docs in Wikipedia w/ different
            analyzers. contrib/benchmark makes this easy now.
          • I assume we would statically default TokenStream.useNewAPI to
            true?
          • You are using a Token (perThreadlocalToken) and a
            BackwardsCompatibilityStream when indexing a NOT_ANALYZED Field,
            in DocInverterPerField – can you fix that code to use
            not-deprecated new stuff? EG maybe make a SingleTokenTokenStream
            or something that's re-used for this purpose?
          • Once we can set useNewAPI per TokenStream instance, what happens
            if someone builds up an analyzer chain that mixes old & new
            TokenStream/Filters? Are all permutations OK?
          • Are DocInverter.BackwardsCompatibilityStream and
            TokenStreamTestUtils.BackwardsCompatibleStream basically the same
            thing? If so, can we merge them? It seems like others may need
            this (when mixing old/new analyzer chains – the question above).
          • Can you add back-compat-future-freedom warnings to these new APIs?
            (Ie "these are new & subject to change")
          • Do we really need the separate subclass TokenStreamState? Should
            we absorb its methods into AttributeSource?
          • In DocInverterPerField you check if the attrSource has posIncr &
            offset attrs, instead of just looking them up and getting the
            default instance if it wasn't already there. This then requires
            you to default posIncr & offsets in two places –
            DocInverterPerField & each of the attr classes. Wouldn't it be
            cleaner to always look them up, and let the default instance of
            each be the one place that holds the default?
          • Some files are missing copyright header. Also, the comments at the
            top of TermAttribute.java & TypeAttribute.java should be fixed
            (they are about pos incr now). Also there is a "must should"
            in the javadocs in TokenStream.java.
          • On the new TokenStream.start method: is a TokenStream allowed to
            reject more than one start() invocation, meaning it cannot repeat
            itself? Just as we are grappling with this on DocIdSet.iterator(),
            it'd be good to address these semantics up front.
          • Many unit tests had to be changed with this patch. Was this
            entirely due to avoiding the newly deprecated APIs, or, are there
            any tests that fail if they were not changed? (This is a simple
            check to run, actually – only apply your core changes, then run
            all tests). If it's the former (which it should be), maybe we
            should leave a few tests using the deprecated APIs to ensure we
            don't break them before 3.0?
          Michael Busch added a comment -

          Thanks for the thorough review, Mike! I made most of the changes you suggested. However, I've some comments to some of your points:

          I assume we would statically default TokenStream.useNewAPI to
          true?

          I don't think it should default to true. How it currently works (and this answers also some of your other questions) is that one can't mix old and new streams and filters in the same chain. If someone enables the new API, then all streams and filters in one chain have to implement the new API. The reason is that you can't "simulate" the old API by calling the new methods in an efficient way. We would have to copy all values from the Token that next(Token) returns into the appropriate Attributes.
          This would slow down the ingestion performance and I think would affect backwards-compatibility: We guarantee that you can update from 2.4 to 2.9 without getting compile and runtime errors, but I think the performance should also not decrease significantly? That's the main reason why I left two implementations in all core streams and filters, for next(Token) and incrementToken().

          Are DocInverter.BackwardsCompatibilityStream and
          TokenStreamTestUtils.BackwardsCompatibleStream basically the same
          thing? If so, can we merge them? It seems like others may needs
          this (when mixing old/new analyzer chains - the question above).

          Sorry for the silly naming of these classes. They are not the same. The one in DocInverter is basically a wrapper with Attributes around an old Token. This is used so that almost all consumers in the indexer package can already use the new API in an efficient way.
          The one in the test package however is used, so that TokenStream and -Filter implementations in our testcases don't have to have implementations for both the old and new API. If the old next(Token) is called, then it calls incrementToken() and copies over the values from the Attributes to the Token. Since performance is not critical in testcases, I decided to take this approach to reduce code duplication in the tests.
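
          For clarity, here is a rough sketch of what such a helper could look like.
          The attribute classes are the ones from this patch, but their accessor names,
          the set of attributes copied, and the overall skeleton are assumptions:

          import java.io.IOException;

          // Sketch only. Test streams extend this and implement incrementToken()
          // against the new API; the inherited next(Token) copies the attribute
          // values back into the reusable Token for old-API callers.
          abstract class BackwardsCompatibleStreamSketch extends TokenStream {
            private final TermAttribute termAtt;
            private final OffsetAttribute offsetAtt;
            private final PositionIncrementAttribute posIncrAtt;

            protected BackwardsCompatibleStreamSketch() {
              termAtt = (TermAttribute) addAttribute(TermAttribute.class);
              offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
              posIncrAtt =
                  (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
            }

            public Token next(Token reusableToken) throws IOException {
              if (!incrementToken()) {
                return null;
              }
              // copy the attribute values over into the old-style Token
              reusableToken.clear();
              reusableToken.setTermBuffer(termAtt.termBuffer(), 0, termAtt.termLength());
              reusableToken.setStartOffset(offsetAtt.startOffset());
              reusableToken.setEndOffset(offsetAtt.endOffset());
              reusableToken.setPositionIncrement(posIncrAtt.getPositionIncrement());
              return reusableToken;
            }
          }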

          Many unit tests had to be changed with this patch. Was this entirely due to avoiding the newly deprecated APIs, or, are there any tests that fail if they were not changed? (This is a simple check to run, actually - only apply your core changes, then run all tests). If it's the former (which it should be), maybe we should leave a few tests using the deprecated APIs to ensure we don't break them before 3.0?

          That's actually the reason why I want to keep the static setUseNewAPI() method. When I test this patch I run all tests twice, in both modes. That way I can be sure to not break backwards compatibility.

          On the new TokenStream.start method: is a TokenStream allowed to
          not allow more than 1 start invocation? Meaning, it cannot repeat
          itself. Just like we are grappling with on DocIdset.iterator()
          it'd be good to address this semantics up front.

          Not sure I follow you here. Could you elaborate?

          Michael McCandless added a comment -

          ...one can't mix old and new streams and filters in the same chain

          Ahh, I see. So on upgrading if someone has a chain involving an
          external TokenFilter that does not implement the new API, then the
          whole chain should be downgraded to the old API until that filter is
          upgraded.

          OK I agree we should default it to false, then in 3.0 hardwire it to
          true (and remove all the deprecated methods).

          We guarantee that you can update from 2.4 to 2.9 without getting compile and runtime errors, but I think the performance should also not decrease significantly?

          I agree: I think this should be (is?) part of the back compat promise.

          They are not the same.

          OK this makes sense!

          When I test this patch I run all tests twice, in both modes. That way I can be sure to not break backwards compatibility.

          But: what if you revert all changes to all tests, and leave useNewAPI
          false? Nothing should fail, right? (This is the "true" back compat
          test, I think).

          Not sure I follow you here. Could you elaborate?

          In LUCENE-1427, we were discussing whether a DocIdSet should "declare"
          whether it can be iterated over more than once, or not, so that things
          that know they need to consume it more than once would know whether to
          use an intermediate cache. I'm not sure we'd need it, here, but since
          we are changing the API now I thought we should think about it.
          But... we can (and should) do this as a separate issue, if and when a
          real use case proves it out, so let's take it off the table for now.

          Hide
          Doug Cutting added a comment -

          > But: what if you revert all changes to all tests [ ... ]
          > This is the "true" back compat test, I think

          Perhaps we could formalize this. We could specify a subversion URL in build.xml that points to a release tag, the tag of the oldest release we need to be compatible with. We add a test that retrieves the tests from this tag and compiles and runs them.

          Michael McCandless added a comment -

          Perhaps we could formalize this. We could specify a subversion URL in build.xml that points to a release tag, the tag of the oldest release we need to be compatible with. We add a test that retrieves the tests from this tag and compiles and runs them.

          +1

          Hopefully someone who understands ant/build.xml very well (not me!) will do this.

          Once we have that working, we could then make it part of the nightly build.

          Michael Busch added a comment -

          Perhaps we could formalize this. We could specify a subversion URL in build.xml that points to a release tag, the tag of the oldest release we need to be compatible with. We add a test that retrieves the tests from this tag and compiles and runs them.

          I like this approach too. The only concern I have is about the package-private APIs. As we recently discussed they can always change, so if any test cases use package-private APIs that changed in trunk since the last release, then the old testcases would not even compile.

          However, I worked on a patch for this and will open a new issue soon.

          Grant Ingersoll added a comment -

          Good idea, Doug, on the back-compat testing. Something we should definitely add.

          Michael and I did some performance tests at ApacheCon on this, and it seems like it is as fast as the old way, but we should probably formalize this by adding to the Tokenize algorithm in the benchmarker so it is easier for people to verify it.

          I still don't see the point of the

          public static TokenStreamState capture(TokenStream from) 

          Just use the constructor. Wrapping a constructor in a static doesn't gain you anything except an extra function call.

          Michael Busch added a comment -
          • In DocInverterPerField you check if the attrSource has posIncr &
            offset attrs, instead of just looking them up and getting the
            default instance if it wasn't already there. This then requires
            you to default posIncr & offsets in two places -
            DocInverterPerField & each of the attr classes. Wouldn't it be
            cleaner to always look them up, and let the default instance of
            each be the one place that holds the default?

          So here's one thing about the new API that is not entirely clear to me yet. Let's say the consumer, in this case DocInverterPerField, wants to use the offset. You are right, it's not desirable to define default values in two places (the consumer and the attribute). Now the other alternative is to call addAttribute(OffsetAttribute.class) in the consumer. If no offset attribute is present, this will just add a default instance and the consumer can use its default value. So far so good. But what if there is e.g. a TokenFilter in the chain that checks input.hasAttribute(OffsetAttribute.class) and does something it would not do if that check returned false? It seems "something" is not clearly defined here yet. I think the "something" comes down to the difference between a producer (TokenStream/Filter) adding an attribute and a consumer adding it. Thoughts?

          Hmm or maybe in this case the consumer is not behaving correctly. If there is no offset attribute, then the consumer should not have to write any data structures that use an offset, right? That's why you currently have the start() method in TermVectorTermsWriterPerField that checks if offsets need to be written or not.
          I think the issue here is that one consumer, the DocInverterPerField, deals with one value, the offset, that multiple subconsumers might want to use. That's why we don't want to throw an exception in DocInverterPerField if OffsetAttribute is not present, because it can't know if this is really a misconfiguration or not. Only the actual subconsumer (in this case TermVectorTermsWriterPerField) is able to decide that. For example, if someone writes a new custom consumer that can only work if OffsetAttribute is present, then we would probably want that consumer to throw an exception if it's not present. But if the DocInverterPerField adds the OffsetAttribute to be able to use its default value, we would not get the exception.

          But maybe all this is fine and we should just say we strictly separate the Attributes from the configuration, which means that the knowledge about the presence or absence of an Attribute may not be used by anybody (filter or consumer) to determine anything about the configuration. This is how the TermVectors currently work. The configuration (whether or not to store positions and/or offsets) is stored in the Field, the data is carried by the OffsetAttribute. If we decide to define the API this way, do we even need AttributeSource.hasAttribute() and getAttribute()? Or wouldn't addAttribute() be sufficient then?
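
          For illustration only (not from the patch), here is a minimal sketch contrasting the two lookup styles discussed above; the method names follow the AttributeSource API as released in 2.9 and may differ slightly from this intermediate patch:

          import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
          import org.apache.lucene.util.AttributeSource;

          public class AttributeLookupStyles {

            // Style A: always add the attribute. If no producer put it there, the
            // consumer silently works with the attribute's own default values.
            static OffsetAttribute lookupWithDefault(AttributeSource source) {
              return (OffsetAttribute) source.addAttribute(OffsetAttribute.class);
            }

            // Style B: treat the presence of the attribute as configuration and
            // branch on it - the pattern whose semantics are debated above.
            static OffsetAttribute lookupIfConfigured(AttributeSource source) {
              if (source.hasAttribute(OffsetAttribute.class)) {
                return (OffsetAttribute) source.getAttribute(OffsetAttribute.class);
              }
              return null; // caller skips offset-dependent data structures
            }
          }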

          Michael Busch added a comment -

          It seems "something" is not clearly defined here yet.

          OK, I added this to the javadocs of TokenStream to make things clear:

          The workflow of the new TokenStream API is as follows:

          1. Instantiation of TokenStream/TokenFilters
          2. The consumer calls TokenStream#start() which triggers calls to #initialize() of the stream and all filters in the chain.
          3. Filters can use #initialize() to get or add attributes and store local references to the attribute.
          4. After #start() returns, the consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
          5. The consumer calls #incrementToken() until it returns false and consumes the attributes after each call.

          To make sure that filters and consumers know which attributes are available,
          the attributes must be added in the #initialize() call. Filters and
          consumers are not required to check for availability of attributes in #incrementToken().
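
          As an illustration of these steps, here is a minimal consumer sketch (not from the patch). It follows the workflow exactly as written above, so it uses start(); the concrete stream and attribute names (WhitespaceTokenizer, TermAttribute.term(), OffsetAttribute) are assumptions based on the released 2.9 classes and may differ from this intermediate patch:

          import java.io.IOException;
          import java.io.StringReader;

          import org.apache.lucene.analysis.LowerCaseFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.WhitespaceTokenizer;
          import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
          import org.apache.lucene.analysis.tokenattributes.TermAttribute;

          public class WorkflowSketch {
            public static void main(String[] args) throws IOException {
              // step 1: instantiate the chain
              TokenStream stream =
                  new LowerCaseFilter(new WhitespaceTokenizer(new StringReader("The Quick Brown Fox")));

              // step 2: start() lets the tokenizer and all filters run their initialize()
              stream.start();

              // step 4: look up attributes once and keep local references
              TermAttribute termAtt = (TermAttribute) stream.getAttribute(TermAttribute.class);
              OffsetAttribute offsetAtt = (OffsetAttribute) stream.getAttribute(OffsetAttribute.class);

              // step 5: consume; the same attribute instances are refilled for every token
              while (stream.incrementToken()) {
                System.out.println(termAtt.term() + " [" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + "]");
              }
              stream.close();
            }
          }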

          Michael Busch added a comment -

          Changes in this patch:

          • updated patch to current trunk
          • moved AttributeSource and Attribute to o.a.l.util
          • removed TokenStreamState and moved captureState() and restoreState() into AttributeSource (see the sketch at the end of this comment)
          • added non-static setUseNewAPI() method; added a warning that mixing of old and new streams/filters is not supported
          • DocInverterPerField now uses only non-deprecated classes for NOT_ANALYZED fields
          • Added warnings saying that the new APIs might change in the future
          • DocInverterPerField now adds PositionIncrementAttribute and OffsetAttribute in case they are not present in the stream and thus uses the default values defined in the attributes
          • added missing license headers and fixed javadocs
          • Removed TokenStreamTestUtils and references to Token from all unit tests (except the now also deprecated TestToken); LuceneTestCase enables the new API in setup() for all tests, so the current tests only test the new API; backwards-compatibility is tested with "ant test-tag" (see LUCENE-1440)
          • fixed some small bugs that I encountered after changing the unit tests

          All current and 2.4 tests pass.
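
          To illustrate the captureState()/restoreState() pair mentioned in the list above, here is a minimal caching-filter sketch (not part of the patch); the signatures follow AttributeSource as released in 2.9 and may differ from this intermediate version:

          import java.io.IOException;
          import java.util.Iterator;
          import java.util.LinkedList;
          import java.util.List;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.util.AttributeSource;

          /** Records the state of all attributes on the first pass and replays it after reset(). */
          public class SimpleCachingFilter extends TokenFilter {
            private final List cache = new LinkedList(); // holds AttributeSource.State snapshots
            private Iterator replay;                      // non-null while replaying

            public SimpleCachingFilter(TokenStream input) {
              super(input);
            }

            public boolean incrementToken() throws IOException {
              if (replay != null) {
                if (!replay.hasNext()) {
                  return false;
                }
                restoreState((AttributeSource.State) replay.next());
                return true;
              }
              if (!input.incrementToken()) {
                return false;
              }
              cache.add(captureState()); // snapshot every attribute value of this token
              return true;
            }

            public void reset() throws IOException {
              // deliberately does not reset the input: later passes replay the cache
              replay = cache.iterator();
            }
          }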

          Michael McCandless added a comment -

          Looks good!

          I'm seeing this failure (in test I just committed this AM). I think
          it's OK, because the new API is enabled for all tests and I'm using
          the old API with that analyzer?

              [junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery):	Caused an ERROR
              [junit] This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
              [junit] java.lang.IllegalArgumentException: This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
              [junit] 	at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:124)
              [junit] 	at org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
              [junit] 	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:144)
              [junit] 	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
              [junit] 	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
              [junit] 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:760)
              [junit] 	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:738)
              [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
              [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
              [junit] 	at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:304)
              [junit] 	at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:287)
              [junit] 	at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:315)
          

          Some other random questions:

          • I'm a little confused by 'start' and 'initialize' in TokenStream.
            DocInverterPerField (consumer of this API) calls
            analyzer.reusableTokenStream to get a token stream. Then it calls
            stream.reset(). Then it calls stream.start() which then calls
            stream.initialize(). Can we consolidate these (there are 4 places
            that we call to "start" the stream now)?
            EG why can't analyzer.reusableTokenStream() do the init internally
            in the new API? (Also, in StopFilter, initialize() sets termAtt &
            posIncrAtt, but I would think this only needs to happen once when
            that TokenFilter is created? BackCompatTokenStream adds the attrs
            in its ctor, which seems better.)
          • BackCompatTokenStream is calling attributes.put directly but all
            others use super's addAttribute.
          • Why is BackCompatTokenStream overriding so many methods? EG
            has/get/addAttribute – won't super do the same thing?
          • Maybe add reasons to some of the asserts, eg StopFilter has
            "assert termAtt != null", so maybe append to that ": initialize()
            wasn't called".
          Michael Busch added a comment -

          Looks good!

          Thanks for reviewing ... again!

          I'm a little confused by 'start' and 'initialize' in TokenStream.

          Good question. The only reason why I added two separate methods here was to enforce that TokenFilter#start() always calls input.start() first. I think otherwise this will be a pitfall for people who want to implement their own filters and override start(). (happened to me when I modified the tests a couple of times)
          On the other hand I agree that three methods (start(), initialize(), reset()) are confusing. I guess we can deprecate reset() in the future and have start() rewind the stream if supported by the stream. (I think you mentioned a boolean method could return true if rewind is supported?)
          So I'd be ok with collapsing start/initialize into one method if you prefer that. I'd just have to add an explicit warning to the javadocs in TokenFilter.

          I'm seeing this failure (in test I just committed this AM). I think
          it's OK, because the new API is enabled for all tests and I'm using
          the old API with that analyzer?

          Yeah, I'll have to modify the new test to use the new API.

          • BackCompatTokenStream is calling attributes.put directly but all
            others use super's addAttribute.
          • Why is BackCompatTokenStream overriding so many methods? EG
            has/get/addAttribute - won't super do the same thing?

          I had a different implementation of BCTS, so this is left-over. Of course you're right, I can simplify that class.

          Maybe add reasons to some of the asserts, eg StopFilter has
          "assert termAtt != null", so maybe append to that ": initialize()
          wasn't called".

          Yeah good idea, will do.

          Michael McCandless added a comment -

          So I'd be ok with collapsing start/initialize into one method if you prefer that.

          Or, could we just continue to use reset() for this purpose? I think it's ok via Javadocs to state clearly that if you override reset you have to be sure to call super.reset().

          And then tokenizers like CharTokenizer should not addAttribute on every reset/initialize, right? They should just do it once in their ctor?
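
          For illustration, here is a minimal sketch of that pattern in a hypothetical filter (not from the patch): the attribute is added exactly once in the constructor, and reset() always delegates to super.reset() before clearing its own state. Class and method names (TermAttribute, setTermBuffer, etc.) follow the released 2.9 API and are assumptions here:

          import java.io.IOException;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.TermAttribute;

          /** Hypothetical filter that trims surrounding whitespace from each term. */
          public final class TrimFilter extends TokenFilter {
            private final TermAttribute termAtt;
            private int trimmed; // example of per-stream state that reset() must clear

            public TrimFilter(TokenStream input) {
              super(input);
              // attributes are added exactly once, in the constructor ...
              termAtt = (TermAttribute) addAttribute(TermAttribute.class);
            }

            public boolean incrementToken() throws IOException {
              if (!input.incrementToken()) {
                return false;
              }
              String trimmedTerm = termAtt.term().trim();
              if (trimmedTerm.length() != termAtt.termLength()) {
                trimmed++;
              }
              termAtt.setTermBuffer(trimmedTerm);
              return true;
            }

            public void reset() throws IOException {
              super.reset(); // ... and reset() always calls up the chain first
              trimmed = 0;   // then clears this filter's own per-stream state
            }
          }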

          Michael Busch added a comment -

          Or, could we just continue to use reset() for this purpose? I think it's ok via Javadocs to state clearly that if you override reset you have to be sure to call super.reset().

          And then tokenizers like CharTokenizer should not addAttribute on every reset/initialize, right? They should just do it once in their ctor?

          OK I like that! Let's do it that way, it's simpler and less confusing.

          I'm making the change now, then hopefully the patch is ready for committing...

          Michael Busch added a comment -

          I removed start() and initialize().
          I also fixed some javadocs and added a pretty long section to package.html in the analysis package, describing the new APIs.

          All trunk and 2.4 tests pass.

          I guess I'll wait until LUCENE-1448 is committed (and update this patch again... ). I'll try to review 1448 later today (need to leave for a few hours now...)

          Michael Busch added a comment -

          OK we decided to commit this before LUCENE-1448.

          I'll wait another day and then commit this if nobody objects.

          Michael Busch added a comment -

          Committed revision 718798.

          Thanks everyone who helped here with reviews and comments!

          Hoss Man added a comment -

          FWIW: In previous versions of Lucene, a class like this would have been legal...

          import java.io.IOException;

          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenFilter;

          public final class DummyStream extends TokenFilter {
            int count = 5;
            public DummyStream() { super(null); }  // tolerated by the old TokenFilter
            public final Token next() {
              return (0 < count--) ? new Token("YES", 0, 0) : null;
            }
            public void reset() throws IOException { count = 5; }
            public void close() throws IOException {}
          }
          

          ...there wouldn't have been much reason to write a subclass like this (better to subclass TokenStream directly for this type of use case), but it would have worked. With the new AttributeSource class, TokenFilter actually cares about the TokenStream constructor arg during construction, and a constructor like this will cause a NullPointerException.

          It's an esoteric enough possibility that we probably don't need to worry about it, but I'm documenting it for posterity. (I only noticed because Solr had a test case that constructed a stock TokenFilter with a null "input" arg just to test the object in other ways.)

          If anyone does run into a problem with something like this, your best solution is probably to subclass TokenStream directly; it has a no-arg constructor.
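
          For example, here is a hypothetical sketch (not from the issue) of the same dummy stream rewritten to extend TokenStream directly and to use incrementToken() with a TermAttribute instead of next():

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.TermAttribute;

          public final class DummyStream extends TokenStream {
            private final TermAttribute termAtt;
            private int count = 5;

            public DummyStream() {
              // TokenStream's no-arg constructor sets up its own attribute source
              termAtt = (TermAttribute) addAttribute(TermAttribute.class);
            }

            public boolean incrementToken() {
              if (0 < count--) {
                termAtt.setTermBuffer("YES");
                return true;
              }
              return false;
            }

            public void reset() {
              count = 5;
            }
          }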

          for searching...

          java.lang.NullPointerException
          	at org.apache.lucene.util.AttributeSource.<init>(AttributeSource.java:69)
          	at org.apache.lucene.analysis.TokenStream.<init>(TokenStream.java:91)
          	at org.apache.lucene.analysis.TokenFilter.<init>(TokenFilter.java:42)
          	...
          
          Grant Ingersoll added a comment -

          Outstanding:

          • contrib streams and filters

          What happened to fixing the contribs? Seems incomplete without it.

          Michael Busch added a comment -

          See LUCENE-1460.

            People

            • Assignee:
              Michael Busch
            • Reporter:
              Michael Busch
            • Votes:
              0
            • Watchers:
              0
