Lucene - Core
LUCENE-1693

AttributeSource/TokenStream API improvements

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This patch makes the following improvements to AttributeSource and
      TokenStream/Filter:

      • introduces interfaces for all Attributes. The corresponding
        implementations have the postfix 'Impl', e.g. TermAttribute and
        TermAttributeImpl. AttributeSource now has a factory for creating
        the Attribute instances; the default implementation looks for
        implementing classes with the postfix 'Impl'. Token now implements
        all 6 TokenAttribute interfaces.
      • new method added to AttributeSource:
        addAttributeImpl(AttributeImpl). Using reflection it walks up in the
        class hierarchy of the passed in object and finds all interfaces
        that the class or superclasses implement and that extend the
        Attribute interface. It then adds the interface->instance mappings
        to the attribute map for each of the found interfaces.
      • removes the set/getUseNewAPI() methods (including the standard
        ones). Instead, it is now enough to implement only the new API;
        if an old TokenStream still implements the old API
        (next()/next(Token)), it is wrapped automatically. The delegation
        path is determined via reflection (the patch determines which of
        the three methods was overridden).
      • Token is no longer deprecated; instead, it implements all 6 standard
        token interfaces (see above). The wrapper for next() and next(Token)
        uses this to automatically map all attribute interfaces to one
        TokenWrapper instance (implementing all 6 interfaces) that contains
        a Token instance. next() and next(Token) exchange the inner Token
        instance as needed. For the new incrementToken(), only one
        TokenWrapper instance is visible, delegating to the current reusable
        Token. This API also preserves custom Token subclasses that may be
        created by very special token streams (see example in Backwards-Test).
      • AttributeImpl now has a default implementation of toString that uses
        reflection to print out the values of the attributes in a default
        formatting. This makes it a bit easier to implement AttributeImpl,
        because toString() was declared abstract before.
      • Cloning is now done much more efficiently in
        captureState. The method figures out which unique AttributeImpl
        instances are contained as values in the attributes map, because
        those are the ones that need to be cloned. It creates a single
        linked list that supports deep cloning (in the inner class
        AttributeSource.State). AttributeSource keeps track of when this
        state changes, i.e. whenever new attributes are added to the
        AttributeSource. Only in that case will captureState recompute the
        state, otherwise it will simply clone the precomputed state and
        return the clone. restoreState(AttributeSource.State) walks the
        linked list and uses the copyTo() method of AttributeImpl to copy
        all values over into the attribute that the source stream
        (e.g. SinkTokenizer) uses.
      • Tee- and SinkTokenizer are deprecated because they use
        Token instances for caching. This is not compatible with the new API,
        which uses AttributeSource.State objects. You can still use the old
        deprecated ones, but new features provided by new Attribute types
        may get lost in the chain. The replacement is a new TeeSinkTokenFilter,
        which has a factory to create new Sink instances that have compatible
        attributes. Sink instances created by one Tee can also be added to
        another Tee, as long as the attribute implementations are compatible
        (it is not possible to add a sink from a tee using one Token instance
        to a tee using the six separate attribute impls). If they are not
        compatible, a UOE (UnsupportedOperationException) is thrown.
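
      The reflection walk described for addAttributeImpl(AttributeImpl) can be
      illustrated with a minimal, self-contained sketch. It does not use the
      Lucene classes: Attribute, TermAttribute, OffsetAttribute,
      TermAttributeImpl, MiniToken and MiniAttributeSource below are
      hypothetical stand-ins, and keeping an already registered mapping
      (putIfAbsent) is an assumption of this sketch, not something the
      description states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical marker interface standing in for Lucene's Attribute. */
interface Attribute {}

interface TermAttribute extends Attribute {
  String term();
  void setTerm(String term);
}

interface OffsetAttribute extends Attribute {
  int startOffset();
  void setStartOffset(int start);
}

/** Implementation named with the 'Impl' suffix, as in the description. */
class TermAttributeImpl implements TermAttribute {
  private String term = "";
  public String term() { return term; }
  public void setTerm(String term) { this.term = term; }
}

/** One object serving several attribute interfaces, comparable to Token. */
class MiniToken extends TermAttributeImpl implements OffsetAttribute {
  private int start;
  public int startOffset() { return start; }
  public void setStartOffset(int start) { this.start = start; }
}

class MiniAttributeSource {
  private final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

  /** Walks up the class hierarchy and maps each Attribute sub-interface to impl. */
  void addAttributeImpl(Attribute impl) {
    for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
      // getInterfaces() returns only the interfaces declared directly on c.
      for (Class<?> iface : c.getInterfaces()) {
        if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
          // Assumption of this sketch: an existing mapping is left untouched.
          attributes.putIfAbsent(iface.asSubclass(Attribute.class), impl);
        }
      }
    }
  }

  <A extends Attribute> A getAttribute(Class<A> type) {
    return type.cast(attributes.get(type));
  }

  int size() { return attributes.size(); }
}

class AttributeMapDemo {
  public static void main(String[] args) {
    MiniAttributeSource source = new MiniAttributeSource();
    MiniToken token = new MiniToken();
    source.addAttributeImpl(token);
    // Both interfaces resolve to the same instance.
    System.out.println(source.getAttribute(TermAttribute.class) == token);
    System.out.println(source.getAttribute(OffsetAttribute.class) == token);
  }
}
```

      In this sketch TermAttribute is found on the superclass and
      OffsetAttribute on the class itself, so a single instance ends up behind
      two map entries.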

      Cloning performance can be greatly improved if a TokenStream does
      not use multiple AttributeImpl instances. A user can,
      e.g., simply add a Token instance to the stream instead of the individual
      attributes. Or the user could implement a subclass of AttributeImpl that
      implements exactly the Attribute interfaces needed. I think this
      (addAttributeImpl) should be considered an expert API, as this manual
      optimization is only needed if cloning performance is crucial. I ran
      some quick performance tests using Tee/Sink tokenizers (which do
      cloning) and the performance was roughly 20% faster with the new
      API. I'll run some more performance tests and post more numbers then.
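
      To make the cost argument concrete, here is a small sketch of the
      captureState/restoreState idea: the state is a singly linked list over
      the unique impl instances, so one instance serving several attributes is
      cloned only once. All names (MiniAttributeImpl, TermAndTypeImpl,
      MiniStateSource and the string keys) are hypothetical, and restoring by
      walking the captured and live lists in lockstep is a simplification
      assumed here, not a description of the patch.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for AttributeImpl: cloneable, with copyTo(). */
abstract class MiniAttributeImpl implements Cloneable {
  abstract void copyTo(MiniAttributeImpl target);

  @Override
  public MiniAttributeImpl clone() {
    try {
      return (MiniAttributeImpl) super.clone();
    } catch (CloneNotSupportedException e) {
      throw new AssertionError(e); // cannot happen: Cloneable is implemented
    }
  }
}

/** One impl backing two attribute keys, comparable to a Token instance. */
class TermAndTypeImpl extends MiniAttributeImpl {
  String term = "";
  String type = "word";

  @Override
  void copyTo(MiniAttributeImpl target) {
    TermAndTypeImpl t = (TermAndTypeImpl) target;
    t.term = term;
    t.type = type;
  }
}

class MiniStateSource {
  /** Node of the singly linked list; deepClone() copies the whole chain. */
  static final class State {
    MiniAttributeImpl attribute;
    State next;

    State deepClone() {
      State copy = new State();
      copy.attribute = attribute.clone();
      copy.next = (next == null) ? null : next.deepClone();
      return copy;
    }
  }

  private final Map<String, MiniAttributeImpl> attributes = new LinkedHashMap<>();
  private State cached; // null means "attributes changed, recompute"

  void put(String attributeName, MiniAttributeImpl impl) {
    attributes.put(attributeName, impl);
    cached = null; // invalidate the precomputed list
  }

  /** Builds the list over unique impl instances (by identity, not equals). */
  private State computeState() {
    Set<MiniAttributeImpl> seen =
        Collections.newSetFromMap(new IdentityHashMap<MiniAttributeImpl, Boolean>());
    State head = null;
    State tail = null;
    for (MiniAttributeImpl impl : attributes.values()) {
      if (!seen.add(impl)) continue; // same instance already listed
      State node = new State();
      node.attribute = impl;
      if (head == null) head = node; else tail.next = node;
      tail = node;
    }
    return head;
  }

  State captureState() {
    if (cached == null) cached = computeState();
    return (cached == null) ? null : cached.deepClone();
  }

  /** Simplification: walks the captured and the live list in lockstep. */
  void restoreState(State state) {
    if (cached == null) cached = computeState();
    State live = cached;
    while (state != null && live != null) {
      state.attribute.copyTo(live.attribute);
      state = state.next;
      live = live.next;
    }
  }

  static int length(State state) {
    int n = 0;
    for (State s = state; s != null; s = s.next) n++;
    return n;
  }
}

class CaptureStateDemo {
  public static void main(String[] args) {
    MiniStateSource source = new MiniStateSource();
    TermAndTypeImpl impl = new TermAndTypeImpl();
    source.put("TermAttribute", impl);
    source.put("TypeAttribute", impl); // same instance under a second key
    impl.term = "foo";
    MiniStateSource.State saved = source.captureState();
    impl.term = "bar";
    source.restoreState(saved);
    System.out.println(MiniStateSource.length(saved) + " " + impl.term);
  }
}
```

      With two keys pointing at one instance, the captured list has a single
      node, so a capture performs one clone instead of two.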

      Note also that when we add serialization to the Attributes, e.g. for
      supporting storing serialized TokenStreams in the index, then the
      serialization should benefit even more from the new API
      than cloning.

      This issue contains one backwards-compatibility break:
      TokenStreams/Filters/Tokenizers should normally be final
      (see LUCENE-1753 for the explanation). Some of these core classes are
      not final, so one could override the next() or next(Token) methods.
      In this case, the backwards-wrapper would automatically use
      incrementToken(), because it is implemented, so the overridden
      method is never called. To protect users from errors that are not
      visible during compilation or testing (the streams just behave wrongly),
      this patch makes all implementation methods final
      (next(), next(Token), incrementToken()) whenever the class
      itself is not final. This is a BW break, but users will clearly see
      that they have done something unsupported and should rather
      create a custom TokenFilter with their additional implementation
      (instead of extending a core implementation).
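
      The problem can be reproduced with a small sketch of reflection-based
      delegation. MiniTokenStream and the subclasses below are hypothetical
      stand-ins (tokens are plain Strings). Preferring incrementToken()
      follows the description above; the order between next(Token) and next()
      is an assumption of this sketch.

```java
/** Hypothetical stand-in for TokenStream; tokens are plain Strings here. */
class MiniTokenStream {
  public boolean incrementToken() { throw new UnsupportedOperationException(); }
  public String next(String reusableToken) { throw new UnsupportedOperationException(); }
  public String next() { throw new UnsupportedOperationException(); }

  /** True if the method is declared somewhere below MiniTokenStream. */
  static boolean isOverridden(Class<? extends MiniTokenStream> clazz,
                              String name, Class<?>... parameterTypes) {
    try {
      return clazz.getMethod(name, parameterTypes).getDeclaringClass()
          != MiniTokenStream.class;
    } catch (NoSuchMethodException e) {
      throw new AssertionError(e); // all three methods exist on the base class
    }
  }

  /** Picks the entry point a compatibility wrapper would delegate to. */
  static String delegateFor(Class<? extends MiniTokenStream> clazz) {
    if (isOverridden(clazz, "incrementToken")) return "incrementToken()";
    // Assumed order for the two old methods; not taken from the patch.
    if (isOverridden(clazz, "next", String.class)) return "next(Token)";
    if (isOverridden(clazz, "next")) return "next()";
    throw new UnsupportedOperationException(
        clazz.getName() + " implements none of the three methods");
  }
}

/** Old-API stream: only next() is implemented. */
class OldApiStream extends MiniTokenStream {
  @Override public String next() { return null; }
}

/** Non-final "core" class that already implements the new API. */
class CoreStream extends MiniTokenStream {
  @Override public boolean incrementToken() { return false; }
}

/** User subclass that overrides an old method of the core class. */
class UserSubclass extends CoreStream {
  @Override public String next() { return "never reached through the wrapper"; }
}

class DelegationDemo {
  public static void main(String[] args) {
    System.out.println(MiniTokenStream.delegateFor(OldApiStream.class)); // next()
    System.out.println(MiniTokenStream.delegateFor(UserSubclass.class)); // incrementToken()
  }
}
```

      UserSubclass compiles and overrides next(), yet the inherited
      incrementToken() wins, so the override is silently ignored. Declaring
      the methods final in CoreStream would turn this into a compile error.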

      For further changes to contrib token streams, the following procedure should be used:

      • rewrite the next(Token)/next() implementations and replace them with the new API
      • if the class is final, no next(Token)/next() methods are needed (they must be removed)
      • if the class is non-final, add the following methods to the class:
              /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
               * not be overridden. Delegates to the backwards compatibility layer. */
              public final Token next(final Token reusableToken) throws java.io.IOException {
                return super.next(reusableToken);
              }
        
              /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
               * not be overridden. Delegates to the backwards compatibility layer. */
              public final Token next() throws java.io.IOException {
                return super.next();
              }
        

        The incrementToken() method must also be final in this case
        (as must the new end() method from LUCENE-1448).

      1. lucene-1693.patch
        111 kB
        Michael Busch
      2. LUCENE-1693.patch
        96 kB
        Uwe Schindler
      3. TestCompatibility.java
        3 kB
        Michael Busch
      4. LUCENE-1693.patch
        106 kB
        Uwe Schindler
      5. LUCENE-1693.patch
        109 kB
        Uwe Schindler
      6. TestCompatibility.java
        3 kB
        Michael Busch
      7. LUCENE-1693.patch
        109 kB
        Uwe Schindler
      8. LUCENE-1693.patch
        126 kB
        Uwe Schindler
      9. TestCompatibility.java
        8 kB
        Michael Busch
      10. LUCENE-1693.patch
        125 kB
        Uwe Schindler
      11. TestCompatibility.java
        8 kB
        Michael Busch
      12. LUCENE-1693.patch
        140 kB
        Uwe Schindler
      13. LUCENE-1693.patch
        142 kB
        Uwe Schindler
      14. LUCENE-1693.patch
        172 kB
        Uwe Schindler
      15. TestAPIBackwardsCompatibility.java
        17 kB
        Michael Busch
      16. LUCENE-1693.patch
        172 kB
        Uwe Schindler
      17. LUCENE-1693.patch
        172 kB
        Uwe Schindler
      18. lucene-1693.patch
        174 kB
        Michael Busch
      19. LUCENE-1693.patch
        196 kB
        Uwe Schindler
      20. lucene-1693.patch
        194 kB
        Michael Busch
      21. LUCENE-1693.patch
        199 kB
        Uwe Schindler
      22. PerfTest3.java
        4 kB
        Uwe Schindler
      23. LUCENE-1693.patch
        211 kB
        Uwe Schindler
      24. LUCENE-1693.patch
        219 kB
        Uwe Schindler
      25. lucene-1693.patch
        224 kB
        Michael Busch
      26. LUCENE-1693-TokenizerAttrFactory.patch
        1 kB
        Uwe Schindler

        Issue Links

          Activity

          Michael Busch created issue -
          Michael Busch made changes -
          Field Original Value New Value
          Attachment lucene-1693.patch [ 12410775 ]
          Grant Ingersoll made changes -
          Link This issue relates to LUCENE-1697 [ LUCENE-1697 ]
          Grant Ingersoll made changes -
          Link This issue relates to LUCENE-1695 [ LUCENE-1695 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12410874 ]
          Michael Busch made changes -
          Attachment TestCompatibility.java [ 12410906 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12410926 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12410926 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12410927 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12410977 ]
          Michael Busch made changes -
          Attachment TestCompatibility.java [ 12410983 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411006 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411006 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411031 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411069 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411031 ]
          Mark Miller made changes -
          Link This issue blocks LUCENE-1696 [ LUCENE-1696 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411221 ]
          Michael Busch made changes -
          Attachment TestCompatibility.java [ 12411610 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12411640 ]
          Michael Busch made changes -
          Attachment TestCompatibility.java [ 12411706 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12412979 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413203 ]
          Uwe Schindler made changes -
          Link This issue blocks LUCENE-1460 [ LUCENE-1460 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413451 ]
          Michael Busch made changes -
          Attachment TestAPIBackwardsCompatibility.java [ 12413526 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413529 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413531 ]
          Michael Busch made changes -
          Attachment lucene-1693.patch [ 12413662 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413689 ]
          Michael Busch made changes -
          Attachment lucene-1693.patch [ 12413783 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413875 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413964 ]
          Attachment PerfTest3.java [ 12413965 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413967 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12413964 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693.patch [ 12414148 ]
          Michael Busch made changes -
          Attachment lucene-1693.patch [ 12414170 ]
          Uwe Schindler made changes -
          Description This patch makes the following improvements to AttributeSource and
          TokenStream/Filter:

          - removes the set/getUseNewAPI() methods (including the standard
            ones). Instead by default incrementToken() throws a subclass of
            UnsupportedOperationException. The indexer tries to call
            incrementToken() initially once to see if the exception is thrown;
            if so, it falls back to the old API.

          - introduces interfaces for all Attributes. The corresponding
            implementations have the postfix 'Impl', e.g. TermAttribute and
            TermAttributeImpl. AttributeSource now has a factory for creating
            the Attribute instances; the default implementation looks for
            implementing classes with the postfix 'Impl'. Token now implements
            all 6 TokenAttribute interfaces.

          - new method added to AttributeSource:
            addAttributeImpl(AttributeImpl). Using reflection it walks up in the
            class hierarchy of the passed in object and finds all interfaces
            that the class or superclasses implement and that extend the
            Attribute interface. It then adds the interface->instance mappings
            to the attribute map for each of the found interfaces.

          - AttributeImpl now has a default implementation of toString that uses
            reflection to print out the values of the attributes in a default
            formatting. This makes it a bit easier to implement AttributeImpl,
            because toString() was declared abstract before.

          - Cloning is now done much more efficiently in
            captureState. The method figures out which unique AttributeImpl
            instances are contained as values in the attributes map, because
            those are the ones that need to be cloned. It creates a single
            linked list that supports deep cloning (in the inner class
            AttributeSource.State). AttributeSource keeps track of when this
            state changes, i.e. whenever new attributes are added to the
            AttributeSource. Only in that case will captureState recompute the
            state, otherwise it will simply clone the precomputed state and
            return the clone. restoreState(AttributeSource.State) walks the
            linked list and uses the copyTo() method of AttributeImpl to copy
            all values over into the attribute that the source stream
            (e.g. SinkTokenizer) uses.

          The cloning performance can be greatly improved if not multiple
          AttributeImpl instances are used in one TokenStream. A user can
          e.g. simply add a Token instance to the stream instead of the individual
          attributes. Or the user could implement a subclass of AttributeImpl that
          implements exactly the Attribute interfaces needed. I think this
          should be considered an expert API (addAttributeImpl), as this manual
          optimization is only needed if cloning performance is crucial. I ran
          some quick performance tests using Tee/Sink tokenizers (which do
          cloning) and the performance was roughly 20% faster with the new
          API. I'll run some more performance tests and post more numbers then.

          Note also that when we add serialization to the Attributes, e.g. for
          supporting storing serialized TokenStreams in the index, then the
          serialization should benefit even significantly more from the new API
          than cloning.

          Also, the TokenStream API does not change, except for the removal
          of the set/getUseNewAPI methods. So the patches in LUCENE-1460
          should still work.

          All core tests pass, however, I need to update all the documentation
          and also add some unit tests for the new AttributeSource
          functionality. So this patch is not ready to commit yet, but I wanted
          to post it already for some feedback.
          Uwe Schindler made changes -
          Description
          This patch makes the following improvements to AttributeSource and
          TokenStream/Filter:

          - introduces interfaces for all Attributes. The corresponding
            implementations have the postfix 'Impl', e.g. TermAttribute and
            TermAttributeImpl. AttributeSource now has a factory for creating
            the Attribute instances; the default implementation looks for
            implementing classes with the postfix 'Impl'. Token now implements
            all 6 TokenAttribute interfaces.

          - new method added to AttributeSource:
            addAttributeImpl(AttributeImpl). Using reflection it walks up in the
            class hierarchy of the passed in object and finds all interfaces
            that the class or superclasses implement and that extend the
            Attribute interface. It then adds the interface->instance mappings
            to the attribute map for each of the found interfaces.

          - removes the set/getUseNewAPI() methods (including the standard
            ones). Instead it is now enough to only implement the new API,
            if one old TokenStream implements still the old API (next()/next(Token)),
            it is wrapped automatically. The delegation path is determined via
            reflection (the patch determines, which of the three methods was
            overridden).

          - Token is no longer deprecated, instead it implements all 6 standard
            token interfaces (see above). The wrapper for next() and next(Token)
            uses this, to automatically map all attribute interfaces to one
            TokenWrapper instance (implementing all 6 interfaces), that contains
            a Token instance. next() and next(Token) exchange the inner Token
            instance as needed. For the new incrementToken(), only one
            TokenWrapper instance is visible, delegating to the current reusable
            Token. This API also preserves custom Token subclasses, that may be
            created by very special token streams (see example in Backwards-Test).

          - AttributeImpl now has a default implementation of toString that uses
            reflection to print out the values of the attributes in a default
            formatting. This makes it a bit easier to implement AttributeImpl,
            because toString() was declared abstract before.

          - Cloning is now done much more efficiently in
            captureState. The method figures out which unique AttributeImpl
            instances are contained as values in the attributes map, because
            those are the ones that need to be cloned. It creates a single
            linked list that supports deep cloning (in the inner class
            AttributeSource.State). AttributeSource keeps track of when this
            state changes, i.e. whenever new attributes are added to the
            AttributeSource. Only in that case will captureState recompute the
            state, otherwise it will simply clone the precomputed state and
            return the clone. restoreState(AttributeSource.State) walks the
            linked list and uses the copyTo() method of AttributeImpl to copy
            all values over into the attribute that the source stream
            (e.g. SinkTokenizer) uses.

          - Tee- and SinkTokenizer were deprecated because they use
            Token instances for caching. This is not compatible with the new API,
            which uses AttributeSource.State objects. You can still use the old
            deprecated ones, but new features provided by new Attribute types
            may get lost in the chain. The replacement is a new TeeSinkTokenFilter,
            which has a factory to create new Sink instances that have compatible
            attributes. Sink instances created by one Tee can also be added to
            another Tee, as long as the attribute implementations are compatible.
            It is not possible to add a sink from a tee using one Token instance
            to a tee using the six separate attribute impls; in this case an
            UnsupportedOperationException (UOE) is thrown.
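
To illustrate the captureState/restoreState item above, here is a minimal, self-contained sketch. It is not the code from the patch: SketchImpl, SketchState and SketchSource are hypothetical stand-ins for AttributeImpl, AttributeSource.State and AttributeSource, and restoring by list position is a simplifying assumption. It only shows the idea of a cached linked list that is rebuilt when attributes are added and deep-cloned on every capture.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AttributeImpl: cloneable, and able to copy its value into a target.
class SketchImpl implements Cloneable {
  String value;
  SketchImpl(String value) { this.value = value; }
  void copyTo(SketchImpl target) { target.value = this.value; }
  @Override public SketchImpl clone() {
    try { return (SketchImpl) super.clone(); }
    catch (CloneNotSupportedException e) { throw new AssertionError(e); }
  }
}

// Hypothetical stand-in for AttributeSource.State: one node of a singly linked list.
class SketchState {
  SketchImpl attribute;
  SketchState next;

  // Deep clone: copies every node and the attribute instance it holds.
  SketchState deepClone() {
    SketchState copy = new SketchState();
    copy.attribute = attribute.clone();
    if (next != null) copy.next = next.deepClone();
    return copy;
  }
}

// Hypothetical stand-in for AttributeSource.
class SketchSource {
  private final List<SketchImpl> impls = new ArrayList<SketchImpl>();
  private SketchState cached;   // null means the list must be rebuilt
  int recomputations = 0;       // counts rebuilds, for illustration only

  void addImpl(SketchImpl impl) {
    impls.add(impl);
    cached = null;              // adding an attribute invalidates the precomputed list
  }

  SketchState captureState() {
    if (impls.isEmpty()) return null;
    if (cached == null) {
      recomputations++;
      SketchState head = null, tail = null;
      for (SketchImpl impl : impls) {   // the nodes reference the live instances
        SketchState node = new SketchState();
        node.attribute = impl;
        if (head == null) head = node; else tail.next = node;
        tail = node;
      }
      cached = head;
    }
    return cached.deepClone();  // the clone snapshots the current values
  }

  // Assumption of this sketch: nodes are matched to the live instances by position.
  void restoreState(SketchState state) {
    int i = 0;
    for (SketchState s = state; s != null; s = s.next) {
      s.attribute.copyTo(impls.get(i++));
    }
  }
}
```

In this sketch, two consecutive captureState() calls rebuild the list only once, and restoreState() copies the captured values back into the live instances via copyTo().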

          Cloning performance can be greatly improved if a TokenStream does not
          use multiple AttributeImpl instances. A user can, for example, simply
          add a Token instance to the stream instead of the individual
          attributes. Alternatively, the user could implement a subclass of AttributeImpl that
          implements exactly the Attribute interfaces needed. I think this
          (addAttributeImpl) should be considered an expert API, as this manual
          optimization is only needed if cloning performance is crucial. I ran
          some quick performance tests using Tee/Sink tokenizers (which do
          cloning), and the new API was roughly 20% faster. I'll run some more
          performance tests and post more numbers then.
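
As a rough illustration of why a single implementation object helps, the sketch below mimics the interface walk described for addAttributeImpl. All names (SketchAttribute, SketchTermAttribute, SketchOffsetAttribute, SketchToken, SketchAttributeSource) are hypothetical stand-ins, not the classes from the patch. The point is only that one instance implementing several attribute interfaces ends up as a single unique value in the attribute map, so only one object would need to be cloned.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the Attribute marker interface and two attribute interfaces.
interface SketchAttribute {}
interface SketchTermAttribute extends SketchAttribute { String term(); }
interface SketchOffsetAttribute extends SketchAttribute { int startOffset(); }

// One implementation object backing both interfaces, analogous to adding a Token.
class SketchToken implements SketchTermAttribute, SketchOffsetAttribute {
  public String term() { return "lucene"; }
  public int startOffset() { return 0; }
}

class SketchAttributeSource {
  // interface -> instance mappings
  private final Map<Class<?>, Object> attributes = new LinkedHashMap<Class<?>, Object>();

  // Walks up the class hierarchy of impl and registers every directly declared
  // interface that extends the marker interface.
  void addAttributeImpl(Object impl) {
    for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
      for (Class<?> iface : c.getInterfaces()) {
        if (iface != SketchAttribute.class && SketchAttribute.class.isAssignableFrom(iface)) {
          if (!attributes.containsKey(iface)) attributes.put(iface, impl);
        }
      }
    }
  }

  Object getAttribute(Class<?> iface) { return attributes.get(iface); }

  // Number of distinct instances in the map, i.e. how many objects a clone would touch.
  int uniqueImplCount() {
    Set<Object> unique = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
    unique.addAll(attributes.values());
    return unique.size();
  }
}
```

With one SketchToken added, both interfaces resolve to the same object and uniqueImplCount() is 1; with two separate implementation classes it would be 2.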

          Note also that when we add serialization to the Attributes, e.g. to
          support storing serialized TokenStreams in the index, the
          serialization should benefit even more from the new API
          than cloning does.

          This issue contains one backwards-compatibility break:
          TokenStreams/Filters/Tokenizers should normally be final
          (see LUCENE-1753 for the explanation). Some of these core classes are
          not final, so one could override the next() or next(Token) methods.
          In this case, the backwards-wrapper would automatically use
          incrementToken(), because it is implemented, so the overridden
          method is never called. To protect users from errors that are not visible
          during compilation or testing (the streams just behave wrongly),
          this patch makes all implementation methods final
          (next(), next(Token), incrementToken()) whenever the class
          itself is not final. This is a BW break, but users will clearly see
          that they have done something unsupported and should instead
          create a custom TokenFilter with their additional implementation
          (rather than extending a core implementation).
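
The pitfall described above can be reproduced with plain Java reflection. The following is only a sketch under assumptions: SketchStream, SketchCoreFilter and SketchUserFilter are hypothetical classes, and the overrides() helper is one possible way to detect an overridden method, not necessarily how the patch does it.

```java
// Hypothetical base class standing in for TokenStream with its three token-producing methods.
class SketchStream {
  public boolean incrementToken() { return false; }
  public Object next() { return null; }
  public Object next(Object reusable) { return null; }

  // True if clazz or one of its ancestors below SketchStream overrides the named method.
  static boolean overrides(Class<? extends SketchStream> clazz, String name, Class<?>... params) {
    try {
      return clazz.getMethod(name, params).getDeclaringClass() != SketchStream.class;
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(e);
    }
  }
}

// A non-final "core" stream that already implements the new API.
class SketchCoreFilter extends SketchStream {
  @Override public boolean incrementToken() { return true; }
}

// A user subclass that overrides only the old API.
class SketchUserFilter extends SketchCoreFilter {
  @Override public Object next() { return "custom"; }
}
```

For SketchUserFilter, both incrementToken() and next() are reported as overridden. A wrapper that prefers incrementToken() whenever it is implemented would therefore never call the user's next(), which is the silent misbehavior that making these methods final turns into a compile error.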

          For further changes to contrib token streams, the following procedure should be used:

              * rewrite the next(Token)/next() implementations, replacing them with the new API
              * if the class is final, no next(Token)/next() methods are needed (they must be removed!)
              * if the class is non-final, add the following methods to the class:
          {code:java}
                /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
                 * not be overridden. Delegates to the backwards compatibility layer. */
                public final Token next(final Token reusableToken) throws java.io.IOException {
                  return super.next(reusableToken);
                }

                /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
                 * not be overridden. Delegates to the backwards compatibility layer. */
                public final Token next() throws java.io.IOException {
                  return super.next();
                }
          {code}
          In this case the incrementToken() method must also be final
          (as must the new end() method from LUCENE-1448).
          Michael Busch made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Attachment LUCENE-1693-TokenizerAttrFactory.patch [ 12414506 ]
          Mark Miller made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12465979 ] Default workflow, editable Closed status [ 12563083 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563083 ] jira [ 12584102 ]
          Gavin made changes -
          Link This issue blocks LUCENE-1460 [ LUCENE-1460 ]
          Gavin made changes -
          Link This issue is depended upon by LUCENE-1460 [ LUCENE-1460 ]

            People

            • Assignee:
              Michael Busch
              Reporter:
              Michael Busch