Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: None
- Lucene Fields: New, Patch Available
Description
This patch makes the following improvements to AttributeSource and
TokenStream/Filter:
- introduces interfaces for all Attributes. The corresponding
implementations have the suffix 'Impl', e.g. TermAttribute and
TermAttributeImpl. AttributeSource now has a factory for creating
the Attribute instances; the default implementation looks for
implementing classes with the suffix 'Impl'. Token now implements
all 6 TokenAttribute interfaces (see the consumption sketch after this list).
- new method added to AttributeSource:
addAttributeImpl(AttributeImpl). Using reflection, it walks up the
class hierarchy of the passed-in object and finds all interfaces
that the class or its superclasses implement and that extend the
Attribute interface. It then adds the interface->instance mappings
to the attribute map for each of the found interfaces.
- removes the set/getUseNewAPI() methods (including the standard
ones). Instead, it is now enough to implement only the new API;
if an old TokenStream still implements the old API (next()/next(Token)),
it is wrapped automatically. The delegation path is determined via
reflection (the patch determines which of the three methods was
overridden).
- Token is no longer deprecated; instead it implements all 6 standard
token interfaces (see above). The wrapper for next() and next(Token)
uses this to automatically map all attribute interfaces to one
TokenWrapper instance (implementing all 6 interfaces) that contains
a Token instance. next() and next(Token) exchange the inner Token
instance as needed. For the new incrementToken(), only one
TokenWrapper instance is visible, delegating to the current reusable
Token. This API also preserves custom Token subclasses that may be
created by very special token streams (see the example in the Backwards-Test).
- AttributeImpl now has a default implementation of toString() that uses
reflection to print out the values of the attributes in a default
format. This makes it a bit easier to implement AttributeImpl,
because toString() was declared abstract before.
- Cloning is now done much more efficiently in
captureState. The method figures out which unique AttributeImpl
instances are contained as values in the attributes map, because
those are the ones that need to be cloned. It creates a singly
linked list that supports deep cloning (in the inner class
AttributeSource.State). AttributeSource keeps track of when this
state changes, i.e. whenever new attributes are added to the
AttributeSource. Only in that case will captureState recompute the
state; otherwise it will simply clone the precomputed state and
return the clone. restoreState(AttributeSource.State) walks the
linked list and uses the copyTo() method of AttributeImpl to copy
all values over into the attributes that the restoring stream
(e.g. SinkTokenizer) uses (see the caching sketch below).
- Tee- and SinkTokenizer were deprecated, because they use
Token instances for caching. This is not compatible with the new API
that uses AttributeSource.State objects. You can still use the old
deprecated ones, but new features provided by new Attribute types
may get lost in the chain. The replacement is the new TeeSinkTokenFilter,
which has a factory to create new Sink instances that have compatible
attributes. Sink instances created by one Tee can also be added to
another Tee, as long as the attribute implementations are compatible
(it is not possible to add a sink from a tee using one Token instance
to a tee using the six separate attribute impls); in that case an
UnsupportedOperationException (UOE) is thrown.
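To make the new consumption pattern from the first two points concrete, here is a minimal sketch (not part of the patch; the class name NewApiConsumerExample is made up, and it assumes the 2.9-era non-generic addAttribute(Class) signature, which requires a cast):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NewApiConsumerExample {
  public static void main(String[] args) throws IOException {
    TokenStream stream = new WhitespaceTokenizer(new StringReader("the quick brown fox"));
    // addAttribute() returns the interface; the default factory instantiates
    // the matching *Impl class (e.g. TermAttributeImpl) behind the scenes.
    TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
    OffsetAttribute offsetAtt = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(termAtt.term() + " [" + offsetAtt.startOffset()
          + "-" + offsetAtt.endOffset() + "]");
    }
    stream.close();
  }
}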
Cloning performance can be greatly improved if multiple
AttributeImpl instances are not used in one TokenStream. A user can
e.g. simply add a Token instance to the stream instead of the individual
attributes, or implement a subclass of AttributeImpl that
implements exactly the Attribute interfaces needed. I think
addAttributeImpl should be considered an expert API, as this manual
optimization is only needed if cloning performance is crucial. I ran
some quick performance tests using Tee/Sink tokenizers (which do
cloning), and the performance was roughly 20% faster with the new
API. I'll run some more performance tests and post more numbers then.
Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, serialization
should benefit even more significantly from the new API
than cloning does.
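As a sketch of how a stream can use captureState()/restoreState() for caching (the filter name is hypothetical; the pattern mirrors the State mechanism described above, using Java 1.4-style raw collections to match the codebase of that era):

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

/** Hypothetical filter that caches all tokens of the input stream
 *  as AttributeSource.State snapshots instead of Token clones. */
public class StateCachingExampleFilter extends TokenFilter {
  private final List cache = new LinkedList();
  private Iterator iterator = null;

  public StateCachingExampleFilter(TokenStream input) {
    super(input);
  }

  public final boolean incrementToken() throws IOException {
    if (iterator == null) {
      // first call: consume the whole input, capturing one State per token;
      // captureState() deep-clones only the unique AttributeImpl instances
      while (input.incrementToken()) {
        cache.add(captureState());
      }
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return false; // cache exhausted
    }
    // replay: copyTo() writes the captured values back into this
    // stream's attributes
    restoreState((AttributeSource.State) iterator.next());
    return true;
  }
}

If cloning speed is critical for such a stream, one could, as described above, register a single combined implementation up front (e.g. addAttributeImpl(new Token())), so that each captureState() call has only one instance to clone.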
This issue contains one backwards-compatibility break:
TokenStreams/Filters/Tokenizers should normally be final
(see LUCENE-1753 for the explanation). Some of these core classes are
not final, so one could override the next() or next(Token) methods.
In this case, the backwards-compatibility wrapper would automatically use
incrementToken(), because it is implemented, so the overridden
method would never be called. To prevent users from errors not visible
during compilation or testing (the streams would just behave wrongly),
this patch makes all implementation methods final
(next(), next(Token), incrementToken()) whenever the class
itself is not final. This is a BW break, but users will clearly see
that they have done something unsupported and should instead
create a custom TokenFilter with their additional implementation
(instead of extending a core implementation).
For converting further contrib token streams, the following procedure should be used:
- rewrite and replace the next(Token)/next() implementations with the new API
- if the class is final, no next(Token)/next() methods are needed (they must be removed!)
- if the class is non-final, add the following methods to the class:
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 * not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 * not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
Also, the incrementToken() method must be final in this case
(as must the new end() method of LUCENE-1448).
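For illustration, a sketch of what a converted non-final filter could look like after this procedure (MinLengthExampleFilter and its minLength parameter are invented for the example; the final modifiers and the delegation methods follow the pattern above):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Hypothetical converted contrib filter that drops terms shorter than minLength. */
public class MinLengthExampleFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private final int minLength;

  public MinLengthExampleFilter(TokenStream input, int minLength) {
    super(input);
    this.minLength = minLength;
    this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  // the new-API implementation; final because the class is non-final
  public final boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.termLength() >= minLength) {
        return true;
      }
    }
    return false;
  }

  /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
   * not be overridden. Delegates to the backwards compatibility layer. */
  public final Token next(final Token reusableToken) throws IOException {
    return super.next(reusableToken);
  }

  /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
   * not be overridden. Delegates to the backwards compatibility layer. */
  public final Token next() throws IOException {
    return super.next();
  }
}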
Attachments
Issue Links
- blocks
  - LUCENE-1696 Added New Token API impl for ASCIIFoldingFilter (Closed)
- is depended upon by
  - LUCENE-1460 Change all contrib TokenStreams/Filters to use the new TokenStream API (Closed)
- relates to
  - LUCENE-1695 Update the Highlighter to use the new TokenStream API (Closed)
  - LUCENE-1697 MoreLikeThis should use the new Token API (Closed)