Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Lucene Fields: New, Patch Available
Description
This patch makes the following improvements to AttributeSource and
TokenStream/Filter:
- introduces interfaces for all Attributes. The corresponding
implementations have the postfix 'Impl', e.g. TermAttribute and
TermAttributeImpl. AttributeSource now has a factory for creating
the Attribute instances; the default implementation looks for
implementing classes with the postfix 'Impl'. Token now implements
all 6 TokenAttribute interfaces.
- new method added to AttributeSource:
addAttributeImpl(AttributeImpl). Using reflection it walks up the
class hierarchy of the passed-in object and finds all interfaces
that the class or its superclasses implement and that extend the
Attribute interface. It then adds the interface->instance mappings
to the attribute map for each of the found interfaces.
- removes the set/getUseNewAPI() methods (including the standard
ones). Instead it is now enough to implement only the new API;
if an old TokenStream still implements the old API (next()/next(Token)),
it is wrapped automatically. The delegation path is determined via
reflection (the patch determines which of the three methods was
overridden).
- Token is no longer deprecated; instead it implements all 6 standard
token interfaces (see above). The wrapper for next() and next(Token)
uses this to automatically map all attribute interfaces to one
TokenWrapper instance (implementing all 6 interfaces) that contains
a Token instance. next() and next(Token) exchange the inner Token
instance as needed. For the new incrementToken(), only one
TokenWrapper instance is visible, delegating to the current reusable
Token. This API also preserves custom Token subclasses that may be
created by very special token streams (see example in the backwards-compatibility test).
- AttributeImpl now has a default implementation of toString that uses
reflection to print out the values of the attributes in a default
formatting. This makes it a bit easier to implement AttributeImpl,
because toString() was declared abstract before.
- Cloning is now done much more efficiently in
captureState. The method figures out which unique AttributeImpl
instances are contained as values in the attributes map, because
those are the ones that need to be cloned. It creates a single
linked list that supports deep cloning (in the inner class
AttributeSource.State). AttributeSource keeps track of when this
state changes, i.e. whenever new attributes are added to the
AttributeSource. Only in that case will captureState recompute the
state, otherwise it will simply clone the precomputed state and
return the clone. restoreState(AttributeSource.State) walks the
linked list and uses the copyTo() method of AttributeImpl to copy
all values over into the attribute that the source stream
(e.g. SinkTokenizer) uses.
- Tee- and SinkTokenizer were deprecated, because they use
Token instances for caching. This is not compatible with the new API
using AttributeSource.State objects. You can still use the old
deprecated ones, but new features provided by new Attribute types
may get lost in the chain. The replacement is a new TeeSinkTokenFilter,
which has a factory to create new Sink instances that have compatible
attributes. Sink instances created by one Tee can also be added to
another Tee, as long as the attribute implementations are compatible
(it is not possible to add a sink from a tee using one Token instance
to a tee using the six separate attribute impls); in that case a UOE is thrown.
Cloning performance can be greatly improved if a TokenStream does
not use multiple AttributeImpl instances. A user can
e.g. simply add a Token instance to the stream instead of the individual
attributes, or implement a subclass of AttributeImpl that
implements exactly the Attribute interfaces needed (a sketch follows below). I think this
should be considered an expert API (addAttributeImpl), as this manual
optimization is only needed if cloning performance is crucial. I ran
some quick performance tests using Tee/Sink tokenizers (which do
cloning) and the performance was roughly 20% faster with the new
API. I'll run some more performance tests and post more numbers then.
Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, the
serialization should benefit even more significantly from the new API
than cloning does.
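For illustration, a rough sketch of such an expert optimization under the API described above (the class name MyTokenizer is invented, the stub body is not real tokenization logic, and the package names assume the final attribute layout):

  import java.io.IOException;
  import java.io.Reader;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  public final class MyTokenizer extends Tokenizer {
    private final TermAttribute termAtt;
    private final OffsetAttribute offsetAtt;

    public MyTokenizer(Reader input) {
      super(input);
      // Token implements all six standard attribute interfaces; addAttributeImpl
      // walks its interfaces and maps each of them to this single instance, so
      // captureState() later only has to clone one object.
      addAttributeImpl(new Token());
      termAtt = (TermAttribute) addAttribute(TermAttribute.class);
      offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
    }

    public boolean incrementToken() throws IOException {
      // a real tokenizer would fill termAtt/offsetAtt from 'input' here
      return false; // end of stream in this stub
    }
  }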
This issue contains one backwards-compatibility break:
TokenStreams/Filters/Tokenizers should normally be final
(see LUCENE-1753 for the explanation). Some of these core classes are
not final, so one could override the next() or next(Token) methods.
In this case, the backwards wrapper would automatically use
incrementToken(), because it is implemented, so the overridden
method would never be called. To prevent users from errors not visible
during compilation or testing (the streams just behave wrong),
this patch makes all implementation methods final
(next(), next(Token), incrementToken()) whenever the class
itself is not final. This is a BW break, but users will clearly see
that they have done something unsupported and should instead
create a custom TokenFilter with their additional implementation
(rather than extending a core implementation).
For converting the remaining contrib token streams the following procedure should be used (a sketch of such a rewrite follows below):
- rewrite and replace the next(Token)/next() implementations with the new API
- if the class is final, no next(Token)/next() methods are needed (they must be removed!)
- if the class is non-final, add the following methods to the class:

  /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
   * not be overridden. Delegates to the backwards compatibility layer. */
  public final Token next(final Token reusableToken) throws java.io.IOException {
    return super.next(reusableToken);
  }

  /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
   * not be overridden. Delegates to the backwards compatibility layer. */
  public final Token next() throws java.io.IOException {
    return super.next();
  }

Also the incrementToken() method must be final in this case (as must the new end() method of LUCENE-1448).
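To make the procedure concrete, here is a minimal sketch of a filter converted to the new API (a hypothetical final filter, so the old methods are simply removed; the class name and attribute package are assumptions, not part of the patch):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  public final class MyLowerCaseFilter extends TokenFilter {
    private final TermAttribute termAtt;

    public MyLowerCaseFilter(TokenStream input) {
      super(input);
      termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    // the class is final, so only the new API is implemented; next() and
    // next(Token) are inherited from TokenStream and served by the
    // backwards compatibility layer
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      final char[] buffer = termAtt.termBuffer();
      final int length = termAtt.termLength();
      for (int i = 0; i < length; i++) {
        buffer[i] = Character.toLowerCase(buffer[i]);
      }
      return true;
    }
  }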
Attachments
- lucene-1693.patch (224 kB) - Michael Busch
- lucene-1693.patch (194 kB) - Michael Busch
- lucene-1693.patch (174 kB) - Michael Busch
- lucene-1693.patch (111 kB) - Michael Busch
- LUCENE-1693.patch (219 kB) - Uwe Schindler
- LUCENE-1693.patch (211 kB) - Uwe Schindler
- LUCENE-1693.patch (199 kB) - Uwe Schindler
- LUCENE-1693.patch (196 kB) - Uwe Schindler
- LUCENE-1693.patch (172 kB) - Uwe Schindler
- LUCENE-1693.patch (172 kB) - Uwe Schindler
- LUCENE-1693.patch (172 kB) - Uwe Schindler
- LUCENE-1693.patch (142 kB) - Uwe Schindler
- LUCENE-1693.patch (140 kB) - Uwe Schindler
- LUCENE-1693.patch (125 kB) - Uwe Schindler
- LUCENE-1693.patch (126 kB) - Uwe Schindler
- LUCENE-1693.patch (109 kB) - Uwe Schindler
- LUCENE-1693.patch (109 kB) - Uwe Schindler
- LUCENE-1693.patch (106 kB) - Uwe Schindler
- LUCENE-1693.patch (96 kB) - Uwe Schindler
- LUCENE-1693-TokenizerAttrFactory.patch (1 kB) - Uwe Schindler
- PerfTest3.java (4 kB) - Uwe Schindler
- TestAPIBackwardsCompatibility.java (17 kB) - Michael Busch
- TestCompatibility.java (8 kB) - Michael Busch
- TestCompatibility.java (8 kB) - Michael Busch
- TestCompatibility.java (3 kB) - Michael Busch
- TestCompatibility.java (3 kB) - Michael Busch
Issue Links
- blocks: LUCENE-1696 Added New Token API impl for ASCIIFoldingFilter (Closed)
- is depended upon by: LUCENE-1460 Change all contrib TokenStreams/Filters to use the new TokenStream API (Closed)
- relates to: LUCENE-1695 Update the Highlighter to use the new TokenStream API (Closed)
- relates to: LUCENE-1697 MoreLikeThis should use the new Token API (Closed)
Activity
Why do you add a new class "SmallToken"? I think it should be the good old deprecated "Token".
In my opinion, I would not deprecate the old Token, instead the default factory should always create Token instead of all these default *Impl attributes.
What was your conclusion about my idea yesterday, to pass the Token around even with the old API and copy it on demand if the value returned by next() is not the same instance? In this case, Token should be the only implementation of all standard attributes.
Why do you add a new class "SmallToken"? I think it should be the good old deprecated "Token".
I should have added a "no commit" comment to SmallToken. I just used it for performance tests, we don't have to commit it. It's an example of what an advanced user could do to speed up cloning. Token is still there and unchanged.
In my opinion, I would not deprecate the old Token, instead the default factory should always create Token instead of all these default *Impl attributes.
Yeah, we shouldn't deprecate Token. Not sure about the default. It kind of depends on the use case. Cloning and memory usage are faster, I think, with only two or three Attributes; compared to using all of them, cloning the Token is faster. But it'd be simpler to not have all the Impl classes... hmm, not sure...
What was your conclusion about my idea yesterday, to pass the Token around even with the old API and copy it on demand
I don't think the indexer should know at all about Token? It should only use the interfaces so that we can maintain the full flexibility.
Also I don't really like the fact very much that some user might get a performance hit. I had the idea of throwing the exception in incrementToken() to automatically be able to fall back to the old API. I think this is nice and gets rid of the explicit useNewAPI methods. The only drawback is still the fact that we have to implement both the old and new APIs in Lucene's tokenizers and filters until we remove the old API. Grant won't like this, but I think this is better than the possible performance hit? Also we don't add new filters THAT often to Lucene, and implementing both methods is often mostly copy&paste.
Do you like the idea with the UnsupportedOperationException?
What was your conclusion about my idea yesterday, to pass the Token around even with the old API and copy it on demand
I don't think the indexer should know at all about Token? It should only use the interfaces so that we can maintain the full flexibility.
Also I don't really like the fact very much that some user might get a performance hit. I had the idea of throwing the exception in incrementToken() to automatically be able to fall back to the old API. I think this is nice and gets rid of the explicit useNewAPI methods. The only drawback is still the fact that we have to implement both the old and new APIs in Lucene's tokenizers and filters until we remove the old API. Grant won't like this, but I think this is better than the possible performance hit? Also we don't add new filters THAT often to Lucene, and implementing both methods is often mostly copy&paste.
I did not mean to use Token in the indexer; I would like to remove the whole old API from the indexer (even the UOE). My idea would be the following:
- Let the default interfaces be implemented by Token; the default factory creates it for all requests for the default attributes.
- If somebody implements the new API, the indexer can use it without problems. If he doesn't, the default impl in TokenStream would call next(Token), using the Token instance from the AttributeSource. If the method returns another Token instance (because it did not reuse, which is seldom I think), copy the returned token into the per-instance AttributeSource Token.
- The other way round, if one uses a TokenStream with the old API (own indexer, query parser, ...) but the TokenStream only implements the new API, the deprecated old method would also have a default impl, ignoring the token supplied to next() and always returning the instance token after calling incrementToken().
Because of this, the indexer would always use the new API; the old API is wrapped. Core analyzers only need to implement the new methods (or could even stay with the old).
There are two problems:
- If somebody does not implement either method, the call to incrementToken() will loop endlessly; there should be some early break with a UOE in this case. I do not know how to implement this without inspecting the stack trace in the default methods.
- Filters are a bit tricky: they could, per default, pass both methods through to the delegate. The problem: if one only implements one of the methods, the other passes down to the delegate and does nothing. But this could be fixed by always delegating in the deprecated old method to the new one (or vice versa).
Hope this was clear, maybe I should create a patch.
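For illustration, a hypothetical sketch of the wrapping described above, written outside of TokenStream itself (the class and field names are invented; Token.reinit(Token) is the copying method mentioned later in this thread):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;

  public abstract class OldApiAdapter {
    // the per-stream Token instance that would also be registered as the
    // implementation of all standard attribute interfaces
    protected final Token token = new Token();

    /** the old API, still implemented by existing streams */
    public abstract Token next(Token reusableToken) throws IOException;

    /** proposed default: wrap the old API and copy only if the producer
     *  did not reuse the passed-in token */
    public boolean incrementToken() throws IOException {
      Token returned = next(token);
      if (returned == null) {
        return false; // end of stream
      }
      if (returned != token) {
        token.reinit(returned); // different instance: copy its values over
      }
      return true;
    }
  }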
- If somebody implements the new API, the indexer can use it without problems. If he doesn't, the default impl in TokenStream would call next(Token), using the token instance from AttributeSource. If the method returns with another Token instance (because it did not reuse, which is seldom I think), copy this returned token into the per instance AttributeSource Token.
What if you currently have a filter that caches the Token instance passed into next(Token) and returns a different one? If you then copy the values from the returned token into the per-instance attribute Token, you alter the cached one, because it's the same instance.
Not sure if this is common or even allowed, but these are the side effects I'm worried about.
What if you currently have a filter that caches the Token instance passed into next(Token) and returns a different one? If you then copy the values from the returned token into the per-instance attribute Token, you alter the cached one, because it's the same instance.
Not sure if this is common or even allowed, but these are the side effects I'm worried about.
See:
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/TokenStream.html
It states in next(Token): Also, the producer must make no assumptions about a Token after it has been returned: the caller may arbitrarily change it. If the producer needs to hold onto the token for subsequent calls, it must clone() it before storing it. Note that a TokenFilter is considered a consumer.
So I see no problem with this.
I have a couple of TokenFilters that work that way - get a Token from the wrapped TokenStream, cache it (cloning) and return another Token instead. Especially as the Tokenizer fills the Token w/ characters, I think that cloning is the only option if you want to hold on to a Token that was returned by a wrapped TokenStream.
If you clone, you would not run into the mentioned problem. The reusable Token passed to next() is owned by the caller.
But the additional copying would affect performance in Shai's case, and the same goes for CachingTokenFilter and the Tee/Sink filter/tokenizer. I no longer think it's such a corner case to not return the reusable token.
From a user's perspective, the UnsupportedOperationException solution and your solution should not be distinguishable. The advantage of your solution is cleaner, less cluttered code, and it removes the burden from developers to implement both APIs in the transition period. The disadvantage is a possible performance hit in the cases mentioned above.
The advantage of the UOE solution is no performance hit for users of the old API; the disadvantage is the need to implement both APIs.
Even though I think your solution is very elegant, Uwe, I'm in favor of the one that doesn't affect performance significantly. What do you or others think about the possible performance hit? Not a big deal? Or worth keeping the old code around for a while?
Just linking some of these related issues together such that we can make sure we cover all of the bases with this new Token Stream stuff
What do you or others think about the possible performance hit?
I haven't been able to keep up with this, but this one caught my eye!
I think performance of analysis, especially the common case (all my tokenizers are new, I'm not cloning, etc.) is important. It sounds like this performance hit was specifically on cloning... but still we should try not to lose performance if we can help it.
I think cloning is more prevalent than people think. For instance, the RemoveDuplicatesTokenFilter in Solr is in the default schema and is thus used, albeit naively, by a whole lot of people. Furthermore, I've seen quite a few apps that need to buffer tokens before spitting them out, things like phrase detection, n-grams, entity extraction, sentence detection, part of speech detection, parsing, etc.
Perhaps I'm missing something, but I don't understand what is the performance hit that's been discussed. If you don't need to clone a Token in order to cache it, don't do it and I agree we shouldn't require it or anything.
In my scenario, I do need to cache a Token, and then on subsequent calls I use it to generate new Tokens. In next(Token) I populate the given Token with information from the cached Token. That goes on until it's consumed, and then I call super.next(Token) and copy its content into the cached Token (I don't use clone() since that allocates a new Token, which I don't need).
I don't suffer in my scenario, I'm quite happy w/ how it works and expect to be able to do the exact same thing when I switch to the new API. If that will be the case, then it's fine with me.
Perhaps I'm missing something, but I don't understand what is the performance hit that's been discussed.
We're discussing here basically how we want to treat backwards compatibility for the old API. In the patch attached here, the default implementation of TokenStream.incrementToken() throws an UnsupportedOperationException. If the consumer, i.e. the indexer, hits that exception, it knows to fall back to the old API. This approach has the same disadvantage as trunk currently has: for all streams/filters in core and contrib we have to implement both the old next(Token) and the new incrementToken().
To avoid this disadvantage Uwe is proposing an elegant change: each TokenStream owns a reusableToken, which is shared with the consumer (indexer). That's easily possible, because the patch here changes Token to implement all TokenAttribute interfaces. The default implementation of incrementToken() then calls next(Token) and compares the Token that next(Token) returns with the reusableToken. If it's not identical, then we need to copy all values from the returned Token into the reusableToken, so that the consumer gets the values. This additional copying is what concerns me in terms of performance.
Currently e.g. SinkTokenizer always returns a different Token, which means we would always have to copy. We could change SinkTokenizer to not clone, but to call Token.reinit(), which copies. But I think there might be a significant number of filters out there that do similar things.
Grant: But with the new TokenStream API, you must always clone the attributes before caching, because the TokenFilter only has one final "working" instance per Attribute whose contents change all the time.
In this case it would not be better than wrapping next(Token) with the new API, feeding in the current instance token, and copying it back into the instance one if next(Token) returns a different instance.
Currently e.g. SinkTokenizer always returns a different Token, which means we would always have to copy. We could change SinkTokenizer to not clone, but call Token.reinit(), which copies. But I think there might be a significant amount of filters out there that do similar things.
If SinkTokenizer does this, it would not be very fast even with the old API, because it creates a lot of new instances instead of reusing them. So the problem here is in SinkTokenizer.
Thanks Michael. Given that, I don't think my private scenario will be affected in terms of performance, since my cached token is a different instance than the one that's passed down the filters, and I just populate the instance passed down the filters with data from the cached token. So besides me copying char[], the consumer will still share the same instance.
Perhaps we should document that as a best practice or something, in light of this proposal?
Right, SinkTokenizer is only really cost effective when the proportion of Tokens that are "teed" is roughly less than 1/2, according to my basic tests. After that, just doing a copy-field approach is fine.
I wanted to add one additional advantage of my proposal:
With it, it is possible (if correctly implemented on the filter side; I must think about it one more time) to mix filters and TokenStreams together regardless of whether they implement the new or old API. With the UOE or the current trunk solution, all filters/streams in the chain must share the same useNewAPI setting or the same implementation state!
I would suggest adding a note to the JavaDocs of the deprecated next(Token) method that, for optimal performance, it should always return the given Token or null.
The CachingTokenFilter and SinkTokenizer released in 2.4 do the cloning, so it was kind of our recommended implementation. Chances are that people have similar implementations. They would see a possible performance hit. Do you think that's ok, Uwe?
The problem goes further:
If users move from the old Token API to the new one, they will get the same performance decrease. We only have one token instance per tokenizer chain with the new API, so they must also copy the values into their cache and back, or think about a completely new implementation of what they have done before.
I will post a patch soon (based on your patch), that implements my thoughts and we could compare.
Also what happens if a user starts using the new API and has a TokenStream that adds a different implementation of one or more of the TokenAttribute interfaces?
I think for this case you're proposing that the TokenStream will always add Token right away to the AttributeSource? But that takes away some of the flexibility of the new API?
If the users move from the old Token API to the new one, they will get the same performance decrease.
I'm seeing a ~20% performance gain using the new approach with one Token instance, because the new implementation clones once in captureState() and then copies in restoreState() using copyTo(), which for Token is implemented as reinit(), which copies all members.
The old approach of SinkTokenizer and CachingTokenFilter clones twice. The second clone could be avoided by switching to reinit() as well. The old API allows you to do both (double-cloning and clone/copy); the new API forces you to use the more efficient approach.
Also what happens if a user starts using the new API and has a TokenStream that adds a different implementation of one or more of the TokenAttribute interfaces?
You could use the initialize() method I put in there, and a user can override it to change the default behavior of putting a Token into the AttributeSource.
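As an aside, a minimal sketch of the clone-once/copy-back pattern described above, in the form of a hypothetical buffering filter (the class name is invented; it assumes the captureState()/restoreState() methods from this patch):

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.LinkedList;
  import java.util.List;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.util.AttributeSource;

  public final class BufferingFilter extends TokenFilter {
    private final List cache = new LinkedList(); // of AttributeSource.State
    private Iterator replay;

    public BufferingFilter(TokenStream input) {
      super(input);
    }

    public boolean incrementToken() throws IOException {
      if (replay == null) {
        // first call: consume the input and clone each state exactly once
        while (input.incrementToken()) {
          cache.add(captureState());
        }
        replay = cache.iterator();
      }
      if (!replay.hasNext()) {
        return false;
      }
      // copy the cached values back into the shared attribute instances
      restoreState((AttributeSource.State) replay.next());
      return true;
    }
  }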
Hello Michael,
I played around a little bit. This patch implements my proposal:
- removes all deprecated API and the corresponding wrapper from DocInverter and QueryParser.
- The default implementation of each of the three possible methods in TokenStream calls the next best one, wrapping the tokens with cloning, as described. Please note: as before, if one extends TokenStream/TokenFilter but does not override any of these methods, indexing will loop and overflow the stack. I could add a check in initialize() that tests whether at least one of the methods is overridden (which is a runtime check). An example of how this could be done is currently commented out (it was used for some other test).
- I played around a bit, trying to only register the "big" Token instance in the stream if one does not override incrementToken (see the code commented out). But the problem is that this is then backwards compatible only in one direction: consumers calling incrementToken always succeed, but some outdated consumers calling next() or next(Token) then fail, because the instance may not be a Token if all producers implement incrementToken and TokenStreams do not register Token as the impl. Because this does not work, the TokenStreams and filters register Token as the impl and use it for wrapping. Because of this, the simple TokenAttribute impls are now useless.
- There is currently no possibility to change the default Token impl; I will think about it during the night!
I tested the attached patch with core (all tests pass, so "new" token streams work correctly) and contrib/analyzers (all tests pass, so old token streams also work correctly). You can even use mixed impls (not fully tested yet, but you can have e.g. a new-style TokenStream and filter it with an old-style TokenFilter).
With this patch, all core token streams could be updated to only implement the new API and remove all next(Token) impls.
Uwe
I haven't reviewed the patch yet, but I have a quick question:
is your patch backwards-compatible if a user has a TokenStream or -Filter which returns a custom subclass of Token, and then another one in the chain casts to that subclass? Note that Token is not final. Also not sure how common this scenario is, it just came to my mind.
Also, can a user still use the AttributeFactory to use something else but Token?
is your patch backwards-compatible if a user has a TokenStream or -Filter which returns a custom subclass of Token? And then another one in the chain casts to that subclass? Note that Token is not final. Also not sure how common this scenario is, just came to my mind.
This is no problem; I can explain:
Returning something other than Token from next() only makes sense if the direct consumer can handle it. So a TokenStream that returns these tokens must then be wrapped by a TokenFilter that can handle these Tokens. If there were any other TokenFilter in between, it is not guaranteed that this TokenFilter also returns the special Token. When you have this direct relation, the calls from the filter to the producer's methods go directly, without wrapping (because both implement the old API). At the point where the indexer consumes the TokenFilter on top, the custom Token is uninteresting and can safely be copied into a conventional Token (which is done because nextToken != attributeInstance).
Also, can a user still use the AttributeFactory to use something else but Token?
As noted in my patch description: this is not possible. One can add additional attributes and use them in his chain (even when old filters are in between, which only handle the Token instance). TokenStream and TokenFilter create a Token() instance in initialize() and call addAttributeImpl(). After doing this, they check that all attributes are then subclasses of Token; getAttribute(TermAttribute) is called and the result cast to Token (which then should be the same instance).
One could change this behaviour by overriding initialize() in one of his classes, but then another TokenStream/Filter in the chain that also initializes will see that one of the instances is not a Token and will throw a RuntimeException. I tried everything, but to be able to handle both paths (old -> new API, new -> old API), TokenStream and TokenFilter must have a Token instance. In 3.0 or later this can be removed and we will only use the factory to init the attributes.
In my opinion, this is not a problem, because one can still add custom attributes to his chain, and best of all: he can mix old and new token streams in one chain as he wants. The missing flexibility in modifying the instances of the token attributes is in my opinion not important (and one instance initializes faster than 5).
I quickly hacked a tool demonstrating my concerns.
Running the attached tool on trunk+my patch yields the following output:
new
tokenstream --> proper noun
api
The output is identical if the tool is run on Lucene 2.4.
Running the same tool on trunk + Uwe's patch yields:
new
tokenstream
api
This tool might not make much sense, but it shows in what unexpected ways people might use these APIs. It doesn't break API compatibility, but changes runtime behavior - in this case if users have their own subclasses of Token.
Doesn't this mean that we need to change all our core TokenStream/TokenFilter impls to use clone() instead of instantiating a Token themselves (i.e., using one of its ctors)? If we use clone(), you shouldn't experience that problem, because the type of the cloned Token will be your subclass, and any core filter/stream can still work with Token and just use its methods.
We can also add a newToken() method to Token, and let the Token extensions return a new Token instance of their type. That would be like clone(), only it won't copy any characters or initialize fields.
If we think Token should be extended, I think we should also add proper documentation to TokenStream that mentions this possible way of using it and our recommended approach.
I don't think we really mention subclassing of Token in the documentation. We also certainly don't prevent it. The tool I wrote works fine with 2.4; if you add other filters to the chain it might not work anymore. But since we don't promise that subclassing of Token works everywhere, that's probably fine.
We're deprecating the old API anyway, so we shouldn't have to introduce new stuff to fully support subclassing Token.
My point here is just that this is a very complex API (even though it looks pretty simple). When I wrote the new TokenStream API patch at the end of last year I thought about all these possibilities for making backwards compatibility more elegant. But I wanted to be certain not to break any runtime behavior or affect performance negatively. Therefore I decided not to mess with the old API, but rather put the burden of implementing both APIs on the committers during the transition phase. I know this is somewhat annoying; on the other hand, how often do we really add new TokenFilters to the core? Often implementing incrementToken() takes 10 minutes if you already have next() implemented. Just copy&paste and change a few things.
But I'll definitely buy Uwe a beer if he comes up with solution that is more elegant and doesn't have the mentioned disadvantages!
Hi Michael,
in principle your test is invalid. It has other token filters in the chain, over which the user has no control. With the two mentioned filters it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken does not have to be used and that another one can be returned.
So this is really unsupported behaviour. If you remove the filters in between, it works correctly. And this could even fail with 2.4 if you put other token filters in your chain.
In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1.
The main advantage is that you can mix old and new filter instances and you lose nothing...
OK, what about this sentence in Token.java:
When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
This double-cloning is exactly what CachingTokenFilter and Tee/Sink do, so they preserve the actual Token class type.
You can easily construct an example similar to the tool I attached that uses these streams.
OK, I have a solution:
I write a wrapper class (a reference) that implements all token attribute interfaces but passes the calls down to the wrapped Token/subclass-of-Token. Instead of cloning the token when wrapping the return value of next(), I could simply put it into the wrapper. The instance stays the same; only the delegate is different. Outside users or TokenStreams using the new API will only see one instance that implements all interfaces.
(in principle the same as your backwards-compatibility thing in the DocInverter)
Would this be an idea?
For caching:
I guess you would have to implement the wrapper's clone() method such that it returns what delegate.clone() returns. This would put a clone of the original Token (or subclass) into the cache, instead of a clone of the wrapper, which is good. Then the second clone also clones the original Token again and puts it into a second wrapper that the CachingTokenStream owns. Hmm, complicated, but it should work.
Need to think more about whether all mixes of old and new TokenStreams would work... and whether this approach affects performance in any way or changes runtime behavior in corner cases...
Gosh, this is like running a huge backwards-compatibility junit test suite in my head every time we consider a different approach.
I think you should try it out and see if you run into problems. This should not be much code to write. You might have to do tricks with Tee/Sink, if the sink is wrapped by a filter with the new API, but the tee wraps a stream with the old API, or vice versa.
I think you should try it out and see if you run into problems. This should not be much code to write.
I am working on that, I have a meeting now, after that.
You might have to do tricks with Tee/Sink, if the sink is wrapped by a filter with the new API, but the tee wraps a stream with the old API, or vice versa.
This is currently working without any problems, but I want to add a test case that explicitly chains some dummy filters in deprecated and non-deprecated form and looks at what comes out. But it should work.
I am working on that, I have a meeting now, after that.
Good luck. I'm off to bed...
Attached is a new patch that implements the last idea:
- There is no more copying of Tokens, so the API should have (almost) the same speed as before.
- Per default, the chain of TokenStreams/TokenFilters can be mixed completely (a test that explicitly exercises this is still missing); the drawback is that there is only one attribute instance, called TokenWrapper (package private), that manages the exchange of the Token instance behind it.
- If the user knows that all tokenizers in his JVM implement incrementToken and do not fall back to next(), he can increase speed by using the static setter setOnlyUseNewAPI(true). In this case, no single TokenWrapper is initialized and the code will use the normal attribute factory to generate the attributes. If some old code is still around or your consumer calls next(), you will get a UOE during tokenization. The same happens if you override initialize() and instantiate your attributes manually without calling super.initialize().
- When the old API is removed, TokenWrapper and large parts inside TokenStream can be removed and incrementToken() made abstract. This is identical to setting onlyUseNewAPI to true.
- The API setting can only be static, because the attribute instances are generated during construction of the streams, so a later downgrade to TokenWrapper is not possible.
Documentation inside this patch enforces that at least all core tokenizers and consumers are conformant, so one must be able to set TokenStream.setOnlyUseNewAPI to true and then use StandardAnalyzer without any problem. When contrib is converted, we can extend this to contrib.
Because the code wraps the old API completely, all converted streams can be changed to implement only incrementToken() using attributes. The superclass's TokenStream.next() and next(Token) manage the rest. There is no speed degradation from this; it is safe to remove (and all will be happy)!
Uwe
Sorry, there was a small bug in cloning inside next(): the POSToken test was failing again. But now it also works correctly.
Hi Michael,
I have not done any performance tests so far; I think you have better knowledge about measuring tokenization performance. It would be important to compare the performance of:
- Old API with useNewAPI=true
- Old API with useNewAPI=false
- My impl with defaults (onlyUseNewAPI=false)
- My impl with onlyUseNewAPI=true
For all tests, you should only use conformant streams (e.g. from core).
A good additional test would be to create a chain that has completely implemented incrementToken() and one supplying only next() for some chain entries.
Is this hard to do?
You can run tokenize.alg which invokes the ReadTokenTask, which iterates on a TokenStream. You'll probably need to modify the .alg file to create a different analyzer/token stream each time, and I think this can be done by the "rounds" syntax in benchmark.
I'm looking at TokenStream.next():
  public Token next(final Token reusableToken) throws IOException {
    // We don't actually use reusableToken, but still add this assert
    assert reusableToken != null;
    checkTokenWrapper();
    return next();
  }

  /** Returns the next token in the stream, or null at EOS.
   * @deprecated The returned Token is a "full private copy" (not
   * re-used across calls to next()) but will be slower
   * than calling {@link #next(Token)} instead.. */
  public Token next() throws IOException {
    checkTokenWrapper();
    if (incrementToken()) {
      final Token token = (Token) tokenWrapper.delegate.clone();
      Payload p = token.getPayload();
      if (p != null) {
        token.setPayload((Payload) p.clone());
      }
      return token;
    }
    return null;
  }
This seems like a big performance hit for users of the old API, no? Now every single Token will be cloned, even if they implement next(Token), as soon as the users have one filter in the chain that doesn't implement the new API yet.
The code is almost identical to before, the old code also copied the token to make it a full private copy.
There are three modes of operation:
- if incrementToken is implemented, the DocInverter will use it (the code always calls incrementToken, so there is no indirection)
- if next(Token) is implemented, the DocInverter will call incrementToken, which is forwarded to next(Token), which is cheap
- if only next() is implemented, the DocInverter will call incrementToken, which forwards to next(Token), and this forwards to next(). But this is identical to before, with only one more indirection: the old code got useNewAPI(false) and called next(Token), which forwarded to next()
So for indexing using the normal indexing components (DocInverter), the code never clones more than with your code.
There is one other case: if you have an old consumer calling next(Token) and the tokenizer only implements incrementToken, then you will get a performance degradation. But this is not the indexing case; it is e.g. reusing the tokenizer in a very old consumer, e.g. a QueryParser. I did not find a good way to pass directly to incrementToken() for this special case. The problem is also that incrementToken uses the internal buffer and not the supplied buffer.
Ah, I understand the problem: as I said, if a consumer (like a filter) calls next(Token) on the underlying filter, and that filter does not implement it or only implements the new API, it will get a performance decrease because of cloning. I think we should simply test this with the benchmarker. Mixing old and new API is always a performance decrease.
Ah, I understand the problem: as I said, if a consumer (like a filter) calls next(Token) on the underlying filter, and that filter does not implement it or only implements the new API, it will get a performance decrease because of cloning. I think we should simply test this with the benchmarker. Mixing old and new API is always a performance decrease.
Yes, that's what I mean. But I think this will almost be the most common use case: I would think most users have chains that mix core streams/filters with custom filters. Also I assume most users who need high performance switched from next() to next(Token) by now. These users will see a performance degradation, which I predict will be similar to or worse than going back to using next(), unless they implement the new API in their filters right away.
So those users will see a performance hit if they just do a drop-in replacement of the lucene jar.
I could change the calling chain:
incrementToken() calls next(), which calls next(Token) - would this be better? next(Token) would per default set the delegate to the reusable token. Hmm - thinking about it: where is the degradation then?
I have a solution to build in some shortcuts:
In initialize() I use reflection (see the earlier patch) to find out which of the three methods is implemented (check if this.getClass().getMethod(name, params).getDeclaringClass() == TokenStream.class; when this is true, the method was not overridden).
In incrementToken() the method checks whether next(Token) or next() is implemented and calls it directly. The same in the other classes. Ideally, next() should then never be called.
I will post a patch later.
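A rough sketch of the reflection check described above (a hypothetical standalone helper, not the code from the patch): a method counts as implemented by a stream class if its declaring class is not TokenStream itself.

  import java.lang.reflect.Method;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;

  final class MethodOverrideCheck {
    // true if the concrete stream class overrides incrementToken()
    static boolean overridesIncrementToken(Class streamClass) {
      return declaringClassOf(streamClass, "incrementToken", new Class[0]) != TokenStream.class;
    }

    // true if the concrete stream class overrides next(Token)
    static boolean overridesNextWithToken(Class streamClass) {
      return declaringClassOf(streamClass, "next", new Class[] { Token.class }) != TokenStream.class;
    }

    private static Class declaringClassOf(Class streamClass, String name, Class[] params) {
      try {
        Method m = streamClass.getMethod(name, params);
        return m.getDeclaringClass();
      } catch (NoSuchMethodException e) {
        throw new IllegalArgumentException(name + " not found on " + streamClass.getName());
      }
    }
  }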
Should I wait to put in the Highlighter update till you guys are done here?
Should I wait to put in the Highlighter update till you guys are done here?
You can start with the highlighter; if this patch goes through, we can remove the next() methods from all tokenizers.
For consumers like the highlighter, there will be no need anymore to switch between the old and new API. Just use the new API; it will also work with old tokenizers.
I'm not convinced yet that we will be able to remove the implementations of next() and next(Token).
Mark, I'm not familiar with what changes you need to make to the highlighter, but you should not rely yet on the fact that next() and next(Token) won't have to be implemented anymore.
Here is my solution: the three default methods are now optimized to use the shortest path to the iteration method implemented by the subclass. The implemented iteration methods are determined by reflection in initialize().
Cloning is now only done if next() is directly called by a consumer; in all other cases the reusable token is used for passing the attributes around.
The new TokenStream also checks in initialize() that one of the "abstract" methods is overridden. Because of this, TestIndexWriter and the inverter singleton state were updated to at least have an empty incrementToken(). Because of this check, nobody can create a TokenStream that loops indefinitely after calling next() because no pseudo-abstract method was overridden. As incrementToken will be abstract in the future, it must always be implemented, and this is what I have done.
A slightly changed tool yields this on 2.4, and identically on trunk + my patch:
new
tokenstream --> proper noun
api
new
tokenstream --> proper noun
api
new
tokenstream
api
On trunk + your latest patch:
new
tokenstream --> proper noun
api
new
tokenstream
api
Exception in thread "main" java.lang.ClassCastException: org.apache.lucene.util.AttributeSource$State
  at org.apache.lucene.analysis.SinkTokenizer.next(SinkTokenizer.java:97)
  at org.apache.lucene.analysis.TestCompatibility.consumeStream(TestCompatibility.java:97)
  at org.apache.lucene.analysis.TestCompatibility.main(TestCompatibility.java:90)
It runs three tests. The first is good with your patch; the second doesn't seem to preserve the right Token subclass; the third throws a ClassCastException. I haven't debugged why...
You can probably fix CachingTokenFilter and tee/sink to behave correctly. But please remember that a user might have their own implementations of something like a CachingTokenFilter or tee/sink, which must keep working.
Btw: SinkTokenizer in my patch has a small bug too. I need to throw a UOE in incrementToken() if it was filled using the old API.
It should probably also throw a UOE when someone tries to fill it with both, old and new API streams. And that this is not allowed must be made clear in the javadocs.
Exactly: the problem is in SinkTokenizer. When calling next(Token) the result is cast to Token, which does not work (the iterator only contains either Tokens or States, depending on what was added). As SinkTokenizer and TeeTokenFilter may use different APIs, it crashes.
The problem with the test is that, depending on the chaining of old/new APIs, the iterator may contain the wrong type. This can be fixed by removing next(Token) (preferred) or incrementToken(). The problem is that, depending on the chaining, it is not clear which method is called, and the new/old API should not share the same state information.
Because the problem is related to the new/old API, we should simply remove the old API from both filters, so they share the same instances in all cases! Then we do not need the UOE.
I will look into it and check why the Token in the second test is not preserved.
The second test does not work because it always uses incrementToken per default.
By the way, the API docs and behaviour changed with these three classes, TeeTokenFilter, SinkTokenizer and CachingTokenFilter: e.g. getTokens() does not return what is documented. For backwards compatibility we should deprecate the current versions of these classes [and only let them implement next(Token)]. They can then be used even together with the new API, but they always work on Token instances. When I remove incrementToken from them, your test passes completely.
For the new API there should be new classes that use AttributeSource and restoreState to cache and so on.
But for current backwards compatibility (you mentioned somebody might have written a similar thing): if the user's class only uses next(Token), it will work as before. The problem is mixed implementations of the old/new API and different cache contents. This is not a problem of my proposal!
Again: we should remove the double implementations everywhere. In these special cases with caches, where the cache should contain a specific class (Tokens or AttributeSource.State), two classes are needed, one deprecated.
But: what do you think about my latest patch in general?
Some small updates before I go to sleep. This patch removes the incrementToken API from the three caching classes. It also fixes the double cloning of the payload in next() when the token is cloned directly.
There is still one small problem: your test (I hate it...) fails again if I remove next(Token) from StopFilter or LowerCaseFilter.
Sorry, the last patch was invalid (did not compile); I forgot to revert some changes before posting.
The attached patch still has problems in TeeTokenFilter, SinkTokenizer and CachingTokenFilter (see before), but fixes:
- double cloning of payloads
- the first of your tests works correctly, even if I remove next() from StopFilter and/or LowerCaseFilter
For backwards-compatiblility we should deprecate the current versions of these class [and only let them implement next(Token)].
I agree. With my patch the Tee/Sink stuff doesn't work in all situations either, when the new API is used. We need to deprecate tee/sink and write a new class that implements the same functionality with the new API.
OK, we can merge our patches then! At the moment I see no real show-stoppers with the current approach; have you tested thoroughly and measured performance? All tests from core and contrib/analyzers pass; the problems with your last TestCompatibility.java are Tee/Sink problems.
The interesting part (if we stay with my not-so-elegant-anymore solution because of the reflection hacks) would be to remove the deprecated next(Token) methods from the core streams, which would be a great code cleanup!
By the way, I also tested Solr's token streams after updating the Lucene jar file. All tests pass (only some unrelated ones fail because of the latest changes in Lucene trunk, and there are some compile failures because of changes in non-released APIs).
Solr's TokenStreams are all programmed with the old API, but they get inverted using incrementToken from our patch.
Also the Solr query parser seems to work.
Again an update: I unified the reusable tokens in TokenWrapper.delegate. Now it is always set after each action, so no state changes are left out.
By the way, I also tested Solr's token streams after updating the Lucene jar file. All tests pass (only some unrelated ones fail because of the latest changes in Lucene trunk, and there are some compile failures because of changes in non-released APIs).
Solr's TokenStreams are all programmed with the old API, but they get inverted using incrementToken from our patch.
Also the Solr query parser seems to work.
Did you look at the performance on this?
I only tested performance with the Lucene benchmarker on the various standard analyzers. After the patch, tokenizer.alg produces the same results as before in almost the same time (the time variations are bigger than the differences). With an unmodified benchmarker this is expected: the benchmarker's tokenizer task still calls the deprecated next(Token), and as all core analyzers still implement this directly, there is no wrapping. I modified the tested token streams and filters in core that were used, removed next(Token) and left only incrementToken() available; in this case the speed difference was also not measurable in my configuration (Thinkpad T60, Core Duo, Win32). I also changed some of the filters to implement next(Token) only and others to implement only incrementToken(), to get a completely mixed old/new API chain, and still got the same results (and the same tokenization results, as seen in the generated indexes for Wikipedia). I also changed the benchmarker to use incrementToken(), which was also fine.
To get a small speed increase (though I was not able to measure it), I changed all tokenizers to use only incrementToken for the whole chain and changed the benchmarker to also use this method. In this case I was able to set TokenStream.setOnlyUseNewAPI(true), which removed the backwards-compatibility wrapper and the Token instance, so the chain only used the unwrapped simple attributes. In my opinion, tokenization was a little bit faster, faster than without any patch using next(Token). When the old API is completely removed, this will be the default behaviour.
So I would suggest reviewing this patch, adding some tests for heterogeneous tokenizer chains, removing all next(...) implementations from all streams and filters, and only implementing incrementToken(). Contrib analyzers should then be rewritten to the new API only, without the old API.
The mentioned bugs with Tee/Sink are not caused by this patch, but they are more serious now, because the tokenizer chain is no longer fixed to one specific API variant (it supports both mixed together).
Sorry, Uwe, I was really busy today. I'll try to review the latest patch as soon as I can.
After committing TrieRange to core, here are some updates to the patch (TrieRange used setUseNewAPI in its tests).
This patch now also has all next(Token) implementations removed in core (excluding Tee, Sink and CachingTokenFilter).
During my tests I found a small problem with AttributeSource.getAttributesIterator() & Co.:
If you, for example, print TokenStream.toString(), but have all 6 small interfaces like TermAttribute, FlagsAttribute, ... implemented by the Token class, you will get the same string 6 times, because the iterator enumerates the Token instance 6 times. The same happens when cloning all attributes or doing a copyTo of all attributes, or in equals of AttributeSource: they are handled 6 times. So the list of attribute implementations should be unique. In principle, the instances should also be put into a LinkedHashSet in parallel to the interface -> implementation mapping, which is unique only on the interface side. getAttributesIterator() should then return the iterator of the implementation set. This would fix all the double handling and would be faster for our new standard case for backwards compatibility, where e.g. Token is copyTo()'ed 6 times in captureState and so on.
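A tiny sketch of the suggested bookkeeping (hypothetical class and field names, not code from the patch): keep a set of unique implementation instances next to the interface -> implementation map, so each AttributeImpl is visited only once.

  import java.util.Iterator;
  import java.util.LinkedHashMap;
  import java.util.LinkedHashSet;
  import java.util.Map;
  import java.util.Set;
  import org.apache.lucene.util.AttributeImpl;

  final class AttributeBookkeeping {
    private final Map attributes = new LinkedHashMap();     // interface -> impl
    private final Set attributeImpls = new LinkedHashSet(); // unique impls, insertion order

    void register(Class attInterface, AttributeImpl impl) {
      attributes.put(attInterface, impl);
      // a Token registered for six interfaces is added here only once
      attributeImpls.add(impl);
    }

    Iterator getAttributeImplsIterator() {
      return attributeImpls.iterator();
    }
  }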
How is this patch coming?
Do you think MemoryIndex will still need to be changed to check the TokenStream's API (à la the QueryParser), or is this going to clean it up?
Michael has not responded yet (I think he is busy).
For me this patch works perfectly: all tests pass and performance is the same as before. The big difference is that you do not need to implement both APIs in your TokenStreams (the current patch removes the old API completely from the core tokenizers). Consumers like the highlighter can simply use only the new API, but it will also work when calling the old API. And you can even mix old and new API TokenFilters in one chain.
In principle this is all we need. Maybe you can also do some tests!
By the way: Token is no longer deprecated; it is just an implementation of the six standard attributes in one class!
Ah, nice - I had to re-implement it for one or two spots, so it will be nice to just keep the old one.
That sounds great, Uwe - I had thought that's where this was going, but Michael seemed to have some doubts.
I'll try it out with the Highlighter changes I have made.
The big difference is, that you do not need to implement both APIs in your TokenStreams (the current patch removes the old API completely from core tokenizers).
That's a big back-compat violation though, right?
OK, another version of my hated compatibility patch. What I did was copy the exact 2.4 implementations of TeeTokenFilter and SinkTokenizer into the test class and use them in the test I previously had. It performs the test twice: using the old and the new API.
On my patch it produces:
Testing old API...
new
tokenstream --> proper noun
api
new
tokenstream --> proper noun
api
new
tokenstream
api
Testing new API...
Exception in thread "main" org.apache.lucene.analysis.TokenStream$UnsupportedTokenStreamAPIException: This stream or filter does not support the new Lucene 2.9 TokenStream API yet.
  at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:114)
  at org.apache.lucene.analysis.TestAPICompatibility.consumeStreamNewAPI(TestAPICompatibility.java:145)
  at org.apache.lucene.analysis.TestAPICompatibility.test2(TestAPICompatibility.java:119)
  at org.apache.lucene.analysis.TestAPICompatibility.main(TestAPICompatibility.java:83)
On your patch, Uwe:
Testing old API...
This TokenStream api
This TokenStream api
This TokenStream api
Testing new API...
This TokenStream api
This TokenStream api
This TokenStream api
The big difference is, that you do not need to implement both APIs in your TokenStreams (the current patch removes the old API completely from core tokenizers).
That's a big back-compat violation though, right?
No, it isn't. It is not completely removed; the abstract TokenStream emulates the old next() methods, so even if all core tokenizers do not implement next(), they still expose the next(Token) and next() methods from TokenStream.
Ah, okay. Sounds like some magic you have worked here, Uwe - great stuff. Very nice to not have to implement it twice everywhere.
Hi Michael,
this was a small bug in the translation next(Token) -> incrementToken(). If the call to incrementToken itself calls other filters that themselves delegate back to next(), the wrapper.delegate can change during the incrementToken call. I fixed this; now it works:
[junit] ------------- Standard Output ---------------
[junit]
[junit] Test old API...
[junit] new
[junit] tokenstream --> proper noun
[junit] api
[junit] new
[junit] tokenstream --> proper noun
[junit] api
[junit] new
[junit] tokenstream
[junit] api
[junit]
[junit] Test new API...
[junit] new
[junit] tokenstream --> proper noun
[junit] api
[junit] new
[junit] tokenstream --> proper noun
[junit] api
[junit] new
[junit] tokenstream
[junit] api
Are you happy now? I think Mark is also very interested in this and in fast progress.
In my opinion, only the new Tee/Sink/Cache impls are needed, plus the fix for the duplicate attributes in toString().
But I'll definitely buy Uwe a beer if he comes up with solution that is more elegant and doesn't have the mentioned disadvantages!
By the way: I want to have my beer!!!
Sorry, I don't want to be the bad guy here, I'm just trying to mention things that come to my mind. Maybe this one is negligible?
If a user has extensions of streams/filters that in 2.4 didn't declare next() or next(Token) as final, and they call super.next() or super.next(Token), then I think you'll get different behavior or exceptions. E.g.:
  private static class ExtendedSinkTokenizer extends SinkTokenizer {
    public Token next() throws IOException {
      Token t = super.next();
      // do something with t
      return t;
    }
  }
In your example, do you mean SinkTokenizer is from core and only implements incrementToken(), and the others are handled by the default in TokenStream (using this reflection-based redirect)?
I tested this by overriding LowerCaseFilter. In my latest patch LowerCaseFilter only implements incrementToken(); next() and next(Token) are handled by the forwarder in TokenStream. If you override next(Token) and consume this stream using incrementToken(), then your overridden next(Token) is never called.
This exact same backwards-compatibility problem was also there when we changed from next() to next(Token). If some core filter only implemented next(Token) and a subclass overrode only next() (because of an older code base) and called super() there, the same happened: the tokenizer code called next(Token), but next() was never called. You have the same problem with our recent deprecations in DocIdSetIterator.
I would not see this as a problem; this problem can be found everywhere in Lucene where we deprecated (originally abstract) methods and replaced them by newer ones calling the old ones as a default.
Not sure if we're talking about the same thing here. The problem is not that next() in the extended class doesn't get called. It gets called, but the super.next() call either throws a ClassCastException, when consuming using the old API, or produces a different result than before when consuming using the new API.
This is the output with your patch:
Testing old API...
new
tokenstream --> proper noun
api
new
tokenstream --> proper noun
api
new
tokenstream
api
new
tokenstream --> proper noun
api
(new,12,15) new
(tokenstream,16,27) tokenstream
(api,28,31) api
null
java.lang.ClassCastException: org.apache.lucene.util.AttributeSource$State
  at org.apache.lucene.analysis.SinkTokenizer.next(SinkTokenizer.java:97)
  at org.apache.lucene.analysis.TestCompatibility.consumeStreamOldAPI(TestCompatibility.java:194)
  at org.apache.lucene.analysis.TestCompatibility.test2(TestCompatibility.java:142)
  at org.apache.lucene.analysis.TestCompatibility.main(TestCompatibility.java:87)
Testing new API...
new
tokenstream --> proper noun
api
new
tokenstream --> proper noun
api
new
tokenstream
api
new
tokenstream --> proper noun
api
(new,12,15) new
(tokenstream,16,27) tokenstream
(api,28,31) api
null
new
tokenstream
api
On my patch:
Testing old API...
new
tokenstream --> proper noun
api
new
tokenstream --> proper noun
api
new
tokenstream
api
new
tokenstream --> proper noun
api
(new,12,15) new
(tokenstream,16,27) tokenstream --> proper noun
(api,28,31) api
null
new
tokenstream
api
Testing new API...
org.apache.lucene.analysis.TokenStream$UnsupportedTokenStreamAPIException: This stream or filter does not support the new Lucene 2.9 TokenStream API yet.
  at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:114)
  at org.apache.lucene.analysis.TestAPICompatibility.consumeStreamNewAPI(TestAPICompatibility.java:210)
  at org.apache.lucene.analysis.TestAPICompatibility.test3(TestAPICompatibility.java:162)
  at org.apache.lucene.analysis.TestAPICompatibility.main(TestAPICompatibility.java:93)
org.apache.lucene.analysis.TokenStream$UnsupportedTokenStreamAPIException: This stream or filter does not support the new Lucene 2.9 TokenStream API yet.
  at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:114)
  at org.apache.lucene.analysis.TestAPICompatibility.consumeStreamNewAPI(TestAPICompatibility.java:210)
  at org.apache.lucene.analysis.TestAPICompatibility.test4(TestAPICompatibility.java:183)
  at org.apache.lucene.analysis.TestAPICompatibility.main(TestAPICompatibility.java:98)
Not sure if we're talking about the same thing here. The problem is not that next() in the extended class doesn't get called.
I think we are talking about the same thing. We have exactly the same problem if you override OpenBitSetIterator but only implement next() instead of nextDoc(): next() is never called, because the new API (nextDoc()) is handling everything. It's exactly the same. You always have this if you deprecate formerly abstract methods and make the new one call the old one. If a subclass overrides the new API, the old API is dead for this class. If a further (3rd-party) subclass then overrides the deprecated method, it is never called -> bang. Other example: Filter.
We have this everywhere (DocIdSetIterator next() vs nextDoc(), Filter getBits() vs. getDocIdSet(), TokenStream very old API vs. old API).
It gets called, but the super.next() call throws either a ClassCastException, when consuming using the old API, or produces a different result than before when consuming the new API.
A ClassCastException? Why that? super.next() is always available (at least TokenStream on top of the hierarchy defines it).
One solution around this problem would be to always implement both incrementToken() and next(Token) in non-final classes that tend to be overridden (CharTokenizer). In this case you are right. So I should revert the removal of next(Token). But very old code only overriding next() still fails.
But final things like StandardFilter do not need to have both implementations.
By the way: I want to have my beer!!!
I decided a while ago that you'll definitely get it, because of your admirable tenaciousness.
After thinking about it again: your original patch with the UnsupportedOperationException has the same problem. If one overrides next(Token) in a non-final TokenStream/Filter, the overridden method is never called.
And it is also not enough to always override both methods in subclass-able streams/filters. With both patches we break BW compatibility for users that do not override TokenStream or TokenFilter directly.
Even my last comment on java-dev is not correct. For somebody who overrides a Filter (not a direct subclass of TokenStream/Filter/Tokenizer, or the in-between classes are also the user's own and only override deprecated APIs) and only overrides next(Token) (e.g. a subclass of CharTokenizer), the overridden method will never be called if the Filter/Stream is top-level in the filter chain.
I think we must accept a BW break somewhere and clearly document it in the following way in CHANGES.txt:
- You can mix old and new filters/streams without problems.
- Such mixed chains can be consumed with both new/old API
- To have this working, always override the abstract TokenStream/TokenFilter/Tokenizer directly and treat all other core classes as "final". In all other cases your filters/token streams will have unpredictable behaviour.
Uwe
Mr. Busch my friend, I'll buy both you and Uwe many beers if you resolve this issue soon!
Alright, I hope you are coming to Oakland in November!
I had a few (literally) sleepless nights last week to meet some internal deadlines; but it looks like I'll now have time to work on Lucene, so I'll continue on this issue tonight!
Attached is a new patch to current trunk. I did hard work to make it also compatible with test-tag (see the java-dev discussion last weekend: test-tag was not really testing against 2.4; the bw-branch was created in November from trunk, when the new TokenStream API was already committed):
- All core, contrib and tag tests pass correctly (and no changes to the bw-branch needed!)
- Michael's latest TestCompatibility also passes
- All non-final and abstract Tokenizers implement both next(Token) and incrementToken() for maximum compatibility (but there are still the problems mentioned above). These problems are always there and cannot be worked around, even with Michael's initial patch, and they exist at other places in Lucene, too (Filters, HitCollector).
- CachingTokenFilter, TeeTokenizer and SinkTokenizer and the corresponding tests are reverted to the 2.4 API and deprecated. There is no implementation available yet in new classes using solely the new API. This is the only thing on the TODO list for this case. I suggest the following name: CachingAttributesFilter; but for TeeTokenizer and SinkTokenizer I have no idea
In my opinion, this is ready to commit and the non-deprecated Caching/Sink/Tee things could be a separate issue.
We also need to write some conclusion or howto for CHANGES.txt, mentioning the problems with the switch to the new API.
Thanks for all your hard work here, Uwe.
I think this patch is as good as it can be for achieving the goal of being able to combine the old and the new API.
And I agree that my patch that I posted here has the same potential backwards-compatibility problems regarding inheritance.
I think in the majority of use cases we're fine here. Only the corner cases make me a bit nervous. I think the case I feel most uncomfortable with is when people use Lucene + some external analyzer package + their own subclasses. If they use Lucene 2.9, and the external package is not upgraded to the new API yet, but they already upgraded their own classes to the new API, then they might run into undefined problems. However, I don't even know how many such "external analyzer packages" exist (well, I think Grant mentioned he was working on one...)
And I still just have this not-going-away slightly bad feeling in my gut that there are still other corner case problems we haven't thought about yet. What makes this feeling worse is the fact that those problems might not result in exceptions, but in unexpected and hard-to-find search problems, because the wrong tokens were indexed.
The current patch uses reflection extensively to figure out which of the three APIs the user has implemented. The comments above mention the possible problems. The solution is cool, but also a bit hack-ish (no offense Uwe, you called it that yourself )
So, having said all this, I'd like other people to chime in here and give their opinion. I'm okay with committing this solution if everyone else is too.
I think the only solution to not break compatibility at all is to not touch the old API at all and provide APIs that switch on/off using the new API. That's what the code in trunk currently does. It has the major disadvantage that it doesn't allow combining the old and new API in the same chain, and that we have to implement both APIs in core Lucene until the old API is fully removed.
So Mike, Grant, Mark, or others, could you please comment here?
PS: Uwe, in any case, your solution is cool and I like how cleverly you solved the problems!!
Thanks!
Before going to bed yesterday, I remembered the fact that the attribute iterators list the interfaces and not the instances. I will fix this in a further patch to also have the possibility to iterate over the instances and e.g. only clone instances.
Currently when you clone or toString(), the big Token instance for all 6 attributes is cloned/printed 6 times. This is the only thing I forgot yesterday. I will think about it and maybe add an attributesInstancesIterator method or something like that.
What do we do with Tee/Sink/Cached?
I agree that in any case we should deprecate those three classes and write new ones. I think we actually need to merge Tee/Sink into one. Maybe name it TeeSinkTokenFilter. And it gets a method like
public TokenStream getSinkTokenStream();
That way we can ensure that the sink tokenstream has exactly the same attributes as the TeeSinkTokenFilter. Otherwise, a user could e.g. create the SinkTokenizer first, add some attributes to it, before calling new TeeTokenFilter(in, sink).
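(A hypothetical usage sketch of the merged class proposed above; the class and method names follow the suggestion in this comment and are not committed API:)

TeeSinkTokenFilter tee = new TeeSinkTokenFilter(someInput);
TokenStream sink = tee.getSinkTokenStream();   // created by the tee, so it shares the tee's attribute factory and attributes
// index the tee for one field and the sink for another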
I'm still trying to wrap my brain around this issue
So to sum up where things [seem] to stand here:
- We are switching to separate interface + impl for token
attributes. This is nice because a single class (eg Token.java)
can provide multiple attrs, it makes cloning more efficient, etc.
- We are keeping good old Token.java around; it simply implements all
the attrs, and is very convenient to use.
I think this is a great improvement to the new tokenStream API. It's
also a sizable perf gain to clone only one impl that has only the
attrs you care about.
Uwe then extended the original patch:
- Use reflection on init'ing a TokenStream to determine which of the
3 APIs (old, newer, newest) it implements
- Indexer (and in general any consumer) now only has to consume the
new API. Any parts of the chain that aren't new are automatically
wrapped.
- Core tokenizers/filters (and contrib, when we get to them) only
implement the new API; old API is "wrapped on demand" if needed
- One can mix & match old, new tokenizers, though at some perf cost.
But that's actually OK since with the original patch it's "all or
none", ie your chain must be entirely new or old
- I think a chain of all-old-tokenizer/filters and all-new
wouldn't see a perf hit? Wrapping only happens when there's a
mismatch b/w two filters in the chain?
I'm tentatively optimistic about the extended patch... (though these
back-compat, corner cases, etc., are real hard to think about). I
agree it's doing a lot of "magic" under the hood to figure out how to
best wrap things, but the appeal of only implementing the new api in
core/contrib tokenizers, only consuming new, being free to mix&match,
etc, is strong.
One concern is: TokenStream.initialize looks spookily heavy weight; eg
I don't know the "typical" cost of reflection. I think there are
likely many apps out there that make a new TokenStream for every field
that's analyzed (ie implement Analyzer.tokenStream not
reusableTokenStream) and this (introspection every time) could be a
sizable perf hit. Uwe was this included in your perf tests?
One concern is: TokenStream.initialize looks spookily heavy weight; eg I don't know the "typical" cost of reflection. I think there are likely many apps out there that make a new TokenStream for every field that's analyzed
At this point, I'd bet that's still the norm.
Some updates:
- Added missing License headers
- Differentiate between iteration over attributes or attribute impls. The problem here is that the unique attribute impls are not known, so a set must be created (which is done on the fly). As the attributes member is currently protected and can be modified (even by TokenFilters), there is no way around this. Michael: Do you have any idea about that? During cloning, state capturing and toString() only the unique instances should be visited.
Michael, some more questions: The superinterface Attribute does not contain clone(), but the interface TokenAttribute does. Why are there two interfaces (AttributeImpl already has clone())? In my opinion, clone() should be part of Attribute and TokenAttribute can be removed.
To the performance questions:
The new Token API uses reflection very intensively, even without my changes. Every time you add an attribute, the instance is checked for all implemented interfaces (via reflection), see addAttributeImpl(). This means creating new instances of TokenStreams is much more costly than with the old API (with the new "smart" TokenStreams and also without). I will prepare a small performance comparison (something like "time to create one million WhitespaceAnalyzer's TokenStreams" with the old API, the new API (as in trunk) and the newest API (this patch)).
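(For illustration, a rough sketch of the kind of reflection work an addAttributeImpl-style registration has to do: walk the class hierarchy and collect all interfaces that extend Attribute. The helper class here is made up; only org.apache.lucene.util.Attribute is a real Lucene class.)

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.Attribute;

final class AttributeInterfaceWalkSketch {
  /** Returns all interfaces of implClass (and its superclasses) that extend Attribute. */
  static List findAttributeInterfaces(Class implClass) {
    List found = new ArrayList();
    for (Class c = implClass; c != null; c = c.getSuperclass()) {
      Class[] interfaces = c.getInterfaces();
      for (int i = 0; i < interfaces.length; i++) {
        Class iface = interfaces[i];
        // register only interfaces that extend the marker interface Attribute
        if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
          found.add(iface);   // e.g. TermAttribute, OffsetAttribute, ...
        }
      }
    }
    return found;
  }
}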
Token is still deprecated in the latest patch - we are not going to deprecate it though, right?
Mike implemented a nice idea to solve the problems with tokenstreams overriding deprecated methods in LUCENE-1678.
I will try this out here and also fix the problems with # of attribute instances != # of attributes and the iterator problems because of this.
New patch with cleanup, fixing the last problems and inconsistencies in the AttributeSource with captureState, toString and so on, because attributes can now be implemented by more than one impl and one impl can implement more than one attribute (best example: Token). Renamed some methods for that and made sure that the two Maps holding attributes and instances are private and cannot be modified (to prevent users from doing bad things). There are only the public methods and iterators (unmodifiable) available (with changed semantics).
Still missing are the new CachingAttributesFilter and the new TeeSinkTokenizer combination for the new API (the 3 old ones are already deprecated).
I also removed some of the "old" captureState methods from AttributeSource. Also added test cases for that.
I also removed the TokenAttribute and moved clear into Attribute, as the base class AttributeImpl always implements this method.
After that, 1693 needs only the idea from LUCENE-1678, to detect whether in non-final classes any subclass overrides deprecated methods. Problematic core token streams are:
- ISOLatin1Filter (non-final and deprecated, so maybe it should not implement incrementToken at all) -> would be fixed then.
- KeywordTokenizer should normally be final, but is not -> needs this special trick.
- StandardTokenizer is the same, should be final, but isn't.
Contrib analyzers could have this prob, too, but this must be checked when doing the
transformation there. If the backwards wrapper is available, we could do
this like with the analyzers in LUCENE-1678.
I forgot:
I also added the TestCompatibilty class as a testcase and implemented ASCIIFoldingFilter with new API.
Hi Uwe,
I modified the compatibility test so that it is now a junit and I'd like to commit it too.
It's failing right now, because it's testing a lot of different combinations. I need to check if all of those different tests are actually valid, because we're saying you can't use Tee/Sink with the new API anymore.
I'm also working on a new TeeSinkTokenizer that will be the replacement. However, we have to make sure that everyone who is upgrading to 2.9 (without code modifications, just after 2.x -> 2.9 jar replacement) and using the old Tee/Sink doesn't see a different behavior.
So if people are using streams/filters that implement next(Token) I think the performance should be comparable - even though there's also a (hopefully small) performance hit to expect because of more method calls and if checks.
However, if people are using next() in a chain with core streams/filters, then every single token will now be cloned, possibly multiple times, right?
/** Returns the next token in the stream, or null at EOS.
 * @deprecated The returned Token is a "full private copy" (not
 * re-used across calls to next()) but will be slower
 * than calling {@link #next(Token)} instead. */
public Token next() throws IOException {
  checkTokenWrapper();
  if (hasIncrementToken) {
    return incrementToken() ? ((Token) tokenWrapper.delegate.clone()) : null;
  } else {
    assert hasReusableNext;
    final Token token = next(tokenWrapper.delegate);
    if (token == null) return null;
    tokenWrapper.delegate = token;
    return (Token) token.clone();
  }
}
However, if people are using next() in a chain with core streams/filters, then every single token will now be cloned, possibly multiple times, right?
That was a similar problem in 2.4: whenever you call next() to consume a stream, every token is a new instance. Only the cloning WAS not needed, but it is now needed for BW compatibility ("full private copy").
In my opinion, this is negligible, as indexing speed is what matters, and the indexer always uses incrementToken(). If somebody has written their own query parser that uses next() to consume, speed is not really important. And it is deprecated, and the docs (even in 2.4) already say: "but will be slower than calling next(Token)".
There is one case when it also affects you (during indexing). If you have an old-style token filter that calls next() on the next stream, which is new-API, it would clone. In my opinion, the speed is about the same as before:
- the 2.4 code created a new uninitialized token instance and the filter then filled it with data; you have to initialize the char arrays and so on.
- here the token from the TokenWrapper attribute is reused (no allocation costs for arrays and so on), but you have to clone the Token (to be a full private copy).
So if people are using streams/filters that implement next(Token) I think the performance should be comparable - even though there's also a (hopefully small) performance hit to expect because of more method calls and if checks.
I found no performance hit; it is about the same speed. The variance between test runs is bigger than any measurable performance impact. The other sensitive thing (TokenWrapper): the wrapping using TokenWrapper was in the original indexing code, too (this BackwardsCompatibilityStream private class).
The if checks are all on final variables, so they can be optimized away by the JVM. The method calls are inlined, as far as I have seen.
It's failing right now, because it's testing a lot of different combinations. I need to check if all of those different tests are actually valid, because we're saying you can't use Tee/Sink with the new API anymore.
Have you seen my backwards compatibility test, too? It is a copy of yours (with some variation). The Lucene24* classes were removed, because Tee/SinkTokenizer were reverted to their original 2.4 status in the patch (they only implement the old API).
As far as I can see (not yet tried out), you try to test new-style-API streams with the old Tee/Sink tokenizer, which is deprecated. You were not able to do this before 2.9 (no new API), so the bw problem is not there. If you rewrite your streams with the new API, you should use TeeSinkTokenizer, too.
I changed TokenStream to use real final boolean variables to help the optimizer to optimize away the if checks.
By the way: the problem with initialize() and lateInitialize() [new] is the typical example of why you should never call non-private or non-final methods from a ctor (Sun has a big warning about this in the documentation). The problem: during initialize() the final fields are not yet initialized, because initialize() is called before the ctor of the actual class (which initializes the fields) runs. Very nasty. So I moved the checks for which methods are available into a private lateInitialize().
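(A minimal, made-up illustration of that pitfall, unrelated to any Lucene classes:)

class Base {
  Base() {
    initialize();   // runs before the subclass field initializers
  }
  protected void initialize() { }
}

class Sub extends Base {
  private final Object marker = new Object();  // assigned only after Base() has returned
  protected void initialize() {
    System.out.println(marker);                // prints "null" here, not the Object
  }
}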
In my opinion, we should remove the protected initialize() from AttributeSource and let every subclass do its initialization using a (possibly) private/final method on its own (must be called in two ctors). What was the idea behind that?
There is one case, when it also affects you (during indexing). If you have an old-style tokenfilter that calls next() on the next stream that is new-api, it would clone. In my opinion, the speed is about the same like before:
Yes this is exactly what I mean. Not sure if cloning in this case is slower than creating a new empty instance; if yes, it's probably not significant.
As far as I see (not yet tried out), you try to test new-style-API streams with the old Tee/Sink tokenizer, that is deprecated. You were not able to do this before 2.9 (no new API) and so the bw problem is not there. If you rewrite your streams with new API, you should use TeeSinkTokenizer, too.
You are right - it fails because it uses a new attribute that will not be cached in the Tee/Sink. So I agree that this is not a valid test if we say that Tee/Sink only supports the old API. I will remove the cases that are invalid.
Thinking out loud here again: What if a user uses Lucene in combination with a third-party jar containing TokenStreams, and also their own implementations? Are there use cases where it would be necessary for us to provide a switch to run in only-old-API mode?
In my opinion, we should remove the protected initialize() from AttributeSource and let every subclass do its initialization using a (possibly) private/final method on its own (must be called in two ctors). What was the idea behind that?
Definitely. I should start using the nocommit comments Mike puts into early patches. I never wanted the initialize() method to stay for similar reason that you mentioned here. I added it for a test, because I was lazy.
What we really should have is an AttributeSource constructor that takes an AttributeFactory. You need to be able to set the factory in the ctor, because TokenStreams usually add attributes in their ctor.
What we really should have is an AttributeSource constructor that takes an AttributeFactory. You need to be able to set the factory in the ctor, because TokenStreams usually add attributes in their ctor.
I have done this here locally. The ctor that takes another AttSource uses the factory of the delegate; do we also need a ctor with another AttSource and the factory? The case may be
I also added this ctor to TokenStream, but not TokenFilter (as it uses the factory from the filtered stream).
I also cleaned up the initialization of TokenStream. Now the tokenWrapper and all the booleans are final.
If you are interested I could post my current patch.
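(A hedged sketch of how a stream could receive such a factory through its constructor. The tokenizer class is made up, and the super(factory, input) constructor is assumed to exist as discussed in the last comments.)

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

public final class FactoryAwareTokenizerSketch extends Tokenizer {
  private final TermAttribute termAtt;

  public FactoryAwareTokenizerSketch(AttributeSource.AttributeFactory factory, Reader input) {
    super(factory, input);   // assumed ctor: all attributes added below are created by 'factory'
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    return false;            // sketch only: a real tokenizer would read from 'input' and fill termAtt
  }
}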
Thinking out loud here again: What if a user uses Lucene in combination with a third-party jar containing TokenStreams, and also own implementations. Are there use cases where it would be necessary for us to provide a switch to run in only-old-API mode?
There is still the problem with a TokenStream overriding a deprecated method of a core filter that will never be called anymore (see LUCENE-1678, which faces the same problem). I will try to fix this here using the same mechanism.
I tested with mixing contrib tokenfilters and core filters. I have seen no problems.
There is still the problem with a TokenStream overriding a deprecated method of a core filter that will never be called anymore (see LUCENE-1678, which faces the same problem). I will try to fix this here using the same mechanism. I tested with mixing contrib tokenfilters and core filters. I have seen no problems.
Yeah that is a good fix for overriding the non-final methods of the core filters.
I guess what I meant here is that my invalid use case could happen in the field: let's say something like tee/sink lived in the third-party jar, and the user upgrades to Lucene 2.9 and also upgrades their own streams/filters, but a version of the third-party jar that has the new implementations is not available yet. The user couldn't simply implement both the new and old API and then use such a filter with the not-updated third-party jar, unless there was an only-old-API switch.
But I'm not sure how realistic this scenario is. I guess we'll find it out sooner or later
Have you seen my backwards compatibility test, too?
Oh cool... I guess I should have checked that before...
I think if I remove the invalid tests it looks pretty similar to yours, so let's keep the one you have in the patch.
I'm going to bed now... will add the new tee/sink stuff tomorrow.
But I'm not sure how realistic this scenario is. I guess we'll find it out sooner or later
I think in 2.9 we have a lot of BW breaks (see the long list at the beginning of CHANGES.txt). This not-so-realistic case is just one more. There should be a BW note in CHANGES.txt about the new TokenStream API and possible traps (something like: "we did our best to make it bw-compatible, but there may be the following problems: <list>. You should upgrade all your TokenStreams as soon as possible, especially if a strange behaviour occurs.")
Do you have an example of TeeSink and CachingAttributesFilter?
Yeah that is a good fix for overriding the non-final methods of the core filters
I would try to do this now.
After the whole day thinking about a solution for overriding deprecated methods, I came to one conclusion/solution: I would create a "visible" backwards break (to be noted in CHANGES.txt).
Mike's idea from LUCENE-1678 is good, but very complicated for this issue and may lead to unpredictable behavior. And what makes me think that this will not be a problem for developers is the fact that there is no JIRA issue about a similar break in the past: when Lucene switched from next() to next(reusableToken), we also had a compatibility method in TokenStream that delegates to next(new Token()). Core streams did not implement the old method and the indexer code only called next(Token). If somebody had overridden only the old next() method of a core tokenstream, this method would never have been called -> bang, we have a break, but nobody realized it. With the new patch, we have the same in 2.9 for incrementToken vs. next(Token) and also next(). In principle it is the same issue as in LUCENE-1678.
The good thing is, that most TokenStreams in core are final, except the following ones:
- ISOLatin1Filter
- KeywordTokenizer
- StandardTokenizer
...and last but not least the whole structure of subclasses of CharTokenizer. The good thing is (and thanks to the developer!), they are correctly implemented, making their methods incrementToken and next(Token) final. Haha, nobody could override them; the class is not final, but the affected methods are. So all subclasses of CharTokenizer are also not affected.
My latest patch also includes this final modifier for the abstract CharTokenizer:
public final Token next(final Token reusableToken) throws IOException {
  // Overriding this method to make it final as before has no effect for the reflection-wrapper in TokenStream.
  // TokenStream.hasReusableNext is true because of this, but it is never used, as incrementToken() has preference.
  return super.next(reusableToken);
}
So it is not overridable and is still compatible (code calling next(Token) will be delegated to incrementToken() by the superclass). For complete correctness, next() should also be overridden similarly. In both cases the super's method always delegates preferably to incrementToken(), so even if a subclass of TokenStream overrides this method and therefore hasNext == true and hasReusableNext == true, incrementToken() is still preferred, so everything works.
This prevents users from overriding next() or next(Token) of core or contrib tokenstreams (which in my opinion nobody has ever done, because if anyone had, we would have a bug report regarding the last transition).
For those people that really have done it (they used one of the three classes above as the superclass for their own class), the error would not be detectable. Their TokenStream would simply not work, as next()/next(Token) is never called. To produce a compile error for them (or a runtime error, when they instantiate such a class), I suggest to include this backwards break (which is better than failing silently). All non-final TokenStreams/Tokenizers/TokenFilters should simply include the code snippet above to redeclare next() and next(Token) as final (only delegating to super) in the first subclass that implements incrementToken(). Instead of failing silently, users will get runtime linker errors (when they replace the Lucene jar) or compile errors. We have done a similar change in TokenFilter, because we made the delegate stream final to prevent disturbing the attributes (Mike did this in LUCENE-1636).
CHANGES.txt would contain this as BW-break together with the other breaks.
Any comments? Michael, what do you think?
I like the cleanup you did! Good that initialize() is gone now. The only small performance improvement we should probably make is to avoid checking which method in TokenStream is overridden when onlyUseNewAPI==true.
To produce a compile error for them (or a runtime error, when they instantiate such a class), I suggest to include this backwards break (which is better than failing silently). All non-final TokenStreams/Tokenizers/TokenFilters should simply include the code snippet above to redeclare next() and next(Token) as final (only delegating to super) in the first subclass that implements incrementToken().
+1. I think this backwards-compatibility break is acceptable and makes sense. Most likely the final was just forgotten in these three classes in the first place - all the other core classes declare these methods correctly as final. So we can kind of consider this as a bug fix.
And I like that they will get a compile or link error, instead of seeing undefined behavior.
This is basically your last patch with these changes:
- I removed AttributeSource.setAttributeFactory(factory). Since we have the constructor now that takes the factory as an arg, there should be no need to ever change the factory after a TokenStream was created. It would also lead to problems regarding e.g. Tee/Sink: a user could add attributes to the Tee, then change the factory, then create the sink. How could we then create the same attribute impls for the sink? So I think the right thing to do is to not allow changing the factory after the stream is instantiated.
- I added the initial (untested) version of TeeSinkTokenFilter to demonstrate how I think it should work now. I'll finish tomorrow or Friday (add more javadocs and unit test). I'll also add the CachingAttributeTokenFilter, which is essentially almost the same as the new inner class of TeeSinkTokenFilter. When I have CATF the inner class can probably just extend it.
OK, looks good. I think you will go to bed now, so the work should not collide. If you start to program again, ask me so that I can post a patch (which makes merging simpler). TortoiseSVN has a problem with merging added files, so when applying your patch I have to remove them first.
Some comments:
- TeeSinkTokenFilter looks good, I think we should also add a test for it (in principle the version of TestTeeTokenFilter from current trunk, not the one reverted to old API from the current patch)
- I do not understand completely why this WeakReference is needed between Tee and Sink. If it is needed, the code may fail with an NPE when Reference.get() returns null. The idea is that one can create a Sink for the Tee and throw the Sink away; the Tee would then simply not pass the attributes to the sink anymore? If this is the case, the check for Reference.get() == null is really missing (see the small sketch after this comment).
- Should I implement CachingAttributesFilter as replacement for CachingTokenFilter, or will you do it together with TeeSink?
I will now start to add all the finals to the missing core analyzers.
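(To make the missing check concrete, an illustrative sketch with made-up classes, not the patch code:)

import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

final class SinkSketch {
  void addState(Object state) { /* cache the captured state for later replay */ }
}

final class TeeSketch {
  private final List sinkRefs = new LinkedList();   // WeakReference-to-SinkSketch entries

  void addSink(SinkSketch sink) {
    sinkRefs.add(new WeakReference(sink));
  }

  void passStateToSinks(Object state) {
    for (Iterator it = sinkRefs.iterator(); it.hasNext();) {
      SinkSketch sink = (SinkSketch) ((WeakReference) it.next()).get();
      if (sink == null) {
        it.remove();                 // sink was garbage collected: drop the stale reference
      } else {
        sink.addState(state);        // still alive: hand over the captured state
      }
    }
  }
}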
The only small performance improvement we should probably make is to avoid checking which method in TokenStream is overridden when onlyUseNewAPI==true
I could disable this for next() and next(Token). In the case of incrementToken, it should really check that it is enabled, because not doing so would fail hard or create endless loops. So the check should be there in all cases. But if onlyUseNewAPI is enabled, I could simply define hasNext and hasReusableNext = false. I will do this.
Favor to ask, when this is ready to commit, can you give a few days notice so that the rest of us can look at it before committing? I've been keeping up with the comments, but not the patches.
New patch with some more work. First the fantastic news:
As CachingTokenFilter has no API to access the cached attributes/tokens directly, it does not need to be deprecated; it just switched the internal and hidden impl to incrementToken() and attributes. I also added an additional test in the BW test case that checks whether the caching also works for your strange POSTokens. And it works! You can even mix the consumers, e.g. first use the new API to cache tokens and then replay using the old API. Really cool. The reason why the POSToken was not preserved in the past was an error in TokenWrapper.copyTo(): this method created a new Token and copied the contents into it using reinit(). Now it simply creates a clone and lets delegate point to it (this is how the caching worked before).
In principle Tee/SinkTokenizer could also work like this; the only problem with this class is the fact that it has a public API that exposes the Token instances to the outside. Because of that, there is no way around deprecating it.
Your new TeeSinkTokenFilter looks good, it only had one problem:
It used addAttributeImpl to add the attribute of the Tee to the newly created Sink. Because of this, the sink got the same instance as the parent added. With useOnlyNewAPI, this does not have an effect for the standard attributes, as the ctor already created a Token instance as the implementation and added it to the stream, so addAttributeImpl had no effect.
I changed this to use getAttributeClassesIterator and added a new attribute instance for each attribute to the sink using addAttribute. As the factory is the same, the attributes are generated in the same way. TeeSinkTokenizer would only not work correctly if somebody adds a custom instance using addAttributeImpl in the ctor of another filter in the chain. In this case, the factory would create another impl and restoreState throws IAE. In backwards compatibility mode (default) the newly created sink and also the tee always have the default TokenWrapper implementation, so state restoring also works. You only have a problem if you change useOnlyNewAPI in between (which would always create corrupt chains).
Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?
I started to create a test for the new TeeSinkTokenFilter, but there is one thing missing: The original test created a subclass of SinkTokenizer, overriding add() to filter the tokens added to the sink. This functionality is missing with the new API. The correct workaround would be to plug a filter around the sink and filter the tokens there? The problem is then, that the cache always contains also non-needed tokens (the old impl would not store them in the sink).
Maybe we add the filter to the TeeSinkTokenFilter (getting a State, which would not work, as the contents of State are pkg-private?). Some other way? Or leave it as it is and let the user plug the filter on top of the sink (I prefer this)?
I forgot: I also implemented the final next() methods in all non-final classes.
Favor to ask, when this is ready to commit, can you give a few days notice so that the rest of us can look at it before committing? I've been keeping up with the comments, but not the patches.
No problem. I want to finish this until the weekend and then you have time to review it. My holidays start next week on monday, so I have only limited time after that.
As CachingTokenFilter has no API to access the cached attributes/tokens directly,
Oh true well, scratch it from the TODO list...
We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it didn't have the getTokens() method, for the reasons I explained above.
Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?
Yes, I thought about this for a while. It would be nice to have this (i. e. cloning an AttributeSource) in general: you could reduce the costs for initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around, that contains the wrapper and the mappings from the 6 default interfaces. Then instead of constructing it every time we just clone that AttributeSource for new TokenStreams.
The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes.
I think it should be easy? We just need to implement clone() for AttributeSource.
I made these changes:
- added clone() to AttributeSource and changed TeeSinkTokenFilter to use it.
- added a SinkFilter as inner interface of TeeSinkTokenFilter that adds the missing functionality you mentioned, Uwe.
Patch looks good.
Only one thing: If you clone a TokenStream you will not get a TokenStream, only an AttributeSource instance (if TokenStream does not override). For our use case it is ok, because we only want to have the attributes and impls cloned, but it is strange.
A real clone() method should call super.clone() and then create new maps and copy the old maps into them. Not sure. Or we do not call the method clone() but cloneAttributes(), returning not Object but AttributeSource. E.g.:
public AttributeSource cloneAttributes()
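(A rough sketch of what such a cloneAttributes() could do; the helper and iterator names are taken from this discussion where possible, but treat them as assumptions rather than the committed method:)

public AttributeSource cloneAttributes() {
  // new source with the same factory, so further addAttribute() calls create compatible impls
  AttributeSource clone = new AttributeSource(getAttributeFactory());
  // register a deep clone of every unique AttributeImpl instance of this source
  for (Iterator it = getAttributeImplsIterator(); it.hasNext();) {
    clone.addAttributeImpl((AttributeImpl) ((AttributeImpl) it.next()).clone());
  }
  return clone;
}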
I will now rewrite the TeeSink test to use the new interface, and the test should then pass as before with Tee/Sink separate (but I keep both tests available, one that tests the old ones and one that tests the new class). I will also add a test for the cloning to TestAttributeSource.
The cloning also speeds up the case with useOnlyNewAPI=true, because the addAttribute call also uses reflection to find out what interfaces are implemented. In my opinion this cost (the while loop with getSuperclass() and so on) is much more costly than the simple check of the declaring class for a method.
The patch is still missing some javadocs. If we finished this, are there any other things to do? The optimizations in QueryParser to clone and so on are not really part of this issue, so could be done separately.
Or we do not call the method clone() and call it cloneAttributes not returning Object but AttributeSource.
+1. Let's do that.
If we finished this, are there any other things to do? The optimizations in QueryParser to clone and so on are not really part of this issue, so could be done separately.
I agree. I don't think these optimizations are critical at this point. I think updating the javadocs should be the only remaining thing here. (given that everyone else is ok with this patch)
The other related issues I think will be straightforward... except LUCENE-1448, I have the feeling this will cause some headaches too.... not sure if you read the discussions in 1448 yet, Uwe?
+1. Let's do that.
OK, I'll change this locally. And I'll remove the Cloneable interface again. In principle, this method cloneAttributes() should only be used to create a new TokenStream that should use the same attributes, but needs different instances. TeeSink is currently the only example for this, but more may follow.
About the TokenStream clones in QueryParser: I think, this will not work, as the TokenStream then needs to be really cloned with also setting a new Reader for the input. In my opinion, the reusableTokenStream method of Analyzer should handle this and not QueryParser.
The other related issues I think will be straightforward... except
LUCENE-1448, I have the feeling this will cause some headaches too.... not sure if you read the discussions in 1448 yet, Uwe?
Wrrrrr, this one is hard. This final offset does not really fit very well into the current attributes API; it could be a new extra attribute that is only updated at the end of the stream (but the problem is that it needs to be done when incrementToken returns false).
...and Mike said all others are trivial
I will now update as noted before
...and Mike said all others are trivial
He also said he's willing to skip that one for 2.9 though.
I'd rather not if we can help it though - I started looking at it last night, but I got side tracked before I got very far thinking about it.
Here is my latest patch before I go to bed. I did not have much time today, but I implemented parts of the TeeSinkTokenFilter test. The first test and also the performance test are implemented. The performance test is almost as fast as the old Tee/Sink combination (good news). I found a small bug in the new Sink (it did not lazily create the iterator), but it is fixed and it works as expected (without calling reset() on the sink first).
The second test is not implementable with TeeSinkTokenizer, and this is a limitation: you are not able to combine different sources into one sink (and this is what the second test does). I am not sure how you could implement this at all with the new API. It would only work if both tee streams had exactly the same attributes, so they could feed their attributes into the same sink.
Michael, any idea? The functionality to feed tokens from two streams into one sink is nice, but how to do it? Or is it just a useless theoretical demonstration in the test?
Good night, Uwe
Michael, any idea? The functionality to feed tokens from two streams into one sink is nice, but how to do it? Or is it just a useless theoretical demonstration in the test?
I have personally never used the Tee/Sink stuff, so I'm not sure in what kind of ways people use it.
Also Tee/Sink does not use any internal-only APIs, so everyone could implement it outside of Lucene themselves. Of course this kind of stuff is a bit more tricky with the new API.
If we want to support multiple tees feeding one sink, I think it needs to be based on the Attribute interfaces themselves, not on the impls. Then it doesn't really matter what kind of impls the sink has. The tee would just try to copy the Attribute's value over. But in this case we'd need to implement the copyTo() method a bit differently, so that you can copy values per Attribute interface. This would probably be a bit slower compared to the current TeeSinkTokenFilter approach.
I have personally never used the Tee/Sink stuff, so I'm not sure in what kind of ways people use it.
Does anyone use it? Given the difficulty of using it, esp since Lucene has been sorting fields before analysis (hence you have to name the fields properly to get one to be indexed before the other), maybe no one is using it.
In my opinion, this test that I was not able to reimplement using the new TeeSink is very theoretical. Someone who consumes two streams could always implement this very simply using one of two approaches:
- consume the first stream completely, then the other
- consume one token from the first, then one token from the next and so on
Who invented the whole Tee/Sink originally, and for what? Maybe we should simply ask on java-user who has ever used it. I had never heard of Tee/Sink before Michael confronted me with his strange POSToken tests
So the simple case of an AppendingTokenStream could easily be implemented in the same way (just a filter with not only one delegate but an iterator of delegates). Just have an empty constructor and then add TokenStreams to it (which adds all attributes of each added tokenstream to itself). Then each call to incrementToken() consumes each added stream until it is exhausted. But I do not know if we really need to add this just for the 1% of users that could also do it themselves.
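(A hedged sketch of that AppendingTokenStream idea; the class is made up, and the tricky part, registering each added stream's attribute instances with this stream so consumers see their values, is only hinted at in a comment:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;

public final class AppendingTokenStreamSketch extends TokenStream {
  private final List inputs = new ArrayList();  // TokenStream delegates, consumed in order
  private int current = 0;

  public void add(TokenStream input) {
    // a real implementation would also have to share/register the delegate's
    // attribute instances with this stream (e.g. via addAttributeImpl), so that
    // a consumer reading attributes from this stream sees the delegate's values
    inputs.add(input);
  }

  public boolean incrementToken() throws IOException {
    while (current < inputs.size()) {
      if (((TokenStream) inputs.get(current)).incrementToken()) {
        return true;    // the current delegate produced a token
      }
      current++;        // delegate exhausted: move on to the next one
    }
    return false;       // all delegates exhausted
  }
}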
In my opinion the whole Tee/Sink/Appending is very theoretical (look at this test, it is very strange). Who would ever create such a ModuloTokenStream?
Who invented the whole Tee/Sink originally for what?
Yonik and I did.
I have used it and would like to use it more. For the case where the number of tokens captured is roughly less than 50% of the original number of tokens, it is faster than reanalyzing.
It is not theoretical at all, even if the test is. Imagine a stream that identifies phrases, captures them and then tees them to a second field. Phrases happen infrequently relative to the number of overall tokens, so it is much faster to just buffer those few tokens that are part of the phrase and tee them to your "phrases" field.
The fact that it may not be used much now is probably due to lack of marketing more so than lack of functionality.
I have never heard of Tee/Sink before Michael confronted me with his strange POSToken tests
Hey, stop insulting my awesome backwards-compatibility tests!!
Hey, stop insulting my awesome backwards-compatibility tests!!
They are fantastic. But I would use a simple FlagsAttribute/Token.flags() to annotate my tokens instead of creating a POSToken subclass...
It is not theoretical at all, even if the test is. Imagine a stream that identifies phrases, captures them and then tees them to a second field. Phrases happen infrequently relative to the number of overall tokens, so it is much faster to just buffer those few tokens that are part of the phrase and tee them to your "phrases" field.
For this the new TeeSinkTokenFilter works like before (only better):
TeeSinkTokenFilter tee = new TeeSinkTokenFilter(stream);
doc.add(new Field("everything", tee));
doc.add(new Field("parts1", new DoSomethingTokenFilter(tee.newSinkTokenStream())));
doc.add(new Field("parts2", new DoSomethingOtherTokenFilter(tee.newSinkTokenStream(optionalFilter))));
Our problem was more to have two Streams fed into the same Sink. This is not possible anymore, because Sinks are generated by a factory now and are always bound to one Tee.
Given the difficulty of using it, esp since Lucene has been sorting fields before analysis (hence you have to name the fields properly to get one to be indexed before the other), maybe no one is using it.
Can't we fix Tee/Sink so that whichever "tee" is pulled from first, does the caching, and then the 2nd one pulls from the cache?
Ie right now when you create them you are forced to commit to which is "primary" and which is "secondary", but if we relax that then it wouldn't be sensitive to the order in which Lucene indexed its fields.
Of course, someday Lucene may index fields concurrently, then Tee/Sink'll get really interesting
In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of "sinks" from it:
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
In this case the caching would be done directly in the splitter and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenFilter does it). In principle it would be the same as a CachingTokenFilter.
But on the other hand: you can always create a CachingTokenFilter and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
This would not work, if the indexer can consume the different fields in parallel. But with the current state it would even not work with Tee/Sink (not multithread compatible).
New and final patch. I will be on holidays from tomorrow and have limited time for Lucene. I will respond to comments and to any major faults with the new API.
The new patch has some improvements:
- TeeSinkTokenizer is now able to also feed multiple tees into one sink. You first create a TeeSinkTokenizer and retrieve a TeeTokenStream from it (newSinkTokenStream()). After that you can add this TeeTokenStream to another tee (addSinkTokenStream()). The test (now similar to the old test) and the javadocs demonstrate this.
- Reflection performance was greatly improved by using caches. Most time was spent in AttributeSource.addAttributeImpl() because it iterates through all interfaces of the supplied instance. It caches the found interfaces using an IdentityHashMap<Class<AttributeImpl>,LinkedList<Class<Attribute>>> keyed by the implementation class. The default AttributeFactory also uses a cache (IdentityHashMap) for the mapping <Class<Attribute>,Class<AttributeImpl>>. So the number of Class.forName() calls is drastically reduced.
- Also fixed a bug in addAttributeImpl after refactoring for the cache.
- TokenStream now has a separate AttributeFactory available that creates a TokenWrapper for the 6 default attributes. This is now a cleaner implementation. The extra checks in the next() default impls were removed because of this. Filters now also reuse the tokenWrapper instance already resolved by the input stream.
I did some performance tests with the final impl, analyzing the lorem ipsum text 100000 times, with new instances each time and with reused instances, using the old/new API for trunk with the latest patch, current trunk and Lucene 2.4 (old API only):
The results (but these tests are not very representative due to a variance of +/- 4 sec per run):
Testing trunk w/ newest API...
Time for 100000 runs with new instances (old API): 27.344s
Time for 100000 runs with reused stream (old API): 21.828s
Time for 100000 runs with new instances (new API only): 27.297s
Time for 100000 runs with reused stream (new API only): 24.484s
Testing trunk w/o newest API...
Time for 100000 runs with new instances (old API): 22.485s
Time for 100000 runs with reused stream (old API): 19.047s
Time for 100000 runs with new instances (new API only): 26.89s
Time for 100000 runs with reused stream (new API only): 23.719s
Testing 2.4...
Time for 100000 runs with new instances (old API): 18.984s
Time for 100000 runs with reused stream (old API): 18.75s
The cost of creating 100000 new instances on my 32-bit Thinkpad T60 is about 5 sec (no difference between the new API and the old API). The cost is not caused by reflection; it is caused by building the LinkedHashMaps for the attributes on creation. The current trunk was a little bit faster, because it uses only one LinkedHashMap.
One interesting thing: using only the new API is a little slower during tokenization, because it seems faster to use only one instance (Token) instead of 6 instances.
The cost of creating new instances is smallest with Lucene 2.4, because no attributes are used (in 2.9 it always creates the LinkedHashMaps, even if only the old API was used in current trunk).
Michael: can you wait for comments and commit & close then? Or should I do this on Wednesday? After closing this, there are some more issues to be notified/closed, because some of the core tokenstreams were already upgraded to latest API.
Hi Uwe,
I haven't reviewed the latest patch yet, but it sounds like great improvements you made!
I can commit this after waiting a few days, and will also review during the time. And then I'll work on the related issues, just a bit worried still about 1448.
OK!
I forgot to mention: I replaced the Lucene JARs in trunk Solr and ran all the tests there; all of them pass! No speed degradation in Solr, and all analyzers and query parsers work (and they still use the old API!). Also all Lucene core, Lucene contrib and backwards tests pass.
By the way: the additional speed degradation even when using the old API with the current trunk version (which should have no speed degradation there when reusing the stream) is caused by the latest CharFilters added to the Tokenizers. So please do not say they come from the new API!
Forgot CHANGES.txt entries with the backwards-break note and other changes.
When starting to transform the contrib streams, we should not forget to do the following (maybe add this note to the JIRA issues):
- rewrite and replace next(Token)/next() implementations by new API
- if the class is final, no next(Token)/next() methods needed (must be removed!!!)
- if the class is non-final add the following methods to the class:
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
Also the incrementToken() method must be final in this case.
I like the cache you added to AttributeSource for speedup!
One thing we could improve is to add this check, which avoids iterating the foundInterfaces list in case the AttributeImpl has been added previously:
/** Adds a custom AttributeImpl instance with one or more Attribute interfaces. */
public void addAttributeImpl(final AttributeImpl att) {
  final Class clazz = att.getClass();
  if (attributeImpls.containsKey(clazz)) return;
Yes, good idea. I added it locally, works well. This is also a speed improvement (the synchronized access to the cache is also removed by this).
I think a new patch is not needed, just add it to your checkout, too!
OK I looked at the latest patch! It looks good – only these questions:
- Are we going to keep Token.java or not? (Current patch still has
it deprecated).
- Typo (occure) in CHANGES.txt
- It looks like we are making all core tokenizers/filters final (to
foreclose future back-compat problems if we need to change the
API)? If so, we should also make CachingTokenFilter final?
One of the things that works really well in Solr is that any time some significant JIRA issue is undertaken, a Wiki page is also generated that effectively documents the ideas in the patch, as well as how to use it and thus results in the final page effectively becoming the documentation. I know Mike and Uwe have done a ton of work on this, but would it be too much trouble to ask for a Wiki page that describes the current state of the patch? It is really hard to follow, in JIRA, all the different threads and which ones are still valid and which are not.
Mark Miller on java-dev:
* Are we going to keep Token.java or not? (Current patch still has it deprecated).
I need to know this as well - I have to make a new Token class for the Highlighter package if this one is deprecated. It would seem a convenience to keep it around.
* Are we going to keep Token.java or not? (Current patch still has it deprecated).
I need to know this as well - I have to make a new Token class for the Highlighter package if this one is deprecated. It would seem a convenience to keep it around.
I would also keep it non-deprecated. It is useful also as a convenience class and can be used as one attribute instance for all 6 base attributes (the backwards compatibility layer of the current patch already does this).
Updated patch that has the following changes:
- Updated QueryParser.jj instead of .java and rebuilt the parser
- added final to incrementToken() in CachingTokenFilter
- added Michael's AttributeSource improvement
Uwe's latest patch with these changes:
- Token is not deprecated anymore.
- Updated the examples in the 'New TokenStream API" section in package.html (analysis package) to reflect the changes this patch introduces (Attribute vs. AttributeImpl.)
- All new files have svn:eol-style=native set now.
test-core, test-contrib and test-tag all pass. I think this is ready to commit now.
We might want to improve some javadocs a bit, especially the "expert" APIs, such as AttributeFactory. Also another section about how to improve caching performance would be good. But I think we should open a separate issue for the doc improvements to get the other related issues going. The main APIs are pretty well documented I think.
I think the points listed in the description of this issue are still mostly valid. Uwe, do you want to add the changes you made in addition?
Forgot the new TeeSinkTokenFilter and reformatted the text to use NLs.
My first post to the list; it appears I should comment here in JIRA rather than reply to email. Apologies if I did this wrong:
I've been following this AttributeSource/TokenStream patch thread and reviewing the changes and the backwards-compatibility issues.
Extremely interesting problem/solution.
While looking at Uwe's PerfTest3, I noticed an unused allocation in the last run for "reused stream (new API only)":
for (int i = 0; i < c; i++) {
  if (i == 1000) t = System.currentTimeMillis();
  tz.reset(new StringReader(text));
  // Token reusableToken = new Token();   <<<< This one
  int num = 0;
  while (tok.incrementToken())
    num++;
}
Just a small cost, but it makes the new reusable API slightly faster.
With extra alloc:
Time for 100000 runs with new instances (old API): 12.75s
Time for 100000 runs with reused stream (old API): 9.969s
Time for 100000 runs with new instances (new API only): 13.969s
Time for 100000 runs with reused stream (new API only): 11.735s <<<<<<<<<<<<<<<<<<<<<<<<<<
Without extra alloc (changes only the last line's time):
Time for 100000 runs with new instances (old API): 12.593s
Time for 100000 runs with reused stream (old API): 9.578s
Time for 100000 runs with new instances (new API only): 13.75s
Time for 100000 runs with reused stream (new API only): 11.453s <<<<<<<<<<<<<<<<<<<<<<<<<<
dave
Thanks, Dave... I'll remove that unused allocation before committing.
OK, I think we're finally ready to commit here!
I'll wait until Friday - if nobody objects until then, I will commit the latest patch.
Token is no longer deprecated, instead it implements all 6 standard
token interfaces (see above). The wrapper for next() and next(Token)
uses this, to automatically map all attribute interfaces to one
TokenWrapper instance (implementing all 6 interfaces), that contains
a Token instance. next() and next(Token) exchange the inner Token
instance as needed. For the new incrementToken(), only one
TokenWrapper instance is visible, delegating to the currect reusable
Token. This API also preserves custom Token subclasses, that maybe
created by very special token streams (see example in Backwards-Test)
Token is no longer deprecated, but all the methods that return it are, right? I think they are, but it is late, and I may have missed something. So, what happens in 3.0? What good does a Token do at that point?
Also, in looking at incrementToken(), nearly the entire implementation seems to be based on deprecated stuff, doesn't that mean it has to all be re-implemented in 3.0? Something about that doesn't feel right.
More later...
Token is no longer deprecated, but all the methods that return it are, right? I think they are, but it is late, and I may have missed something. So, what happens in 3.0? What good does a Token do at that point?
It's what you often asked for: if you don't want to deal with multiple Attributes you can simply add a Token to the AttributeSource and cache the Token reference locally in your stream/filter, because Token now implements all core token attributes.
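A minimal sketch of how that can look (for illustration only, not code from the patch; it assumes the addAttribute(Class) lookup method of the new AttributeSource API in addition to addAttributeImpl):

AttributeSource source = new AttributeSource();
Token token = new Token();
// Registers the Token for every Attribute interface it implements
// (term, offset, type, position increment, payload, flags).
source.addAttributeImpl(token);

// Looking up the core interfaces should now return that same Token instance,
// so a stream/filter can keep the Token reference and work on it directly.
TermAttribute term = (TermAttribute) source.addAttribute(TermAttribute.class);
OffsetAttribute offset = (OffsetAttribute) source.addAttribute(OffsetAttribute.class);
// term == token and offset == token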
Also, in looking at incrementToken(), nearly the entire implementation seems to be based on deprecated stuff, doesn't that mean it has to all be re-implemented in 3.0? Something about that doesn't feel right.
Ideally there wouldn't be an incrementToken() implementation and it would be abstract. However, that's of course not backwards-compatible. Everything is deprecated in the implementation because this code was only written to support backwards-compatibility with the old API.
And in 3.0 all my fantastic backwards-compatibility stuff is completely removed - and incrementToken() is abstract - what a pity! The default impl of incrementToken() and all the other backwards-compatibility code parts in TokenStream.java will be removed (this is why even the new code parts are marked deprecated; that makes them easier to remove). The code will be about 1/3 of its current size once the backwards layer is removed.
The biggest problem of the migration to the new API is the backwards-compatibility code, because we now have 3 different TokenStream APIs (next(), next(Token), incrementToken()). TokenWrapper is also backwards-compatibility code and will be removed.
if you don't want to deal with multiple Attributes you can simply add a Token to the AttributeSource and cache the Token reference locally in your stream/filter, because Token now implements all core token attributes.
See the test case for AttributeSource, which tests this.
But I think it is not as easy if you have a chain of TokenFilters. Only the first one can add the Token impl to the AttributeSource (while the attributes have not been added yet). So if one TokenStream adds a TermAttribute and a Token impl is added later, the Token will handle all attributes except the TermAttribute.
To force a whole chain to use Token as the AttributeImpl, the first TokenStream created (normally the Tokenizer) should set an AttributeFactory that creates Token instances; all filters will then get it from the parent stream.
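Roughly, such a factory could look like this (a sketch only: the anonymous factory, the DEFAULT_ATTRIBUTE_FACTORY fallback and the chain at the end are assumptions based on the API described in this issue, and it assumes a Tokenizer ctor that accepts the factory, which is discussed further down):

AttributeSource.AttributeFactory tokenFactory = new AttributeSource.AttributeFactory() {
  public AttributeImpl createAttributeInstance(Class attClass) {
    // Hand out a Token for every core attribute interface; the first addAttribute()
    // call then maps all interfaces Token implements to that single instance.
    if (attClass.isAssignableFrom(Token.class)) {
      return new Token();
    }
    // Custom attributes not covered by Token fall back to the default factory
    // (FooAttribute -> FooAttributeImpl).
    return AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY.createAttributeInstance(attClass);
  }
};

// The factory is set on the root stream; all filters pick it up from their input.
TokenStream stream = new LowerCaseFilter(new WhitespaceTokenizer(tokenFactory, reader));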
So in general you can add a Token instance to the AttributeSource but should still reference the attributes by the interfaces.
So in general you can add a Token instance to the AttributeSource but should still reference the attributes by the interfaces.
I completely agree. We should discourage users from referencing Token and have them use the Attribute interfaces instead. That's the whole beauty and flexibility of this new API.
However, using Token as the actual implementing instance can be convenient to optimize caching or serialization performance.
Grant, are you still reviewing? I was going to commit this today... shall I wait?
I completely agree. We should discourage users from referencing Token and have them use the Attribute interfaces instead. That's the whole beauty and flexibility of this new API.
Has there been any thought given to reconsidering the new API's "experimental" status then?
I don't think the WARNING: encourages users to use these interfaces!
Or maybe a compromise: modify the javadocs to be a little less scary. Does this text have to be FF0000 (red)?
Actually, one thing I still don't get:
What happens to the attributes that have traditionally been thrown away during indexing? ie offset, type? How would one add them into the index like other attributes? Or, for that matter, exclude them.
I seem to recall there being a loop over attributes somewhere in the posting process, but I can no longer find that code.
I don't think the WARNING: encourages users to use these interfaces!
Yeah we should improve that. Let me commit the patch as is and open a new issue for improving the warnings.
I don't want to touch this patch anymore, it's so big.
What happens to the attributes that have traditionally been thrown away during indexing? ie offset, type? How would one add them into the index like other attributes? Or, for that matter, exclude them.
The default index format does not make use of some attributes, e.g. type, just as before.
Flexible indexing will allow customizing the format; then you will be able to store whatever attributes you like in the index.
Committed revision 797665.
Thanks, Uwe, for all your hard work!!
And thanks to everyone else who helped reviewing here.
Not sure what issue it stems from, but Token has a bunch of deprecated constructors that don't point you to a replacement.
Edit: must have come from the setBuffer stuff.
Not sure what issue it stems from, but Token has a bunch of deprecated constructors that don't point you to a replacement.
This is already the case in Lucene 2.4; unrelated to this issue.
This is a small improvement, related to Grant's comments:
The TokenStream ctor can take an AttributeFactory, so you can create a subclass of TokenStream that uses a specific AttributeFactory (e.g. one using Token instances). Filters do not need this (as they use the factory of the input stream).
The factory must therefore be set on the root stream, which is normally a subclass of Tokenizer. The problem: Tokenizer does not have ctors taking an AttributeFactory, so you cannot create any Tokenizer with a custom factory, e.g. one that uses Token as the impl.
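Presumably the new ctors would look something like this (a sketch only, not the actual patch; it just mirrors the existing Tokenizer ctors and the TokenStream(AttributeFactory) ctor mentioned above):

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;

  /** Construct a tokenizer with null input, using the given AttributeFactory. */
  protected Tokenizer(AttributeFactory factory) {
    super(factory);
  }

  /** Construct a tokenizer processing the given input, using the given AttributeFactory. */
  protected Tokenizer(AttributeFactory factory, Reader input) {
    super(factory);
    this.input = input;
  }

  // ... existing ctors and methods unchanged ...
}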
I will commit this patch shortly.
Patch that includes all mentioned improvements, but needs cleanup, documentation improvements and junits.