|
This patch looks good; thanks DM! I made a few small changes & attached a new rev of the patch:
I've broken up the changes to core and contrib into small patches. My reasoning for this is that it is a lot easier for me to keep up with other people's changes and re-issue a small patch.
Apply This one is for core in o.a.l.analysis but not the files contained in All the code has been migrated to use the reuse interface. I have run all the tests and they pass. (There is a weird dependency. One test depends upon demo. I hacked that to work.) I did not test to see if the code were any slower or faster. I'm not sure how one would do that. I think it should be faster since it doesn't bounce back and forth with termText and termBuffer.
I did not improve the code to use char[] instead of String. The places that can be improved call Token.setTermBuffer(String). I don't think these are necessary to this issue being resolved. A couple of other minor opportunities for char[] manipulation via o.a.l.util.ArrayUtil. Here and there, there is a need to remove leading and trailing whitespace from input before (sub-)tokenizing it. Currently this is done with String.trim(), even when working with char[]. It is sub-optimal as marshalling it into a String involves allocation and copying. Likewise, so does getting it out. It would be better to have int trim(char[]) which shifts leading spaces off the front and returns the length of the string without trailing spaces. There is a "randomize" routine that shuffles an array. While this is only used in one place, it appears to be general purpose array manipulation. This issue supersedes
This patch includes all the previous ones.
Note: It includes the functionality solving I did not do the "reuse" API mentioned in OK since you pulled it all together under this issue, I think we should commit this one instead of
DM, one pattern that makes me nervous is this, from QueryTermVector.java:
for (Token next = stream.next(new Token()); next != null; next = stream.next(next)) { I don't think you should be "recycling" that next and passing it back in the next time you call stream.next, because a TokenStream is not required to use the Token you had passed in and so suddenly you are potentially asking it to re-use a token it had previously returned, which it may not expect. Likely it won't matter but I think this is still safer: final Token result = new Token(); for (Token next = stream.next(result); next != null; next = stream.next(result)) { I'll give my analysis here. Feel free to make the change or kick it back to me to make it, if you think your pattern is best. (If I do it, it will be after this weekend.)
I've tried to be consistent here. The prior pattern was inconsistent and often was: Token token = null; while ((token = input.next()) != null) { There were other variants including "forever loops". As you noticed, I replace 2) Consumer: These "filter" their argument. Their only assumption is that in the call chain that there was a producer that created the token that they need to reuse. In this case, they typically will preserve startOffset and endOffset because those are to represent the position of the token in the input. They may refine type, flags and payload, but otherwise have to preserve them. Most typically, they will set the termBuffer. There are a few types of consumers. Here are some: Callers of Token next(Token) have the responsibility of never calling with a null token. (I think producer tokens probably should check and create a token if it is so. But I don't think that is what they do now.) The upshot of all of this, Producers don't care which token they reuse. If it was from the original loop, or from the result of the last call to token = stream.next(token), both are equally good. The token pre-existed and needs to be fully reset. Consumers presume that the token was produced (or at least appropriately re-initialized and filled in) by a producer. Your form of the loop is very advisable in a few places. Most typically with a loop within a loop, with the inner looping over all the tokens in a stream. In this case, the final Token would be created outside the outer loop. Using your pattern, there would encourage maximal reuse. Using mine, the programmer would have to figure out when it was appropriate to do one or the other. The other value to your pattern is that next(Token) is always called with a non-null Token. I think that calling the token "result" is not the best. It is a bit confusing as it is not the result of calling next(Token). Perhaps, to make reuse acutely obvious: final Token reusableToken = new Token(); for (Token token = stream.next(reusableToken); token != null; token = stream.next(reusableToken)) { This 'final' pattern is indeed more clear about reuse.
But still would like to clarify on what can the TokenStream assume. I think The new patch applies cleanly.
A few tests are failing though - TestChineseTokenizer for one. I'm looking into it, might be that a problem is in the test. Seems start/end offset were lost in ChineseTokenizer.
Adding this in flush(Token) will fix it. token.setStartOffset(start); token.setEndOffset(start+length); Possibly true for other places where new Token() that takes start/end offsets was replaced by clear() and setTermBuffer().
I agree – technically speaking, whenever a Token is returned from a source/filter's next(Token) method, anything is allowed to happen do it (including any & all changes, and subsequent reuse in future calls to next(Token)) and so the current pattern will run correctly if all sources & filters are implemented correctly. This is the contract in the reuse API. It's just that it looks spooky, when you are consuming tokens, not to create & re-use your own reusable token. I think it's also possible (but not sure) that the JRE can compile/run the "single reusable token" pattern more efficienctly, since you are making many method calls with a constant (for the life time of the for loop) single argument, but this is pure speculation on my part... I think from a code-smell standpoint I'd still like to use the "single re-use" pattern when applicable. DM I'll make this change & post a new patch. I'm also seeing the test failures.
OK, I'll fold this into my changes to the patch. Thanks Doron!
Actually I think one should never pass null to next(Token) API – ie a source token stream need not check for null. Thanks Mike!
One more... While looking into the failure of TestSingleTokenTokenFilter I saw that if these lines are executed: Token t1 = new Token(); Token t2 = t1.clone(); boolean isEqual = t1.equals(t2) Surprisingly, isEqual = false Do you see this too?
Hmm, indeed I do see this too – but this is because Token has never overridden "equals" right?
Funny I was about to comment on the same thing... In fact when the reuse API was introduced I believe null was suppose to mean - "nothing to be reused, please just create your own". In case of a producer that cannot reuse - say it creates its own implementation of Token - then there is no point in creating tokens by the consumer that will never be reused. But this also meant that in all the common cases, all tokenizers would need an additional if() to verify that the reusable token is not null. Not so nice. So yes, I agree with you, just need to clarify this in TokenStream.nextToken(Token)'s javadocs.
I agree; I'll do this in next patch iteration...
Yes you're right. I was under the impression for a moment Object's equals() works like clone() and goes in one layer only... that's stupid of course, it just compares the object references (or nulls). I wonder how this test ever passed before... Oh I see it now, - trunk's of SingleTokenTokenStream never called Token.clone() while patched version calls it twice. So definitely, Token should implement equals().
I agree. This is technically a break in backwards compatibility, but I think it's OK?
I think so. I couldn't leave this without seeing that test passing... so here is the code for Token.equals() in case you didn't write that part yet -
@Override public boolean equals(Object obj) { if (this == obj) return true; Token other = (Token) obj; return termLength == other.termLength && startOffset == other.startOffset && endOffset == other.endOffset && flags == other.flags && positionIncrement == other.positionIncrement && subEqual(termBuffer, other.termBuffer) && subEqual(type, other.type) && subEqual(payload, other.payload); } private boolean subEqual(Object o1, Object o2) { if (o1==null) return o2==null; return o1.equals(o2); } I'm unable to get javacc to run cleanly on PrecedenceQueryParser.jj. It produces this:
bash-3.2$ javacc ./src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj Java Compiler Compiler Version 4.0 (Parser Generator) (type "javacc" with no arguments for help) Reading from file ./src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj . . . org.javacc.parser.ParseException: Encountered "<<" at line 658, column 3. Was expecting one of: <STRING_LITERAL> ... "<" ... Detected 1 errors and 0 warnings. Does anyone else hit this?
Thanks Doron; I'll merge into my version ok now I'm curious cause I still can't see that bug...
For PrecedenceQueryParser, Is there a way to use ant to run javacc on contrib/misc? OK new patch attached:
I couldn't find a way... there is a TODO comment in it's build.xml saying something about it though!
OK I'll let you diff
Yes I saw that, but since I don't have a javacc executable.
OK I can see them now... you were right as usual Patch applies cleanly and I'm running the test now, so far all looks good but my PC is not that fast and it will take some more time to complete. I hope to review later tonight or tomorrow morning. Back from a trip. Would have jumped in to help, but it looks like you've found and fixed a couple of my mistakes/oversights. Thanks.
Regarding the equals implementation. Perhaps it would also be good to finish the canonical implementation and provide hashcode? That would allow for storage in sets and maps. (Not terribly sure how that would be useful. But I can imagine sorting on startOffsets to organize the contents.) Regarding the changes to the jj files. I also made the corresponding changes in their generate java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match. I agree that Token next(Token) should assume a non-null. Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception. All tests pass here.
131 files were modified - I reviewed core and part of contrib, with minor comments only:
I think that
TestSnowball.testFilterTokens() then changes to: public void testFilterTokens() throws Exception { final Token tok = new Token(2, 7, "wrd"); tok.setTermBuffer("accents"); tok.setPositionIncrement(3); Payload tokPayload = new Payload(new byte[]{0,1,2,3}); tok.setPayload(tokPayload); int tokFlags = 77; tok.setFlags(tokFlags); SnowballFilter filter = new SnowballFilter( new TokenStream() { public Token next(Token token) { return tok; } }, "English" ); Token newtok = filter.next(new Token()); assertEquals("accent", newtok.term()); assertEquals(2, newtok.startOffset()); assertEquals(7, newtok.endOffset()); assertEquals("wrd", newtok.type()); assertEquals(3, newtok.getPositionIncrement()); assertEquals(tokFlags, newtok.getFlags()); assertEquals(tokPayload, newtok.getPayload()); } I forgot to say (in last patch): I also added Token.reinit() methods.
OK I'll update this test case.
OK I'll update comment in TokenFilter.
I ended up doing the same.
OK I'll add.
Agreed, I'll add.
Maybe we need a variant of clone that takes a new term buffer & start/end offsets, and creates a new token but with the new term buffer & start/end offsets you've passed in? Analagous to reinit, but first making a cloned token. Attached new patch w/ above changes.
Caching flies in the face of reuse. I think that a comment needs to be some where to that effect.
Putting Tokens into a collection requires that the reusable token be copied. I.e. via new or clone. One cannot directly store the reusable Token, i.e. the argument from Token next(Token), nor the value to be returned from it. If a caching TokenStream is also resettable, then that means that the tokens coming from it need to be protected from being changed. This means that they need to return a copy. (Perhaps comment reset() to that effect?) The only value I see in caching is if the computation of the token stream is so expensive that re-using it has a significant savings. (The current code does not take such care and is a bug. This patch fixes it. It would have become obvious if the cache were used in the context of reuse.) Some TokenStreams cache Tokens internally (as a detail of their implementation) and then return them incrementally. Many of these can be rewritten to compute the next Token when next(Token) is called. This would improve both time and space usage. (I felt that such changes were outside the scope of this patch.) All this leads to my response to the NGram filter. The NGram filters could be improved in this fashion. This would eliminate the clone() problem noted above. But failing that, a variant of clone to solve the intermediate problem would work. So would using new Token(...). The problem with using new Token() is that it requires manual propagation of flags, payloads, offsets and types and is not resilient to future fields.
There's a patch for this in
I figured this out: you have to use javacc 3.2 (not 4.0 or beyond) for contrib/misc, then "ant javacc" runs fine (with the patch from Attached new patch:
I think we are close! I diffed current patch and previous one and all seems correct to me.
One tiny thing in Token - in two locations there's this code: setTermBuffer(newTermBuffer, newTermOffset, newTermLength); setTermLength(newTermLength); Second call is redundant since first call already sets termLength. Woops, you're right – new patch attached.
Still with previous patch - all core tests passed, but then contrib compilation failed.
I tried ant clean contrib-test (to not repeat core test) and still compilation errors: [javac] Compiling 27 source files to build\contrib\analyzers\classes\test
[javac] contrib\analyzers\src\test\org\apache\lucene\analysis\miscellaneous\TestSingleTokenTokenFilter.java:23: cannot find symbol
[javac] symbol : class LuceneTestCase
[javac] location: package org.apache.lucene.util
[javac] import org.apache.lucene.util.LuceneTestCase;
[javac] ^
In yesterday's version all tests passed. Wait...
"ant clean test-comtrib" was a bad idea, because there's a missing dependency of contrib-test in test-compile. So "ant clean compile-test test-contrib" doesn't have the compile errors above. The errors that got me started on this were in contrib/analyzers but with the new patch and after clean, all contrib tests passed! The SinkTokenizer changes don't seem right to me. The token is already cloned in the add() method, no need to clone again in the next() call.
At least, that's my understanding based on trying to piece together the changes being proposed here.
It is because SinkTokenizer implements reset(). SinkTokenizer needs to ensure that the subsequent times it returns the same token that the token is actually the same. A token is not immutable, so SinkTokenizer needs to ensure that the tokens in its list are immutable. The token that is added to the list can be a reusable Token. If it is added without cloning, some other user of the token might change it and thus the list changes. When the token is removed from the list and returned from next, it is presumed to be a reusable token. (This is true whether next() or next(Token) is called.) If it does not return a clone then some other user of the token might change it and thus the internal list changes. This is only a problem when reset is implemented. This is because a single token can be returned more than once from the TokenStream. If SinkTokenizer did not implement reset, then next() would not need to create a clone. My point is it is already cloned when it is added to the list, so now it is being cloned twice, and I think the second clone is extraneous. In other words, it already is returning a clone from next. The only way that Token could be changed is by doing it via the getTokens() method, which would have to be done outside of the Analysis process, which I would argue does not violate the reusability process.
Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that it was cheaper to do a new. Now, maybe that is fixed in this issue, but I am not sure. But how do you cope with reset()?
Consider this:
Point is that if step 2 did not return clones, the consumer that invoked step 2 might have overridden the tokens it got from step 2. Now in step 4 sink "thinks" it returns clones of original tokens, unaware that the consumer of step 2 modified them. Sorry, you guys are right. My bad. Does look like we improved some of the cloning costs, though.
I was going on hearsay when I uniformly used clone() rather than new when dealing with creating a deep copy of an existing token. I was under the impression that clone was faster than new to do equivalent work. The test is rather simple and worthy of doing before accepting this issue. I don't think I have time to do it today. The equivalent of clone is (done from memory, so this is close): While this might be faster, there are two flaws with this that clone avoids, clone has direct access to the parts and avoids method calls and also is future proof. If a new field is added to Token, it will automatically be carried forward. There are a couple of places in the code where:
This problem is a bug in the existing code. Today, one can create a chain of TokenFilters, each of which calls input.next() or input.next(token), and any one of which modifies the return value. It does not matter which is invoked. If the token returned is held in a cache then the cache is corrupted. Every cache of Tokens needs to ensure that it's cache is immutable on creation. It also needs to ensure that it is immutable on usage if the tokens can be served more than once. Two personal opinions:
I'm not using sink tokenizer (yet) so not sure if the following worths it, but...
One simple possibility to avoid those extra clones is to add a way to allow notifying the sink that reset() will not be called anymore. I.e. that for each token to be consumed from now on, this is the last time it s consumed. This can be added in the constructor, or a special setter can be added for that, such as: disableRest(). I think the main performance issue with clone in the old Token was that it had to do the initTermBuffer before, even though clone already knows what size the buffer should be, etc. It looks like this is fixed now, so it may be a non-issue.
Regarding the implementation of hashCode:
You are using the following: private static int hashCode(int i) { return new Integer(i).hashCode(); } This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary. While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal. And may cause unnecessary collisions. It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be): public int hashCode() { initTermBuffer(); int code = termLength; code = code * 31 + startOffset; code = code * 31 + endOffset; code = code * 31 + flags; code = code * 31 + positionIncrement; code = code * 31 + type.hashCode(); code = (payload == null ? code : code * 31 + payload.hashCode()); code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength); return code; } Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.) I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.
Duh, I didn't realize that's the hashCode function for Integer. I like you're new hashCode.
I like this, but maybe name it reinit(Token)?
I think we could use it in further places (I didn't search exhaustively). DM if you can pull together a patch w/ these fixes that'd be great!
The typical pattern I used is for TokenFilters: public Token next(final Token reusableToken) { assert reusableToken != null; Token nextToken = input.next(reusableToken); if (nextToken != null) return null; .... Do something with nextToken .... return nextToken; } and for other TokenStreams: public Token next(final Token reusableToken) { assert reusableToken != null; .... Do something with reusableToken .... return reusableToken; } and for looping over a TokenStream: final Token reusableToken = new Token(); for (Token nextToken = stream.next(reusableToken); nextToken != null; nextToken = stream.next(reusableToken)) { .... Do something with nextToken .... }
Patch & changes look good, thanks DM.
I attached new patch with tiny changes:
I plan to commit in a day or two! I just committed this. Thanks DM! This was one humongous patch – almost 10K lines. I sure hope we didn't break anything
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. JavaDoc is improved (as always, there is still room for improvement. For example, it says the field type is interned, but it is not.)
2. Deprecated the Token constructors taking a String.
3. Changed the allocation policy to be less aggressive.
5. Optimized the growing of the internal termBuffer immediately followed by storing a new term. In doing this added, setTermBuffer(String) and setTermBuffer(String, int, int). Setting from a string is roughly the same cost as setting from a char[].
6. TokenStream has next() has been deprecated. The javadoc has been updated to recommend next(Token) over next().
7. Rather than modifying Term to take a Token, public String term() has been added. With termText() still deprecated, this gives upgraders a clean choice to use term() or termBuffer(), with the knowledge of the performance differences.
I also updated TestToken to test all the changes.
Left to do: (I'd like to get a nod of whether I need to make further changes to the supplied patch before doing #4)
4. Changing of the remainder of core and contrib to remove calls to deprecated Token and TokenStream methods, i.e. to use the reuse API.