Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      All

    • Lucene Fields:
      New

      Description

      This was discussed in the thread (not sure which place is best to reference so here are two):
      http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
      or to see it all at once:
      http://www.gossamer-threads.com/lists/lucene/java-dev/62851

      Issues:
      1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
      2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should all be deprecated.
       3. The allocation policy is too aggressive. With large tokens the resulting buffer can be significantly over-allocated. A less aggressive algorithm would be better; the Python example in the thread is good because it is computationally simple (see the sketch after this list).
       4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. In core, the deprecated methods are used by the Query classes; the remaining uses are in contrib, mostly in the analyzers.
      5. Some internal optimizations can be done with regard to char[] allocation.
       6. TokenStream has next() and next(Token); next() should be deprecated so that reuse is maximized, and descendant classes should be rewritten to override next(Token).
       7. Tokens are often stored as a String in a Term. It would be good to add constructors that take a Token. This would simplify the use of the two together.
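
       For item 3, a rough sketch of the kind of less aggressive growth policy suggested in the thread (the method name and its home in ArrayUtil are assumptions here, not the committed code):

         // Sketch only: grow by about 1/8 plus a small constant, Python-list style,
         // instead of doubling on every growth.
         public static int getNextSize(int targetSize) {
           return targetSize + (targetSize >> 3) + (targetSize < 9 ? 3 : 6);
         }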

      1. LUCENE-1333-xml-query-parser.patch
        4 kB
        DM Smith
      2. LUCENE-1333-wordnet.patch
        4 kB
        DM Smith
      3. LUCENE-1333-wikipedia.patch
        38 kB
        DM Smith
      4. LUCENE-1333-snowball.patch
        4 kB
        DM Smith
      5. LUCENE-1333-queries.patch
        5 kB
        DM Smith
      6. LUCENE-1333-miscellaneous.patch
        11 kB
        DM Smith
      7. LUCENE-1333-memory.patch
        11 kB
        DM Smith
      8. LUCENE-1333-lucli.patch
        1 kB
        DM Smith
      9. LUCENE-1333-instantiated.patch
        6 kB
        DM Smith
      10. LUCENE-1333-highlighter.patch
        10 kB
        DM Smith
      11. LUCENE-1333-core.patch
        23 kB
        DM Smith
      12. LUCENE-1333-analyzers.patch
        111 kB
        DM Smith
      13. LUCENE-1333-analysis.patch
        32 kB
        DM Smith
      14. LUCENE-1333a.txt
        19 kB
        DM Smith
      15. LUCENE-1333.patch
        19 kB
        Michael McCandless
      16. LUCENE-1333.patch
        25 kB
        DM Smith
      17. LUCENE-1333.patch
        292 kB
        DM Smith
      18. LUCENE-1333.patch
        327 kB
        Michael McCandless
      19. LUCENE-1333.patch
        341 kB
        Michael McCandless
      20. LUCENE-1333.patch
        343 kB
        Michael McCandless
      21. LUCENE-1333.patch
        343 kB
        Michael McCandless
      22. LUCENE-1333.patch
        415 kB
        DM Smith
      23. LUCENE-1333.patch
        415 kB
        Michael McCandless

        Issue Links

          Activity

          mikemccand Michael McCandless added a comment -

          I just committed this. Thanks DM! This was one humongous patch – almost 10K lines. I sure hope we didn't break anything.

          mikemccand Michael McCandless added a comment -

          Patch & changes look good, thanks DM.

          I attached new patch with tiny changes:

          • Fixed javadoc warnings
          • Removed extra unnecessary comparison in ArrayUtil.getShrinkSize

          I plan to commit in a day or two!

          dmsmith DM Smith added a comment -
          • Added reinit(Token ...) methods to initialize one token from another.
          • Improved hashCode.
          • Made Token next(Token) have a final argument everywhere it is implemented to clarify the "best practice" for reuse and did the same for helper methods in various classes. As part of this, I renamed the token retrieved by next(Token) to be nextToken.

          The typical pattern I used for TokenFilters is:

          public Token next(final Token reusableToken) {
            assert reusableToken != null;
            Token nextToken = input.next(reusableToken);
            if (nextToken == null)
              return null;
            .... Do something with nextToken ....
            return nextToken;
          }
          

          and for other TokenStreams:

          public Token next(final Token reusableToken) {
            assert reusableToken != null;
             .... Do something with reusableToken ....
             return reusableToken;
          }
          

          and for looping over a TokenStream:

          final Token reusableToken = new Token();
          for (Token nextToken = stream.next(reusableToken); nextToken != null; nextToken = stream.next(reusableToken)) {
              .... Do something with nextToken ....
          }
          
          • Improved Payload.clone() to avoid new if possible.
          • Removed last remaining calls to termText() in a *.jj file.
          mikemccand Michael McCandless added a comment -

          This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

          Duh, I didn't realize that's the hashCode function for Integer. I like your new hashCode.

          I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.

          I like this, but maybe name it reinit(Token)?

          Also, are the reinit methods used?

          I think we could use it in further places (I didn't search exhaustively).

          DM if you can pull together a patch w/ these fixes that'd be great!

          dmsmith DM Smith added a comment - - edited

          Regarding the implementation of hashCode:
          You are using the following:

            private static int hashCode(int i) {
              return new Integer(i).hashCode();
            }
          

          This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

          While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal and may cause unnecessary collisions.

          It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be):

            public int hashCode() {
              initTermBuffer();
              int code = termLength;
              code = code * 31 + startOffset;
              code = code * 31 + endOffset;
              code = code * 31 + flags;
              code = code * 31 + positionIncrement;
              code = code * 31 + type.hashCode();
              code = (payload == null ? code : code * 31 + payload.hashCode());
              code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength);
              return code;
            }
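
          For reference, the ArrayUtil.hashCode(char[], int, int) helper assumed above could be as simple as a 31-based rolling hash over the used part of the buffer (a sketch, not necessarily the exact implementation):

            public static int hashCode(char[] array, int start, int end) {
              int code = 0;
              for (int i = end - 1; i >= start; i--)
                code = code * 31 + array[i];
              return code;
            }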
          

          Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.)
          (never mind. I see that they are! super! But I'm working up a patch for this and a couple of minor optimizations that affect Token)

          I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.

          gsingers Grant Ingersoll added a comment -

          I think the main performance issue with clone in the old Token was that it had to do the initTermBuffer before, even though clone already knows what size the buffer should be, etc. It looks like this is fixed now, so it may be a non-issue.

          doronc Doron Cohen added a comment -

          I'm not using SinkTokenizer (yet) so I'm not sure if the following is worth it, but...

          One simple possibility to avoid those extra clones is to add a way to notify the sink that reset() will not be called anymore.

          I.e. for each token consumed from now on, this is the last time it is consumed.

          This could be done in the constructor, or a special setter could be added for it, such as disableReset().
          There would not be an enableReset().
          When disabled, next() would not clone, and reset() would throw an exception.
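
          A minimal sketch of what that could look like on a sink (the class, field and method names here are assumptions for illustration, not SinkTokenizer's actual API):

          import java.util.ArrayList;
          import java.util.List;

          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenStream;

          public class SimpleSink extends TokenStream {
            private final List tokens = new ArrayList();
            private int index = 0;
            private boolean resetDisabled = false;

            public void add(Token t) {
              tokens.add(t.clone());            // the sink owns its own copies
            }

            /** Caller promises never to call reset() again, so next() may skip cloning. */
            public void disableReset() {
              resetDisabled = true;
            }

            public Token next(final Token reusableToken) {
              if (index >= tokens.size())
                return null;
              Token stored = (Token) tokens.get(index++);
              return resetDisabled ? stored : (Token) stored.clone();
            }

            public void reset() {
              if (resetDisabled)
                throw new IllegalStateException("reset() has been disabled for this sink");
              index = 0;
            }
          }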

          dmsmith DM Smith added a comment -

          Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that it was cheaper to do a new. Now, maybe that is fixed in this issue, but I am not sure.

          I was going on hearsay when I uniformly used clone() rather than new when dealing with creating a deep copy of an existing token. I was under the impression that clone was faster than new to do equivalent work.

          The test is rather simple and worth doing before accepting this issue. I don't think I have time to do it today.

          The equivalent of clone is (done from memory, so this is close):
          Token token = new Token(oldToken.startOffset(), oldToken.endOffset(), oldToken.getFlags(), oldToken.type());
          token.setPositionIncrement(oldToken.positionIncrement());
          if (oldToken.getPayload() != null) {
          Payload p = new Payload(....); // Create a new Payload with a deep copy of the payload
          }

          While this might be faster, there are two flaws that clone avoids: clone has direct access to the parts and avoids method calls, and it is future-proof. If a new field is added to Token, it will automatically be carried forward.

          There are a couple of places in the code where:
          public Token(Token prototype) // only if new is faster
          and
          public void copyFrom(Token prototype)
          would beneficially solve these maintenance issues.
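
          Roughly, those two members would live in Token and could look like this (a sketch built only from Token's public getters/setters; the committed patch went with reinit-style methods instead):

          public Token(Token prototype) {
            this(prototype.startOffset(), prototype.endOffset(), prototype.type());
            copyFrom(prototype);
          }

          public void copyFrom(Token prototype) {
            // assumes the prototype's term buffer has been initialized
            setTermBuffer(prototype.termBuffer(), 0, prototype.termLength());
            setStartOffset(prototype.startOffset());
            setEndOffset(prototype.endOffset());
            setType(prototype.type());
            setFlags(prototype.getFlags());
            setPositionIncrement(prototype.getPositionIncrement());
            setPayload(prototype.getPayload()); // shares the Payload; clone it if a deep copy is needed
          }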

          But how do you cope with reset()?

          This problem is a bug in the existing code. Today, one can create a chain of TokenFilters, each of which calls input.next() or input.next(token), and any one of which modifies the return value. It does not matter which is invoked. If the token returned is held in a cache then the cache is corrupted. Every cache of Tokens needs to ensure that its contents are immutable on creation. It also needs to ensure that they remain immutable in use if the tokens can be served more than once.

          Two personal opinions:

          • Caches that don't implement reset should return cache.remove(0) [or equivalent] so it is clear that the cache can only be used once.
          • Caches should not be used except when it gives a clear performance advantage.
          gsingers Grant Ingersoll added a comment -

          Sorry, you guys are right. My bad. Does look like we improved some of the cloning costs, though.

          doronc Doron Cohen added a comment -

          But how do you cope with reset()?

          Consider this:

          • step 1: tokens added to the sink. They are cloned, so sink "owns" them.
          • step 2: sink is used, until exhausted (until its next(Token) returns null).
          • step 3: sink.reset() is invoked.
          • step 4: sink is used again, until exhaustion.

          Point is that if step 2 did not return clones, the consumer that invoked step 2 might have overwritten the tokens it got from step 2. Now in step 4 the sink "thinks" it returns clones of the original tokens, unaware that the consumer in step 2 modified them.

          gsingers Grant Ingersoll added a comment -

          My point is it is already cloned when it is added to the list, so now it is being cloned twice, and I think the second clone is extraneous. In other words, it already is returning a clone from next. The only way that Token could be changed is by doing it via the getTokens() method, which would have to be done outside of the Analysis process, which I would argue does not violate the reusability process.

          Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that it was cheaper to do a new. Now, maybe that is fixed in this issue, but I am not sure.

          dmsmith DM Smith added a comment -

          The SinkTokenizer changes don't seem right to me. The token is already cloned in the add() method, no need to clone again in the next() call.

          At least, that's my understanding based on trying to piece together the changes being proposed here.

          It is because SinkTokenizer implements reset(). SinkTokenizer needs to ensure that each subsequent time it returns the same token, that token actually is the same.

          A token is not immutable, so SinkTokenizer needs to ensure that the tokens in its list are immutable.

          The token that is added to the list can be a reusable Token. If it is added without cloning, some other user of the token might change it and thus the list changes.

          When the token is removed from the list and returned from next, it is presumed to be a reusable token. (This is true whether next() or next(Token) is called.) If it does not return a clone then some other user of the token might change it and thus the internal list changes. This is only a problem when reset is implemented. This is because a single token can be returned more than once from the TokenStream.

          If SinkTokenizer did not implement reset, then next() would not need to create a clone.

          gsingers Grant Ingersoll added a comment -

          The SinkTokenizer changes don't seem right to me. The token is already cloned in the add() method, no need to clone again in the next() call.

          At least, that's my understanding based on trying to piece together the changes being proposed here.

          doronc Doron Cohen added a comment -

          Wait...
          "ant clean test-comtrib" was a bad idea, because there's a missing dependency of contrib-test in test-compile.
          So "ant clean compile-test test-contrib" doesn't have the compile errors above.
          The errors that got me started on this were in contrib/analyzers but with the new patch and after clean, all contrib tests passed!

          doronc Doron Cohen added a comment -

          Still with previous patch - all core tests passed, but then contrib compilation failed.
          I tried ant clean contrib-test (to not repeat the core tests) and still got compilation errors:

              [javac] Compiling 27 source files to build\contrib\analyzers\classes\test
              [javac] contrib\analyzers\src\test\org\apache\lucene\analysis\miscellaneous\TestSingleTokenTokenFilter.java:23: cannot find symbol
              [javac] symbol  : class LuceneTestCase
              [javac] location: package org.apache.lucene.util
              [javac] import org.apache.lucene.util.LuceneTestCase;
              [javac]                              ^
          

          In yesterday's version all tests passed.
          Let me check recent patch.

          mikemccand Michael McCandless added a comment -

          Woops, you're right – new patch attached.

          doronc Doron Cohen added a comment -

          I diffed current patch and previous one and all seems correct to me.

          One tiny thing in Token - in two locations there's this code:

          setTermBuffer(newTermBuffer, newTermOffset, newTermLength);
          setTermLength(newTermLength);
          

          The second call is redundant since the first call already sets termLength.
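
          For context, the relevant part of setTermBuffer(char[], int, int) behaves roughly like this (a sketch, not the exact source), which is why the extra call is redundant:

          public void setTermBuffer(char[] buffer, int offset, int length) {
            if (termBuffer == null || termBuffer.length < length)
              termBuffer = new char[length];  // the real code over-allocates via ArrayUtil
            System.arraycopy(buffer, offset, termBuffer, 0, length);
            termLength = length;              // the length is recorded here already
          }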

          mikemccand Michael McCandless added a comment -

          Attached new patch:

          • Fixed comments about caching & cloning
          • Added partial clone method, that clones all except replaces termBuffer & start/end offset; fixed NGramTokenFilter and CompoundWordTokenFilterBase to use this method
          • Fixed bug in PrecedenceQueryParser.jj that became obvious once I ran javacc

          I think we are close!

          mikemccand Michael McCandless added a comment -

          I'm unable to get javacc to run cleanly on PrecedenceQueryParser.jj.

          I figured this out: you have to use javacc 3.2 (not 4.0 or beyond) for contrib/misc, then "ant javacc" runs fine (with the patch from LUCENE-1353).

          doronc Doron Cohen added a comment -

          I also made the corresponding changes in their generated java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.

          I ended up doing the same.

          There's a patch for this in LUCENE-1353 ...

          dmsmith DM Smith added a comment -

          Caching flies in the face of reuse. I think that a comment needs to be somewhere to that effect.
          Putting Tokens into a collection requires that the reusable token be copied. I.e. via new or clone. One cannot directly store the reusable Token, i.e. the argument from Token next(Token), nor the value to be returned from it.

          If a caching TokenStream is also resettable, then that means that the tokens coming from it need to be protected from being changed. This means that they need to return a copy. (Perhaps comment reset() to that effect?)

          The only value I see in caching is if the computation of the token stream is so expensive that re-using it has a significant savings.

          (The current code does not take such care, which is a bug. This patch fixes it. It would have become obvious if the cache were used in the context of reuse.)

          Some TokenStreams cache Tokens internally (as a detail of their implementation) and then return them incrementally. Many of these can be rewritten to compute the next Token when next(Token) is called. This would improve both time and space usage.

          (I felt that such changes were outside the scope of this patch.)

          All this leads to my response to the NGram filter.

          The NGram filters could be improved in this fashion. This would eliminate the clone() problem noted above.

          But failing that, a variant of clone to solve the intermediate problem would work. So would using new Token(...). The problem with using new Token() is that it requires manual propagation of flags, payloads, offsets and types and is not resilient to future fields.
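
          To make the copy-before-caching point concrete, a consumer that stores tokens from a reuse-style stream could do something like this (illustrative only):

          List cache = new ArrayList();
          final Token reusableToken = new Token();
          for (Token nextToken = stream.next(reusableToken); nextToken != null; nextToken = stream.next(reusableToken)) {
            // deep copy before storing; nextToken may be overwritten by the next call to next(Token)
            cache.add(nextToken.clone());
          }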

          mikemccand Michael McCandless added a comment -

          Attached new patch w/ above changes.

          mikemccand Michael McCandless added a comment -

          CompoundWordTokenFilterBase - createToken() calls clone() which deep
          copies the char array, and then calls also setTermBuffer() which iterates the chars again.

          Maybe we need a variant of clone that takes a new term buffer & start/end offsets, and creates a new token but with the new term buffer & start/end offsets you've passed in? Analogous to reinit, but first making a cloned token.
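
          Something along these lines, perhaps (a sketch; the exact signature that ends up in the patch may differ):

          /** Clone everything except the term text and offsets, which are replaced. */
          public Token clone(char[] newTermBuffer, int newTermOffset, int newTermLength,
                             int newStartOffset, int newEndOffset) {
            Token t = new Token(newStartOffset, newEndOffset);
            t.setTermBuffer(newTermBuffer, newTermOffset, newTermLength);
            t.setType(type());
            t.setFlags(getFlags());
            t.setPositionIncrement(getPositionIncrement());
            if (getPayload() != null)
              t.setPayload((Payload) getPayload().clone());
            return t;
          }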

          mikemccand Michael McCandless added a comment -

          I forgot to say (in last patch): I also added Token.reinit() methods.

          I think that LUCENE-1350's failing test can be included here?

          OK I'll update this test case.

          I find the comment in TokenFilter about implementing next() vs. next(Token) confusing.
          Perhaps use here the same comment as in Tokenizer.

          OK I'll update comment in TokenFilter.

          I also made the corresponding changes in their generated java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.

          I ended up doing the same.

          Perhaps it would also be good to finish the canonical implementation and provide hashcode?

          OK I'll add.

          Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception.

          Agreed, I'll add.

          doronc Doron Cohen added a comment -

          I think that LUCENE-1350's failing test can be included here?

          TestSnowball.testFilterTokens() then changes to:

            public void testFilterTokens() throws Exception {
              final Token tok = new Token(2, 7, "wrd");
              tok.setTermBuffer("accents");
              tok.setPositionIncrement(3);
              Payload tokPayload = new Payload(new byte[]{0,1,2,3});
              tok.setPayload(tokPayload);
              int tokFlags = 77;
              tok.setFlags(tokFlags);
          
              SnowballFilter filter = new SnowballFilter(
                  new TokenStream() {
                    public Token next(Token token) {
                      return tok;
                    }
                  },
                  "English"
              );
          
              Token newtok = filter.next(new Token());
          
              assertEquals("accent", newtok.term());
              assertEquals(2, newtok.startOffset());
              assertEquals(7, newtok.endOffset());
              assertEquals("wrd", newtok.type());
              assertEquals(3, newtok.getPositionIncrement());
              assertEquals(tokFlags, newtok.getFlags());
              assertEquals(tokPayload, newtok.getPayload());
            }
          
          doronc Doron Cohen added a comment -

          All tests pass here.
          131 files were modified - I reviewed core and part of contrib, with minor comments only:

          • CompoundWordTokenFilterBase - createToken() calls clone() which deep
            copies the char array, and then calls also setTermBuffer() which iterates the chars again.
            This can come into play in some analyzers. Usually I would ignore things like this, but this
            issue is so much about reusing and avoiding unneeded reallocation that I decided to
            bring this up. Actually there is no reallocation, just re-copying, so maybe it is not too bad?
            • Similarly in NGramTokenFilter - seems termBuffer would be copied twice for each "secondary" token.
          • Cloning: behavior regarding cloning is modified in few places.
            • In SingleTokenTokenStream - adding cloning - I think it is correct.
            • In TokenTypeSinkTokenizer.add(Token) cloning was removed because it is
              taken care of in super.add(). I first thought it was a bug, but no, the patch is correct.
          • I find the comment in TokenFilter about implementing next() vs. next(Token) confusing.
            Perhaps use here the same comment as in Tokenizer.
          dmsmith DM Smith added a comment -

          Back from a trip. Would have jumped in to help, but it looks like you've found and fixed a couple of my mistakes/oversights. Thanks.

          Regarding the equals implementation. Perhaps it would also be good to finish the canonical implementation and provide hashcode? That would allow for storage in sets and maps. (Not terribly sure how that would be useful. But I can imagine sorting on startOffsets to organize the contents.)

          Regarding the changes to the jj files: I also made the corresponding changes in their generated java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.

          I agree that Token next(Token) should assume a non-null argument. Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception.

          doronc Doron Cohen added a comment -

          For PrecedenceQueryParser... there is a TODO comment in its build.xml saying something about it though!

          Yes, I saw that, but I don't have a javacc executable.
          Maybe I'll look at that build.xml later.

          OK I'll let you diff

          OK I can see them now... you were right as usual

          Patch applies cleanly and I'm running the test now, so far all looks good but my PC is not that fast and it will take some more time to complete. I hope to review later tonight or tomorrow morning.

          mikemccand Michael McCandless added a comment -

          For PrecedenceQueryParser, is there a way to use ant to run javacc on contrib/misc?

          I couldn't find a way... there is a TODO comment in its build.xml saying something about it though!

          ok now I'm curious cause I still can't see that bug...

          OK, I'll let you diff. And please look closely to see if there are any other bugs! These equals() methods are tricky!

          mikemccand Michael McCandless added a comment -

          OK new patch attached:

          • Created Token.equals and Payload.equals – this fixed
            TestSingleTokenTokenFilter.
          • Switched to the "final reusableToken" pattern.
          • Added to TokenStream.next javadoc stating that parameter should
            never be null
          • Fixed test failures (all tests should pass now)
          doronc Doron Cohen added a comment -

          ok now I'm curious cause I still can't see that bug...

          For PrecedenceQueryParser, is there a way to use ant to run javacc on contrib/misc?

          mikemccand Michael McCandless added a comment -

          here is the code for Token.equals() in case you didn't write that part yet

          Thanks Doron; I'll merge into my version. Both your version and my version had bugs, so it's great you posted yours! And I hope the merged result has no bugs.

          mikemccand Michael McCandless added a comment -

          I'm unable to get javacc to run cleanly on PrecedenceQueryParser.jj. It produces this:

          bash-3.2$ javacc ./src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj
          Java Compiler Compiler Version 4.0 (Parser Generator)
          (type "javacc" with no arguments for help)
          Reading from file ./src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj . . .
          org.javacc.parser.ParseException: Encountered "<<" at line 658, column 3.
          Was expecting one of:
              <STRING_LITERAL> ...
              "<" ...
              
          Detected 1 errors and 0 warnings.
          

          Does anyone else hit this?

          doronc Doron Cohen added a comment -

          I couldn't leave this without seeing that test passing... so here is the code for Token.equals() in case you didn't write that part yet -

            @Override
            public boolean equals(Object obj) {
              if (this == obj)
                return true;
             
              Token other = (Token) obj;
              
              return
                termLength == other.termLength &&
                startOffset == other.startOffset &&  
                endOffset == other.endOffset &&
                flags == other.flags &&
                positionIncrement == other.positionIncrement && 
                subEqual(termBuffer, other.termBuffer) &&
                subEqual(type, other.type) &&
                subEqual(payload, other.payload);
            }
          
            private boolean subEqual(Object o1, Object o2) {
              if (o1==null)
                return o2==null; 
              return o1.equals(o2);
            }
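
          One detail to watch when merging: subEqual(termBuffer, other.termBuffer) relies on equals() for a char[], which is reference equality, so the term text needs an element-wise comparison over the used length - for example (a sketch):

            private static boolean termEquals(char[] a, char[] b, int length) {
              if (a == b) return true;
              if (a == null || b == null) return false;
              for (int i = 0; i < length; i++)
                if (a[i] != b[i]) return false;
              return true;
            }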
          
          doronc Doron Cohen added a comment -

          This is technically a break in backwards compatibility, but I think it's OK?

          I think so.
          I feel good about this whole change; it will make the reuse chain clearer, especially once all deprecations can be removed.

          mikemccand Michael McCandless added a comment -

          So definitely, Token should implement equals().

          I agree. This is technically a break in backwards compatibility, but I think it's OK?

          doronc Doron Cohen added a comment -

          Hmm, indeed I do see this too - but this is because Token has never overridden "equals" right?

          Yes, you're right. I was under the impression for a moment that Object's equals() works like clone() and goes in one layer only... that's stupid of course, it just compares the object references (or nulls). I wonder how this test ever passed before... Oh, I see it now - trunk's SingleTokenTokenStream never called Token.clone(), while the patched version calls it twice. So definitely, Token should implement equals().

          mikemccand Michael McCandless added a comment -

          just need to clarify this in TokenStream.next(Token)'s javadocs.

          I agree; I'll do this in next patch iteration...

          doronc Doron Cohen added a comment -

          Actually I think one should never pass null to next(Token) API - ie a source token stream need not check for null.

          Funny I was about to comment on the same thing...

          In fact, when the reuse API was introduced I believe null was supposed to mean - "nothing to be reused, please just create your own".

          In the case of a producer that cannot reuse - say it creates its own implementation of Token - there is no point in the consumer creating tokens that will never be reused.

          But this also meant that in all the common cases, all tokenizers would need an additional if() to verify that the reusable token is not null. Not so nice.

          So yes, I agree with you, just need to clarify this in TokenStream.next(Token)'s javadocs.

          mikemccand Michael McCandless added a comment -

          Do you see this too?

          Hmm, indeed I do see this too – but this is because Token has never overridden "equals" right?

          doronc Doron Cohen added a comment -

          Thanks Mike!

          One more...

          While looking into the failure of TestSingleTokenTokenFilter I saw that if these lines are executed:

          Token t1 = new Token();
          Token t2 = (Token) t1.clone();
          boolean isEqual = t1.equals(t2);
          

          Surprisingly, isEqual = false

          Do you see this too?

          mikemccand Michael McCandless added a comment -

          (I think producer tokens probably should check and create a token if it is so. But I don't think that is what they do now.)

          Actually I think one should never pass null to next(Token) API – ie a source token stream need not check for null.

          mikemccand Michael McCandless added a comment -

          Seems start/end offset were lost in ChineseTokenizer.

          OK, I'll fold this into my changes to the patch. Thanks Doron!

          mikemccand Michael McCandless added a comment -

          I'm also seeing the test failures.

          mikemccand Michael McCandless added a comment -

          But I would still like to clarify what the TokenStream can assume. I think
          TokenStream cannot assume anything about the token it gets as input, and,
          once it returned a token, it cannot assume anything about how that token
          is used. So why should it not expect being passed the token it just returned?

          The upshot of all of this: producers don't care which token they reuse.

          I agree – technically speaking, whenever a Token is returned from a source/filter's next(Token) method, anything is allowed to happen to it (including any & all changes, and subsequent reuse in future calls to next(Token)), and so the current pattern will run correctly if all sources & filters are implemented correctly. This is the contract in the reuse API.

          It's just that it looks spooky, when you are consuming tokens, not to create & re-use your own reusable token. I think it's also possible (but not sure) that the JRE can compile/run the "single reusable token" pattern more efficienctly, since you are making many method calls with a constant (for the life time of the for loop) single argument, but this is pure speculation on my part...

          I think from a code-smell standpoint I'd still like to use the "single re-use" pattern when applicable. DM I'll make this change & post a new patch.
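
          As a usage sketch of that "single reusable token" pattern (illustrative only, assuming a plain WhitespaceAnalyzer; this is not code from the patch):

          import java.io.IOException;
          import java.io.StringReader;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.WhitespaceAnalyzer;

          public class ReuseLoopExample {
            public static void main(String[] args) throws IOException {
              TokenStream stream = new WhitespaceAnalyzer()
                  .tokenStream("field", new StringReader("some example text"));

              // One Token allocated for the lifetime of the loop; the stream refills it
              // (or hands back a different Token) on each call to next(Token).
              final Token reusableToken = new Token();
              List terms = new ArrayList();
              for (Token token = stream.next(reusableToken);
                   token != null;
                   token = stream.next(reusableToken)) {
                // Copy the term out: the buffer may be overwritten on the next call,
                // so we must not hold on to it.
                terms.add(new String(token.termBuffer(), 0, token.termLength()));
              }
              System.out.println(terms); // [some, example, text]
            }
          }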

          Hide
          doronc Doron Cohen added a comment -

          Seems start/end offset were lost in ChineseTokenizer.
          Adding this in flush(Token) will fix it.

          token.setStartOffset(start);
          token.setEndOffset(start+length);
          

          Possibly true for other places where a new Token(...) constructor call taking start/end offsets was replaced by clear() and setTermBuffer().

          Hide
          doronc Doron Cohen added a comment -

          The new patch applies cleanly.
          A few tests are failing though - TestChineseTokenizer for one.
          I'm looking into it, might be that a problem is in the test.

          Hide
          doronc Doron Cohen added a comment -

          This 'final' pattern is indeed more clear about reuse.

          But still would like to clarify on what can the TokenStream assume. I think
          TokenStream cannot assume anything about the token it gets as input, and,
          once it returned a token, it cannot assume anything about how that token
          is used. So why should it not expect being passed the token it just returned?

          Hide
          dmsmith DM Smith added a comment -

          I'll give my analysis here. Feel free to make the change or kick it back to me to make it, if you think your pattern is best. (If I do it, it will be after this weekend.)

          I've tried to be consistent here. The prior pattern was inconsistent and often was:

           Token token = null;
            while ((token = input.next()) != null) {
          

          There were other variants, including "forever loops". As you noticed, I replaced these with a consistent for-loop pattern.
          There are two basic implementations of Token next(Token):
          1) Producer: These create tokens from input. Their pattern is to take their argument and call clear on it and then set startOffset, endOffset and type appropriately. Their assumption is that they have to start with a pristine token and that other than space, there is nothing about the token that is passed in that can be reused.

          2) Consumer: These "filter" their argument. Their only assumption is that in the call chain that there was a producer that created the token that they need to reuse. In this case, they typically will preserve startOffset and endOffset because those are to represent the position of the token in the input. They may refine type, flags and payload, but otherwise have to preserve them. Most typically, they will set the termBuffer. There are a few types of consumers. Here are some:
          a) Transformational Filters: They take their argument and transform its termBuffer. (A sketch of one appears at the end of this comment.)
          b) Splitting Filters: They take their argument and split the token into several. Sometimes they will return the original; other times just the parts. When creating these tokens, calling clone() on the prototype will preserve flags, payloads, start and end offsets and type. These clones are sometimes stored in a buffer, but sometimes are incrementally computed with each call to next(Token). With the latter, they will typically cache a clone of the passed in token. I think that, when possible, incremental computation is preferable, but at the cost of a less obvious implementation.
          c) Caching Filters: If their buffer is empty, they repeatedly call result = input.next(token), clone the result, and cache the clones in some collection. Once full, they return their buffer's content. If the caching filter is resettable, it must return clones of its content; otherwise, downstream consumers may change their arguments, disastrously.

          Callers of Token next(Token) have the responsibility of never calling with a null token. (I think producer tokens probably should check and create a token if it is so. But I don't think that is what they do now.)

          The upshot of all of this, Producers don't care which token they reuse. If it was from the original loop, or from the result of the last call to token = stream.next(token), both are equally good. The token pre-existed and needs to be fully reset. Consumers presume that the token was produced (or at least appropriately re-initialized and filled in) by a producer.

          Your form of the loop is very advisable in a few places, most typically with a loop within a loop, where the inner loop iterates over all the tokens in a stream. In this case, the final Token would be created outside the outer loop. Your pattern would encourage maximal reuse; with mine, the programmer would have to figure out when it was appropriate to do one or the other.

          The other value to your pattern is that next(Token) is always called with a non-null Token.

          I think that calling the token "result" is not the best. It is a bit confusing as it is not the result of calling next(Token). Perhaps, to make reuse acutely obvious:

           final Token reusableToken = new Token();
           for (Token token = stream.next(reusableToken); token != null; token = stream.next(reusableToken)) {
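
          To make the "transformational filter" case above concrete, here is a hedged sketch (the class name is invented for the example and this is not code from the patch):

          import java.io.IOException;
          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;

          // A consumer of type (a): it reuses the Token the producer filled in,
          // preserving offsets, type, flags and payload, and only rewrites the
          // term buffer (here, lower-casing it in place).
          public class ExampleLowerCaseFilter extends TokenFilter {
            public ExampleLowerCaseFilter(TokenStream input) {
              super(input);
            }

            public Token next(Token reusableToken) throws IOException {
              Token token = input.next(reusableToken);
              if (token == null) {
                return null; // end of stream
              }
              char[] buffer = token.termBuffer();
              int length = token.termLength();
              for (int i = 0; i < length; i++) {
                buffer[i] = Character.toLowerCase(buffer[i]);
              }
              return token;
            }
          }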
          
          Hide
          mikemccand Michael McCandless added a comment -

          DM, one pattern that makes me nervous is this, from QueryTermVector.java:

                    for (Token next = stream.next(new Token()); next != null; next = stream.next(next)) {
          

          I don't think you should be "recycling" that next and passing it back in the next time you call stream.next, because a TokenStream is not required to use the Token you had passed in and so suddenly you are potentially asking it to re-use a token it had previously returned, which it may not expect. Likely it won't matter but I think this is still safer:

                    final Token result = new Token();
                    for (Token next = stream.next(result); next != null; next = stream.next(result)) {
          
          Hide
          mikemccand Michael McCandless added a comment -

          OK since you pulled it all together under this issue, I think we should commit this one instead of LUCENE-1350. I'll review the [massive] patch – thanks DM!

          Hide
          dmsmith DM Smith added a comment -

          This patch includes all the previous ones.

          Note: It includes the functionality solving LUCENE-1350. If this patch is applied before LUCENE-1350, then that issue is resolved. If it is done after then the patch will need to be rebuilt.

          I did not do the "reuse" API mentioned in LUCENE-1350.

          Hide
          dmsmith DM Smith added a comment -

          This issue supersedes LUCENE-1350, incorporating all the changes in it. This either needs to go in after it or instead of it.

          Hide
          dmsmith DM Smith added a comment -

          All the code has been migrated to use the reuse interface. I have run all the tests and they pass. (There is a weird dependency: one test depends upon demo. I hacked that to work.) I did not test to see if the code was any slower or faster. I'm not sure how one would do that. I think it should be faster since it doesn't bounce back and forth between termText and termBuffer.

          I did not improve the code to use char[] instead of String. The places that can be improved call Token.setTermBuffer(String). I don't think these are necessary to this issue being resolved.

          There are a couple of other minor opportunities for char[] manipulation via o.a.l.util.ArrayUtil:

          Here and there, there is a need to remove leading and trailing whitespace from input before (sub-)tokenizing it. Currently this is done with String.trim(), even when working with char[]. It is sub-optimal as marshalling it into a String involves allocation and copying. Likewise, so does getting it out. It would be better to have int trim(char[]) which shifts leading spaces off the front and returns the length of the string without trailing spaces.
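
          A hedged sketch of the kind of helper meant here (it is not part of o.a.l.util.ArrayUtil; the extra length argument is an assumption so that a partially filled term buffer can be handled):

          // Hypothetical utility, illustrative only: trims whitespace in place,
          // shifting the remaining characters to the front, and returns the new length.
          public static int trim(char[] buffer, int length) {
            int start = 0;
            while (start < length && Character.isWhitespace(buffer[start])) {
              start++;
            }
            int end = length;
            while (end > start && Character.isWhitespace(buffer[end - 1])) {
              end--;
            }
            int trimmed = end - start;
            if (start > 0) {
              System.arraycopy(buffer, start, buffer, 0, trimmed);
            }
            return trimmed;
          }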

          There is a "randomize" routine that shuffles an array. While this is only used in one place, it appears to be general purpose array manipulation.

          Hide
          dmsmith DM Smith added a comment -

          for contrib xml-query-parser

          Hide
          dmsmith DM Smith added a comment -

          for contrib wordnet

          Hide
          dmsmith DM Smith added a comment -

          for contrib wikipedia

          Hide
          dmsmith DM Smith added a comment -

          for contrib queries

          Hide
          dmsmith DM Smith added a comment -

          for contrib misc

          Hide
          dmsmith DM Smith added a comment -

          for contrib memory

          Hide
          dmsmith DM Smith added a comment -

          for contrib lucli

          Hide
          dmsmith DM Smith added a comment -

          For contrib instantiated

          Hide
          dmsmith DM Smith added a comment -

          for contrib highlighter

          Hide
          dmsmith DM Smith added a comment -

          for contrib snowball

          Hide
          dmsmith DM Smith added a comment -

          This is for contrib analyzers.

          Hide
          dmsmith DM Smith added a comment -

          This patch covers the rest of core Lucene.

          Hide
          dmsmith DM Smith added a comment -

          I've broken up the changes to core and contrib into small patches. My reasoning for this is that it is a lot easier for me to keep up with other people's changes and re-issue a small patch.

          Apply LUCENE-1333.patch before any of the following. I don't know if there is an order dependency for any of the following.

          This one is for core in o.a.l.analysis but not the files contained in LUCENE-1333.patch

          Hide
          dmsmith DM Smith added a comment - - edited

          Better comments based upon migrating all of core and contrib. This patch replaces the prior two.

          Marked fields as deprecated, with documentation that they should be made private and setters/getters should be used instead.

          Hide
          mikemccand Michael McCandless added a comment -

          This patch looks good; thanks DM!

          I made a few small changes & attached a new rev of the patch:

          • Fixed Token.setTermLength to throw IllegalArgumentException if you
            pass in a length > termBuffer.length (a rough sketch follows this list)
          • Changed Token.growTermBuffer to use oal.util.ArrayUtil (it has
            that same growth logic)
          • Changed if statements in Token.growTermBuffer to first handle the
            [I think most frequent] case where termBuffer is already
            allocated.
          • Javadoc/whitespace
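
          A rough sketch of the setTermLength check from the first item (the field names are assumptions about Token's internals, not the committed code):

          public void setTermLength(int length) {
            if (length > termBuffer.length) {
              throw new IllegalArgumentException("length " + length
                  + " exceeds the size of the termBuffer (" + termBuffer.length + ")");
            }
            termLength = length;
          }
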
          Hide
          dmsmith DM Smith added a comment -

          The following has been addressed in this patch:
          1. JavaDoc is improved (as always, there is still room for improvement. For example, it says the field type is interned, but it is not.)

          2. Deprecated the Token constructors taking a String.

          3. Changed the allocation policy to be less aggressive (a sketch of this kind of policy appears at the end of this comment).

          5. Optimized the growing of the internal termBuffer immediately followed by storing a new term. In doing this, added setTermBuffer(String) and setTermBuffer(String, int, int). Setting from a String is roughly the same cost as setting from a char[].

          6. TokenStream's next() has been deprecated. The javadoc has been updated to recommend next(Token) over next().

          7. Rather than modifying Term to take a Token, public String term() has been added. With termText() still deprecated, this gives upgraders a clean choice to use term() or termBuffer(), with the knowledge of the performance differences.

          I also updated TestToken to test all the changes.

          Left to do: (I'd like to get a nod on whether I need to make further changes to the supplied patch before doing #4)
          4. Changing of the remainder of core and contrib to remove calls to deprecated Token and TokenStream methods, i.e. to use the reuse API.
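
          To make item 3's "less aggressive" concrete, here is a sketch of one such policy (illustrative only; the class and method names are invented, and this is not the code in the patch):

          // Grow by roughly half the current capacity rather than doubling, so a
          // large token does not over-allocate as badly.
          public final class GrowthPolicySketch {
            public static int nextSize(int currentCapacity, int neededSize) {
              if (currentCapacity >= neededSize) {
                return currentCapacity; // already big enough
              }
              int grown = currentCapacity + (currentCapacity >> 1); // ~1.5x
              return Math.max(grown, neededSize);
            }

            public static void main(String[] args) {
              // e.g. a 16-char buffer asked to hold 20 chars grows to 24
              System.out.println(nextSize(16, 20)); // prints 24
            }
          }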


            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              dmsmith DM Smith
            • Votes:
              0
              Watchers:
              1
