LUCENE-1794

implement reusableTokenStream for all contrib analyzers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Most contrib analyzers do not have an implementation of reusableTokenStream.

      Regardless of how expensive the back-compat reflection is for indexing speed, I think we should do this to mitigate any performance costs. Hey, overall it might even be an improvement!

      The back-compat code for non-final analyzers is already in place, so this is easy money in my opinion.
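
      For context, a minimal sketch of the reuse idiom the core analyzers already follow and that this issue brings to contrib. The SavedStreams holder and the getPreviousTokenStream()/setPreviousTokenStream() thread-local accessors are existing Analyzer machinery; WhitespaceTokenizer and LowerCaseFilter here are just illustrative stand-ins for whatever chain a given analyzer builds:

      import java.io.IOException;
      import java.io.Reader;
      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.LowerCaseFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.WhitespaceTokenizer;

      public class ExampleAnalyzer extends Analyzer {

        // per-thread holder for the previously built chain
        private static final class SavedStreams {
          Tokenizer source;    // produces tokens from the Reader
          TokenStream result;  // end of the filter chain handed to consumers
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
          // non-reusing path: build a fresh chain on every call
          return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }

        public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
          SavedStreams streams = (SavedStreams) getPreviousTokenStream();
          if (streams == null) {
            // first call on this thread: build the chain and remember it
            streams = new SavedStreams();
            streams.source = new WhitespaceTokenizer(reader);
            streams.result = new LowerCaseFilter(streams.source);
            setPreviousTokenStream(streams);
          } else {
            // subsequent calls: point the saved chain at the new input
            streams.source.reset(reader);
          }
          return streams.result;
        }
      }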

      Attachments

      1. LUCENE-1794.patch
        56 kB
        Robert Muir
      2. LUCENE-1794.patch
        57 kB
        Robert Muir
      3. LUCENE-1794.patch
        60 kB
        Robert Muir
      4. LUCENE-1794.patch
        74 kB
        Robert Muir
      5. LUCENE-1794.patch
        82 kB
        Robert Muir
      6. LUCENE-1794.patch
        96 kB
        Robert Muir
      7. LUCENE-1794-reusing-analyzer.patch
        6 kB
        Shai Erera
      8. LUCENE-1794_fix.patch
        11 kB
        Robert Muir
      9. LUCENE-1794_fix2.txt
        15 kB
        Robert Muir


          Activity

          Robert Muir added a comment -

          reusableTokenStream + tests for the contrib analyzers. Only a few are non-final and require the back-compat code.

          Robert Muir added a comment -

          No code changes, but this improves the reusableTokenStream tests for cn and smartcn to also test offsets, verifying that reset() works correctly.

          Robert Muir added a comment -

          I moved this to 2.9. If there are concerns, please feel free to change this, but I think it is ready.

          Shai Erera added a comment -

          Robert - wouldn't it make sense to pull SavedStreams (maybe call it ReusableStreams?) up to Analyzer, and have all the extensions use it? I couldn't help but notice that this code is duplicated in all the Analyzers.

          Also, and I don't know if it's a matter for a different issue - the fact that reusableTokenStream accepts a field name is misleading. On one hand, it makes you think you can ask for a.rts("a") and a.rts("b") safely, but on the other it is documented to not be that safe (i.e., don't call this method if you need more than one token stream from an analyzer at the same time).

          I don't know how to solve it best - I'd like to have a tokenStream method that accepts the field name, and that I can get a reused token stream, for that field name. But I also would like to have a method that I can call "get a reusable token stream" and "I don't care which field it is". So maybe have two variants:

          1. reusableTokenStream(Reader reader)
          2. reusableTokenStream(String field, Reader reader)

          This is kind of related to LUCENE-1678, as I think we'd like tokenStream to return a reused one, but maybe having a tokenStream which always returns a new one, and a reusableTokenStream (w/o a field) which reuses a stream (maybe the 'default' stream), would be good.

          What do you think?

          Robert Muir added a comment -

          Robert - wouldn't it make sense to pull SavedStreams (maybe call it ReusableStreams?) up to Analyzer, and have all the extensions use it? I couldn't help but notice that this code is duplicated in all the Analyzers.

          Shai, it would be great if somehow this could be factored. It's not complete duplication; different things need to happen here: for example, Thai and Smart Chinese have filters that keep state and require a reset(). But I don't know; it seems like it could be factored into Analyzer, with reset() called on both tokenizer and filters...

          I am trying to imagine a situation where refactoring this kind of thing would prevent some flexibility, but I think if tokenstreams keep state in some weird way they should implement reset() for this purpose.

          Also, and I don't know if it's a matter for a different issue - the fact that reusableTokenStream accepts a field name is misleading.

          Probably to support PerFieldAnalyzerWrapper, is my first thought. How would PerFieldAnalyzerWrapper work correctly if the field is not supplied?
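
          A simplified sketch of why the field name matters here (paraphrased, not the actual PerFieldAnalyzerWrapper source; the class and field names below are illustrative): the wrapper uses the field name to pick the delegate analyzer, and reuse bookkeeping is handled by whichever analyzer it delegates to.

          import java.io.IOException;
          import java.io.Reader;
          import java.util.HashMap;
          import java.util.Map;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;

          public class FieldDelegatingAnalyzer extends Analyzer {
            private final Map<String, Analyzer> analyzerPerField =
                new HashMap<String, Analyzer>();
            private final Analyzer defaultAnalyzer;

            public FieldDelegatingAnalyzer(Analyzer defaultAnalyzer) {
              this.defaultAnalyzer = defaultAnalyzer;
            }

            public TokenStream tokenStream(String fieldName, Reader reader) {
              return getAnalyzer(fieldName).tokenStream(fieldName, reader);
            }

            public TokenStream reusableTokenStream(String fieldName, Reader reader)
                throws IOException {
              // reuse is delegated: each wrapped analyzer keeps its own saved streams
              return getAnalyzer(fieldName).reusableTokenStream(fieldName, reader);
            }

            private Analyzer getAnalyzer(String fieldName) {
              Analyzer analyzer = analyzerPerField.get(fieldName);
              return analyzer != null ? analyzer : defaultAnalyzer;
            }
          }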

          Uwe Schindler added a comment - edited

          different things need to happen here: for example, Thai and Smart Chinese have filters that keep state and require a reset(). But I don't know; it seems like it could be factored into Analyzer, with reset() called on both tokenizer and filters...

          I am trying to imagine a situation where refactoring this kind of thing would prevent some flexibility, but I think if tokenstreams keep state in some weird way they should implement reset() for this purpose.

          reset() is always called by IndexWriter before consuming the TokenStream; end() is called as the last operation on the TokenStream.

          And each TokenFilter should of course pass the call on to its input TokenStream... the default impl does this.
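
          As a concrete illustration of that contract, a state-keeping filter only needs to clear its own state and then delegate, as the default TokenFilter.reset() does. This is a hedged sketch, not the actual ThaiWordFilter source; the class name and the pendingState field are illustrative:

          import java.io.IOException;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.util.AttributeSource;

          public final class StatefulFilter extends TokenFilter {
            private AttributeSource.State pendingState; // leftover per-stream state

            public StatefulFilter(TokenStream input) {
              super(input);
            }

            public boolean incrementToken() throws IOException {
              // a real filter would consume pendingState before pulling new tokens
              return input.incrementToken();
            }

            public void reset() throws IOException {
              super.reset();       // default TokenFilter behavior: input.reset()
              pendingState = null; // clear leftover state so the stream can be reused
            }
          }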

          Robert Muir added a comment -

          reset() is always called by IndexWriter before consuming the TokenStream; end() is called as the last operation on the TokenStream.

          Uwe, this may be the case for IndexWriter, but if I do not explicitly call it like so in ThaiAnalyzer, then ThaiWordFilter's reset() is not invoked:

            streams.source.reset(reader);
            streams.result.reset(); // reset the ThaiWordFilter's state <-- right here
          

          By calling reset() I can ensure it happens regardless of what is consuming the token stream (such as my tests!)... maybe this is overkill?

          Shai Erera added a comment -

          I actually meant to pull the class SavedStreams up to Analyzer, and leave the rest of the logic in each Analyzer impl (I don't think there can be a default impl for all). SavedStreams holds a Tokenizer and a TokenFilter, which is what every reusable token stream needs.

          As for PerFieldAnalyzerWrapper --> it will still have a reusableTS(field) version to call. But when you call this method, you guarantee you reuse a TS for that field only.

          But if IW needs to call this method for every field it parses, then it can only call reusableTokenStream(field), and therefore I wonder how that optimization works...

          Robert Muir added a comment -

          Shai: I see what you are saying wrt SavedStreams...
          but in my opinion the biggest problem with the current setup is that it feels fragile... forcing me to create separate tests for reusableTS() to ensure it's doing what TS() does.

          If you can think of a better way to implement this patch, I'd love to hear it, because I'm not thrilled with what I had to do either; I just didn't see a better way.
          But really it's implemented the same way the core analyzers (Standard/Simple/Stop) are done.

          Robert Muir added a comment - edited

          I am thinking of expanding this patch to include reset() impls for state-keeping tokenizers/filters that do not currently have an analyzer...
          Not really part of this issue, but it presents a problem for someone trying to create a custom analyzer (with reusableTS) based on these components.

          Yonik Seeley added a comment -

          I am thinking of expanding this patch to include reset() impls for state-keeping tokenizers/filters that do not currently have an analyzer...

          +1

          In the past we've always encouraged people to create their own analyzers by plugging together the provided filters - we should keep this simple to do.

          Now the problem: TokenStream.reset() has different semantics than what we are using it for here. CachingTokenFilter uses it to start the replay of the last string of tokens it saw (definitely bad for reuse).

          So.... do we redefine reset()? Something like CachingTokenFilter is a very special case, and I don't feel like it should have its own method. There are other ways it could be implemented now anyway.

          Robert Muir added a comment -

          Now the problem: TokenStream.reset() has different semantics than what we are using it for here. CachingTokenFilter uses it to start the replay of the last string of tokens it saw (definitely bad for reuse).

          Hmm, I have not looked at CachingTokenFilter; sounds like an issue there.
          In this case I was referring to state-keeping tokenizers/filters without reset() impls in contrib: Ngram, shingles, things like that.
          I looked at the core tokenstreams and most of those seemed to properly support it... I guess we have one exception.

          Yonik Seeley added a comment -

          For something like CachingTokenFilter to work, the implementation of all other TokenFilters' reset() is still what we want - "get rid of your state because we are starting over". So it's really only the caching-type filters that have a semantic clash (i.e., does reset() mean replay what I've seen before, or start over with new input), and perhaps we can start off by saying "don't use these in conjunction with reusable token streams".
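
          A paraphrased sketch of the clash (modeled on, but not copied from, CachingTokenFilter; the class name is illustrative): here reset() means "replay the cached tokens", whereas the reusable filters above treat reset() as "discard state, fresh input is coming":

          import java.io.IOException;
          import java.util.Iterator;
          import java.util.LinkedList;
          import java.util.List;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.util.AttributeSource;

          public class ReplayingFilter extends TokenFilter {
            private List<AttributeSource.State> cache;
            private Iterator<AttributeSource.State> iterator;

            public ReplayingFilter(TokenStream input) {
              super(input);
            }

            public boolean incrementToken() throws IOException {
              if (cache == null) {
                // first pass: consume the input and record every token
                cache = new LinkedList<AttributeSource.State>();
                while (input.incrementToken()) {
                  cache.add(captureState());
                }
                iterator = cache.iterator();
              }
              if (!iterator.hasNext()) {
                return false; // the cache is exhausted
              }
              restoreState(iterator.next());
              return true;
            }

            public void reset() throws IOException {
              if (cache != null) {
                iterator = cache.iterator(); // rewind: replay, don't clear
              }
            }
          }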

          Robert Muir added a comment -

          Yonik, I see your point.

          "get rid of your state because we are starting over" just happens to equal "Resets this stream to the beginning." 90% of the time,
          but sometimes these things are different...

          I would really like to see it easier to have reusableStreams in the future, both for a person writing a custom analyzer and in an "automatic" case like TokenizerChain

          Robert Muir added a comment -

          Adds reset() impls for ngram/* and compound/*. These still need tests, and there are still a few more to be done.

          DM Smith added a comment -

          If CachingTokenFilter.reset() means rewind, then how does one reset it?

          Without that, it is not reusable.

          When I did the prior token/filter changes, IIRC, some were not reusable. Maybe reset() needs to be documented such that for reuse, the common case, it means reset (or a no-op), and otherwise means rewind. And then document which classes are not reusable.

          Or should these single-use classes be made reusable? This would argue for a rewind() method. IMHO it is a bug that reset() does not follow the contract.

          Mark Miller added a comment -

          This would argue for a rewind() method.

          Is that a crazy idea? It almost seems like we should have reset, requiring an impl now, and add rewind, making it optional as reset used to be?

          Barring the back-compat issues, isn't that how this should be set up?

          Robert Muir added a comment -

          Personally I like this idea. CachingTokenFilter is not final though, so you are right that back compat will need some thought, but it makes sense.

          Yonik Seeley added a comment -

          I don't think we need a rewind() at all on TokenStream - it's too special-case.

          Mark Miller added a comment -

          So what do you propose? You would just cast to CachingTokenFilter if you want rewind?

          I've never really been a fan of these methods that are optionally implemented... and you must know the type to know it can rewind.

          So what about making reset required (and working the same across all TokenStreams) and adding rewind to CachingTokenFilter then?

          Mark Miller added a comment -

          Or even a Rewindable interface that can be implemented if a TokenStream supports rewind?

          Then, if you really needed it, you could use instanceof to check for support.

          Yonik Seeley added a comment -

          It depends on the use case for CachingTokenFilter.
          When it's used in places like QueryParser.getFieldQuery(), the consumer creates the CachingTokenFilter and can rewind it too.

          If one has managed to use the same instance more than once in the same document, other tricks could be used, such as resetting to the beginning after false is returned from incrementToken(), or implementing rewind in end(). Seems like either would work.

          But in reality, the concept of CachingTokenFilter isn't really compatible with the concept of reuse at all... so I don't think we necessarily need to do anything except document that it's not reusable. Adding rewind() to TokenStream won't solve this semantic problem.

          Robert Muir added a comment -

          Adds reusable/reset impls for shingles, snowball, and memory/synonym.
          memory/synonym had no previous tests AFAIK.
          Tests are still needed for compound, ngram, and shingles reset().
          memory/PatternAnalyzer still does not use reusableTS,
          and there are two wrappers, shingle/ShingleAnalyzerWrapper and query/QueryAutoStopWordAnalyzer, that should be fixed and tested.

          Unfortunately something came up at work, so I may be slow on this; if you want to jump in, please help!
          Let me know what you are tackling, and I will do my best to work this issue late at night to get it resolved.

          Robert Muir added a comment -

          With ShingleAnalyzerWrapper and tests; I plan to do QueryAutoStopWord the same way... off to work.

          Shai Erera added a comment -

          Robert, what I meant about pulling SavedStreams up to Analyzer (a few comments above) was to do something like this:

          class Analyzer {
            protected static class Streams {
              public Tokenizer tokenizer;
              public TokenStream tokenStream;
            }
            ...
          }

          class MyAnalyzer extends Analyzer {
            public TokenStream reusableTokenStream(String field, Reader reader) {
              Streams streams = getPrevTS();
              if (streams == null) {
                streams = new Streams();
                streams.tokenizer = new Tokenizer();     // i.e., some concrete Tokenizer
                streams.tokenStream = new TokenStream(); // the filter chain over it
                setPrevTS(streams);
              } else {
                streams.tokenizer.reset(reader);
                streams.tokenStream.reset();
              }
              return streams.tokenStream;
            }
          }


          This will just save the declaration of SavedStreams or Streams in all sub-classes. In addition we can do the following:

          1. Define reset(String, Reader) on Streams, so that everyone just calls streams.reset(), instead of resetting tokenizer and tokenStream. Streams will do that internally.
          2. Define a protected abstract getTokenizer() on Analyzer that all Analyzers implement. (due to back-compat, this can throw UOE - let's leave it for now).
          3. Have Analyzer's reusableTokenStream look like the following:
            public TokenStream reusableTokenStream(String field, Reader reader) {
              Streams streams = getPreviousTokenStream();
              if (streams == null) {
                streams = new Streams();
                streams.tokenizer = getTokenizer(field, reader);
                streams.tokenStream = tokenStream();
                setPrevTS(streams);
              } else {
                streams.reset(field, reader);
              }
              return streams.tokenStream;
            }
            

          And that can be even more simplified by having Streams define a ctor which accepts a Tokenizer and a TokenStream. We can also, instead of doing "new Streams()", call a method newStreams() so that subclasses can override it if they want to provide a different Streams impl. Not a must, and we might even consider making the whole thing final (Streams, reusableTokenStream, etc.).

          That will save some code in that patch I believe. What do you think?

          I haven't touched the back-compat issues yet - let's discuss the idea first.

          Shai Erera added a comment -

          Also Robert, I've looked at the patch and I see that you don't call reset() on streams.result, just on source. I think you should call that on streams.result too. You do it in TestSynonymTokenFilter.

          If we go w/ my proposal above, such issues will not happen, since there will be only one copy of reusableTokenStream (at least for the majority of analyzers).

          Robert Muir added a comment -

          Shai, first of all let me address your first comment... for the saved streams thing.

          One issue would be how to implement AnalyzerWrappers with that. With the existing functionality I am able to make this work; see my impl for ShingleAnalyzerWrapper for an example.

          As for your second comment: I only call reset() on streams.result when there is a state-keeping TokenFilter on that chain. It is not necessary to invoke it if reset() is a no-op...

          Robert Muir added a comment -

          Shai, thinking about this some more: as long as the behavior could be overridden for the extreme case,
          I think it would make some sense to have a 'default' implementation (what you are proposing does make sense)
          that could be overridden in the strange cases if the thing is non-final.

          But now my problem child is memory/PatternAnalyzer; its source of tokens is not a Tokenizer.
          So the getTokenizer() method does not make sense in that case... (I guess it could throw UOE there, but it starts to sound like we are dancing around the problem.)

          Shai Erera added a comment -

          I only call reset() on streams.result when there is a state-keeping TokenFilter on that chain. It is not necessary to invoke it if reset() is a no-op...

          What if one of those streams becomes a state-keeper some day? I don't think that calling reset() and having it do nothing will be expensive, no?

          But now my problem child is memory/PatternAnalyzer; its source of tokens is not a Tokenizer.

          It could just override reusableTokenStream and do what it wants, no?

          I must admit that I still have in mind the current TokenStream and Analyzer API. Therefore my suggestion may not be 100% compatible w/ AttributeSource and the new stuff. But my gut feeling tells me there has to be a way to remove all those unnecessary impls in all Analyzers. The following default impl seems too obvious to say we cannot do it:

          protected TokenStream internalTokenStream(TokenStream input) {
            // do something
          }

          protected Tokenizer getTokenizer(Reader reader) {
            // do something
          }

          public TokenStream tokenStream(String field, Reader reader) {
            TokenStream result = getTokenizer(reader);
            result = internalTokenStream(result);
            return result;
          }

          public TokenStream reusableTokenStream(String field, Reader reader) {
            Streams streams = getPrevTS();
            if (streams == null) {
              streams = new Streams();
              streams.tokenizer = getTokenizer(reader);
              streams.tokenStream = internalTokenStream(streams.tokenizer);
              setPrevTS(streams);
            } else {
              streams.reset(reader);
            }
            return streams.tokenStream;
          }


          I'll try to impl it tomorrow, using the new API and see how it goes.

          Robert Muir added a comment -

          What if one of those streams becomes a state-keeper some day? I don't think that calling reset() and having it do nothing will be expensive, no?

          It could also become completely incompatible with reuse in some way (example: CachingTokenFilter); in that case reset() will not help either.

          It could just override reusableTokenStream and do what it wants, no?

          Oh, definitely; my concern was what should getTokenizer() do in that case? I guess as long as getTokenizer() is not public and is documented as an optional operation, then I would not have a problem with this case...

          I'll try to impl it tomorrow, using the new API and see how it goes.

          Please do, I think it's a great idea! I am just bringing up the ridiculous cases in contrib; as long as it's flexible enough that things can be overridden in those cases, it could save a lot of heartache and maintenance hassle.

          Do you have any ideas on the back compat issues?

          Shai Erera added a comment -

          We only need getTokenizer because TokenStream.reset() does not accept a Reader. If we could introduce such a method on TokenStream, we wouldn't need to refer to Tokenizer directly.

          Do you have any ideas on the back compat issues?

          Well, it's a bit trickier ... today we call reusableTokenStream in our indexing code, and either get a new instance or a reused instance. We cannot change Analyzer's default behavior, which returns a new instance (unless we're willing to break back-compat), because Analyzers that did not override reusableTokenStream may break if we start reusing the instance by default (for example, if they add two fields to a document w/ reusableTokenStream called twice).

          Also, deprecating reusableTokenStream, defining a new one (say reuseTokenStream), and moving to use it is not good either, since we want its default impl to reuse the token stream, and impls that did not override it may break.

          So how about we create a new abstract ReusingAnalyzer which impls reusableTokenStream to always reuse, and we add Streams to Analyzer as a protected static class. That way, Analyzers that don't care about reuse can still extend Analyzer. Analyzers which care about reuse and are fine w/ ReusingAnalyzer's impl can move to extend it. And Analyzers that care about reuse but want their reuse to be done differently can choose to extend ReusingAnalyzer, or Analyzer.

          Back-compat wise, we're safe since:

          1. Existing Lucene Analyzers that reuse can be changed to extend ReusingAnalyzer.
          2. Existing Analyzers (outside Lucene code) either override or not reusableTokenStream, and therefore won't break.
          3. Our indexing code will still call reusableTokenStream, no change here.
          4. Any code out there which traverses an Analyzer by calling reusableTokenStream does not need to change anything.

          I think that'd work?
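
          A minimal sketch of what such a ReusingAnalyzer could look like; ReusingAnalyzer, Streams, and the createStreams() hook are names proposed in this discussion, not committed Lucene API:

          import java.io.IOException;
          import java.io.Reader;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.Tokenizer;

          public abstract class ReusingAnalyzer extends Analyzer {

            protected static class Streams {
              public Tokenizer tokenizer;     // source, reset with each new Reader
              public TokenStream tokenStream; // outer end of the filter chain
            }

            // subclasses build their tokenizer + filter chain once per thread
            protected abstract Streams createStreams(String fieldName, Reader reader)
                throws IOException;

            public TokenStream reusableTokenStream(String fieldName, Reader reader)
                throws IOException {
              Streams streams = (Streams) getPreviousTokenStream();
              if (streams == null) {
                streams = createStreams(fieldName, reader);
                setPreviousTokenStream(streams);
              } else {
                streams.tokenizer.reset(reader); // point the saved chain at new input
                streams.tokenStream.reset();     // clear any state kept by filters
              }
              return streams.tokenStream;
            }
          }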

          Robert Muir added a comment -

          Shai, works for me.

          I will keep working on this patch, but if you get ReusingAnalyzer together, I can easily move 95% of the code to it.
          It is doing the tests that take forever anyway, and they will not change.

          Thank you!

          Robert Muir added a comment -

          The tests will not pass for the upcoming patch unless LUCENE-1801 is applied first.

          This is because I test that if you interrupt a ShingleFilter, the type is correct (and it is currently not being cleared correctly).

          Robert Muir added a comment -

          The rest of the analyzers and tests, except memory/PatternAnalyzer. The design of the tokenstreams used by this analyzer does not support reusableTS, and it will need to be redesigned to make this happen.

          Yonik Seeley added a comment -

          Patch looks good - do you plan on committing soon, Robert?

          Robert Muir added a comment -

          Yonik, thanks for reviewing it.
          I wanted to wait a bit and see if Shai wanted to take a crack at ReusingAnalyzer, but we could do that as a separate issue and then refactor the code to use it?

          Yonik Seeley added a comment -

          Yes, I think we should just commit this now - the most important part is that people can create their own reusable tokenstreams from Lucene's tokenizers and token filters. Making an easier to use ReusingAnalyzer can be a separate issue.

          Robert Muir added a comment -

          Yonik, ok, I will look over the patch again, but I plan on committing this tonight or tomorrow if nothing comes up.

          Shai Erera added a comment -

          Apologies for the late post; I had a busy weekend. The attached patch includes ReusingAnalyzer, Streams in Analyzer, and javadocs.

          Robert, please have a look. I think extending it should be fairly straightforward and we can probably finish the integration in a couple of days. However if you discover it isn't the case, we can separate it into a different issue.

          Also, I did not include a note in CHANGES. Once you're done merging it into the larger patch, I can help w/ the javadocs and CHANGES if required.

          Yonik Seeley added a comment -

          Perhaps the Streams class should be part of ReusingAnalyzer and not Analyzer? It's a specific implementation of a reusable token stream, not part of the Analyzer interface.

          Shai Erera added a comment -

          Well ... it's true and false at the same time. On one hand, I think Analyzer should impl reusableTokenStream just like ReusingAnalyzer, but we can't do that because of back-compat. On the other hand, Streams does belong to ReusingAnalyzer because it makes use of it.

          What I thought was that maybe someone would want to make use of Streams w/o extending Analyzer. And ... we may want to constrain setPreviousTokenStream to Streams, or TokenStream, or a generic type of thing, to avoid casting and be more type-safe.

          I wonder if we'll stay w/ Analyzer.reusableTS as it is forever, or whether we'll break it one day to be like ReusingAnalyzer (and thereby deprecate ReusingAnalyzer?).

          I guess that if we think for the long term that ReusingAnalyzer will stay, and hence most Analyzers will actually be ReusingAnalyzer extension, then I'm ok w/ moving Streams into ReusingAnalyzer. But keeping it in Analyzer will allow us in the future to constrain prevTokenStream to be of that type and not a generic Object.

          Robert Muir added a comment -

          Shai, I will take a look at your patch as soon as I am at a real computer. Thanks for your work in advance; we should maybe put it on another issue though, just to keep the scope of this one reasonably contained.

          And ... we may want to constrain setPreviousTokenStream to Streams, or TokenStream, or a generic type of thing, to avoid casting and be more type-safe.

          See QueryAutoStopWordAnalyzer in my patch for a counter-example to this. In this case, it is a Set, because it is dependent upon the field.

          Yonik Seeley added a comment -

          In general, we should strive to treat our base abstract classes like interfaces, with the ability to provide default implementations, to avoid back-compatibility breaks (while avoiding adding members or non-overridable methods). One could make the case that the CloseableThreadLocal should not be in Analyzer either, but it's been there long enough now that it would break back compat to move it.

          What I thought was that maybe someone would want to make use of Streams w/o extending Analyzer.

          They still can - ReusingAnalyzer.Streams.

          But keeping it in Analyzer will allow us in the future to constrain prevTokenStream to be of that type and not a generic Object.

          Doesn't seem like we should force all tokenstreams to be reusable, or constrain the exact form of how a reusable token stream is obtained.

          Shai Erera added a comment -

          I guess you're both right. I thought that one day we'd cancel ReusingAnalyzer and pull it up into Analyzer, but it looks like ReusingAnalyzer makes sense to stay, and so we can move Streams to it.

          Robert, if possible, I'd like to get this one in as part of this issue. The reason is that you already modified all Analyzers to impl reusableTokenStream. I'm afraid that if we do it in another issue, some Analyzers will be skipped over. If you want, I can apply this to your patch and post back an updated one tomorrow.

          Mark Miller added a comment -

          To not break back compat, everything has got to work even if they don't yet move from the deprecated method.

          Robert Muir added a comment -

          Robert, if possible, I'd like to get this one in as part of this issue. The reason is that you already modified all Analyzers to impl reusableTokenStream. I'm afraid that if we do it in another issue, some Analyzers will be skipped over. If you want, I can apply this to your patch and post back an updated one tomorrow.

          Shai, this is a valid concern. But let's also not forget analyzers that already implement reusableTS that are not a part of this patch (yet should be changed to extend ReusingAnalyzer)... examples include collation/*, analyzers/fa, etc.

          But even before this I think we should make sure everyone is happy with ReusingAnalyzer itself... this is the only reason I think it might merit another issue... this patch is already a little unwieldy because I let the scope creep to include reset(Reader) and reset() methods for tokenstreams that keep state...

          Yonik Seeley added a comment -

          But even before this I think we should make sure everyone is happy with ReusingAnalyzer itself... this is the only reason I think it might merit another issue

          +1

          The ReusingAnalyzer brings up other issues of protocol - right now consumers like Lucene indexing call reset() on the stream, but I see the prototype ReusingAnalyzer also calling reset() on the stream.

          Shai Erera added a comment -

          right now consumers like Lucene indexing call reset() on the stream, but I see the prototype ReusingAnalyzer also calling reset() on the stream.

          I don't think that's a new problem - I simply coded what I think most Analyzers that do impl reusableTS do. And if there are reusableTS impls that don't call reset() on purpose, then we shouldn't call it.

          Therefore, I think that we should change our code to not call reset(). I don't think there's a reusableTS impl which does not call reset() because it relies on the consumer to do it (nobody guarantees that anyway). We should simply note that in the reusableTS javadoc (e.g., something like "returns an already reset token stream"). I don't mind doing that in a separate issue if that's what you prefer.

          Robert Muir added a comment -

          I would like to commit this as-is later today, and we can create a separate issue for improving reusability: ReusingAnalyzer, figuring out who calls reset(), that sort of thing?

          Shai Erera added a comment -

          Works for me. +1

          Robert Muir added a comment -

          Committed revision 804680.

          Robert Muir added a comment -

          There is a small backwards-compatibility problem here:
          some of the analyzers, such as GermanAnalyzer, have a setStemExclusionTable method.

          With reusableTS, if someone changes the stem exclusion table, it will not take effect instantly like it did before.
          So for these, if someone changes it, I think we should blow away the saved streams: setPreviousTokenStream(null).

          Patch and tests coming.
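
          A sketch of the pattern described above, as it might look inside such an analyzer (the exclusionSet field is illustrative; StopFilter.makeStopSet and setPreviousTokenStream are existing API): discard the thread-local saved streams whenever a setting changes, so reusableTS rebuilds the chain with the new configuration.

            public void setStemExclusionTable(String[] exclusionList) {
              exclusionSet = StopFilter.makeStopSet(exclusionList);
              // blow away the saved streams so the next reusableTokenStream
              // call rebuilds the chain with the new exclusion set
              setPreviousTokenStream(null);
            }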

          Robert Muir added a comment -

          Fix + tests for the br, de, fr, and nl analyzers.

          Robert Muir added a comment -

          I plan to commit this fix at the end of the day if nobody objects to it.

          Michael Busch added a comment -

          Patch looks good!

          Robert Muir added a comment -

          Michael, thanks for the review.

          But I am looking over all the analyzers again one last time, and I think I found another devious one:
          CzechAnalyzer has a loadStopWords() method which is not a utility method; it replaces the stop table.

          So I will go through these again (very slowly) and upload another patch.

          Sorry for missing these the first time.

          Robert Muir added a comment -

          Null out the saved streams in case someone calls CzechAnalyzer's loadStopWords(),
          so that new stopwords are applied immediately with reusableTS.

          Robert Muir added a comment -

          Committed revision 805766.


            People

             • Assignee: Robert Muir
             • Reporter: Robert Muir
             • Votes: 0
             • Watchers: 0
