Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels: None

      Description

      In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that could siphon off certain tokens and store them in a buffer to be used later in the processing pipeline.

      For example, if you want to have two fields, one lowercased and one not, but all the other analysis is the same, then you could save off the tokens to be output for a different field.

      Patch to follow, but I am still not sure about a couple of things, mostly how it plays with the new reuse API.

      See http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397

      1. LUCENE-1058.patch
        11 kB
        Grant Ingersoll
      2. LUCENE-1058.patch
        11 kB
        Grant Ingersoll
      3. LUCENE-1058.patch
        29 kB
        Grant Ingersoll
      4. LUCENE-1058.patch
        29 kB
        Grant Ingersoll
      5. LUCENE-1058.patch
        28 kB
        Grant Ingersoll
      6. LUCENE-1058.patch
        17 kB
        Grant Ingersoll
      7. LUCENE-1058.patch
        13 kB
        Grant Ingersoll

        Issue Links

          Activity

          Grant Ingersoll added a comment -

          First draft of a patch; it provides two different approaches:

          1. CachedAnalyzer and CachedTokenizer take in a list of Tokens and output them as appropriate. Similar to CachingTokenFilter, but assumes you already have the Tokens.

          2. In contrib/analyzers/buffered, add CollaboratingAnalyzer and related classes for creating an Analyzer, etc., that work in the stream.

          Still not sure if and how this plays with the Token reuse API (I think it doesn't).
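
          For illustration only (a sketch, not code from the patch), a CachedTokenizer along the lines described in approach 1 might look roughly like this:

          import java.io.IOException;
          import java.util.Iterator;
          import java.util.List;
          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.Tokenizer;

          // Sketch: replays a caller-supplied List of Token objects as a TokenStream.
          public class CachedTokenizer extends Tokenizer {
            private final List tokens;   // List of Token
            private Iterator iterator;

            public CachedTokenizer(List tokens) {
              this.tokens = tokens;
            }

            public Token next() throws IOException {
              if (iterator == null) {
                iterator = tokens.iterator();
              }
              return iterator.hasNext() ? (Token) iterator.next() : null;
            }

            public void reset() throws IOException {
              iterator = tokens.iterator();
            }
          }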

          Michael McCandless added a comment -

          I think the discussion in LUCENE-1063 is relevant to this issue: if you store (& re-use) Tokens you may need to return a copy of the Token from the next() method to ensure that any filters that alter the Token don't mess up your private copy.
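
          For illustration, the defensive copy being described could look like this in a buffering filter's next() method (a sketch, not code from the patch; "buffer" stands for whatever List the filter stores its private Tokens in):

          public Token next() throws IOException {
            Token token = input.next();
            if (token == null) {
              return null;
            }
            buffer.add(token);              // keep the private copy for later use
            return (Token) token.clone();   // hand downstream filters a copy they may alter
          }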

          Grant Ingersoll added a comment -

          Some javadoc comments for the modifyToken method in BufferingTokenFilter should be sufficient, right? Something to the effect that if this TokenFilter is not the last in the chain, it should make a full copy.

          As for the CachedTokenizer and CachedAnalyzer, those should be implied, since the user is passing them in to begin with.

          The other thing of interest is that calling Analyzer.tokenStream(String, Reader) is not needed. In fact, this somewhat suggests having a new Fieldable property akin to tokenStreamValue(), etc. that says don't even ask the Fieldable for a value.

          Let me take a crack at what that means and post a patch. It will mean some changes to invertField() in DocumentsWriter and possibly changing it to not require that one of tokenStreamValue(), readerValue() or stringValue() be defined. Not sure if that is a good idea or not.

          Grant Ingersoll added a comment -

          Here's a patch that modifies the DocumentsWriter to not throw an IllegalArgumentException if no Reader is specified. Thus, an Analyzer needs to be able to handle a null Reader (this still needs to be documented). Basically, the semantics of it are that the Analyzer is producing Tokens from some other means. I probably should spell this out in a new Field constructor as well, but this should suffice for now, and I will revisit it after the break.

          I also added a TestCollaboratingAnalyzer. All tests pass.
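
          To illustrate what "producing Tokens from some other means" could look like, here is a hypothetical Analyzer (not part of the patch) that ignores the possibly-null Reader and serves Tokens from a pre-filled list:

          import java.io.Reader;
          import java.util.Iterator;
          import java.util.List;
          import org.apache.lucene.analysis.*;

          // Hypothetical sketch: the field's tokens come from this buffer, not from a Reader.
          public class BufferBackedAnalyzer extends Analyzer {
            private final List tokens;   // List of Token

            public BufferBackedAnalyzer(List tokens) {
              this.tokens = tokens;
            }

            public TokenStream tokenStream(String fieldName, Reader reader) {
              // reader may legitimately be null here
              final Iterator it = tokens.iterator();
              return new TokenStream() {
                public Token next() {
                  return it.hasNext() ? (Token) it.next() : null;
                }
              };
            }
          }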

          Grant Ingersoll added a comment -

          A new version of this with the following changes/additions:

          DocumentsWriter no longer requires that a Field have a value (i.e. stringValue(), etc.). Added a new Field constructor that allows for the construction of a Field without a value. This would allow for Analyzer implementations that produce their own tokens (whatever that means).

          Moved CollaboratingAnalyzer et al. to core under analysis.buffered, as I thought these items should be in core given the changes to Field and DocsWriter.

          Note, I think this is a subtle, but important change in DocumentsWriter/Field behavior.

          Grant Ingersoll added a comment - edited

          Added some more documentation, plus a test showing it is bad to use the no-value Field constructor w/o support from the Analyzer to produce tokens.

          If no objections, I will commit on Thursday or Friday of this week.

          Grant Ingersoll added a comment -

          Fixed a failing test.

          Michael Busch added a comment -

          Grant,

          I'm not sure why we need this patch.

          For the test case that you're describing:

          For example, if you want to have two fields, one lowercased and one not, but all the other analysis is the same, then you could save off the tokens to be output for a different field.

          can't you simply do something like this:

          Document d = new Document();
          TokenStream t1 = new CachingTokenFilter(new WhitespaceTokenizer(reader));
          TokenStream t2 = new LowerCaseFilter(t1);
          d.add(new Field("f1", t1));
          d.add(new Field("f2", t2));
          

          Maybe I'm missing something?

          Yonik Seeley added a comment -

          Maybe I'm not looking at it the right way yet, but I'm not sure this feels "right"...
          Since Field has a tokenStreamValue(), wouldn't it be easiest to just use that?
          If the tokens of two fields are related, one could just pre-analyze those fields and set the token streams appropriately. Seems more flexible and keeps any convoluted cross-field logic in the application domain.

          Grant Ingersoll added a comment -

          Maybe I'm missing something?

          No, I don't think you are missing anything in that use case; it's just an example of its use. And I am not totally sold on this approach, but mostly am.

          I had originally considered your option, but didn't feel it was satisfactory for the case where you are extracting things like proper nouns, or maybe generating a category value. The more general case is where not all the tokens are needed (in fact, very few are). In those cases, you have to go back through the whole list of cached tokens in order to extract the ones you want. In fact, thinking some more on it, I am not sure my patch goes far enough, in the sense that what if you want it to buffer in mid-stream.

          For example, if you had:
          StandardTokenizer
          Proper Noun TF
          LowerCaseTF
          StopTF

          and Proper Noun TF is solely responsible for setting aside proper nouns as it comes across them in the stream.

          As for the convoluted cross-field logic, I don't think it is all that convoluted. There are only two fields, and the implementing Analyzer takes care of all of it. The only real requirement the application has is that the fields be ordered correctly.

          I do agree somewhat about the pre-analysis approach, except for the case where there may be a large number of tokens in the source field, in which case you are holding them around in memory (maxFieldLength mitigates this to some extent). Also, it puts the onus on the app writer to do it, when it could be pretty straightforward for Lucene to do it w/o its usual analysis pipeline.

          At any rate, separate of the CollaboratingAnalyzer, I do think the CachedTokenFilter is useful, especially in supporting the pre-analysis approach.

          Yonik Seeley added a comment -

          As for the convoluted cross-field logic, I don't think it is all that convoluted.

          But it's baked into CollaboratingAnalyzer... it seems like this is better left to the user. What if they wanted 3 fields instead of two?

          I do agree somewhat about the pre-analysis approach, except for the case where there may be a large number of tokens in the source field, in which case, you are holding them around in memory

          Isn't this what your current code does?

          Grant Ingersoll added a comment -

          What if they wanted 3 fields instead of two?

          True. I'll have to think about a more generic approach. In some sense, I think 2 is often sufficient, but you are right, it isn't totally generic in the spirit of Lucene.

          To some extent, I was thinking that this could help optimize Solr's copyField mechanism. In Solr's case, I think you often have copy fields that have marginal differences in the filters that are applied. It would be useful for Solr to be able to optimize these so that it doesn't have to go through the whole analysis chain again.

          Isn't this what your current code does?

          No, in my main use case (# of buffered tokens is << # of source tokens) the only tokens kept around are the (much) smaller subset of buffered tokens. In the pre-analysis approach you have to keep the source field tokens and the buffered tokens. Not to mention that you are increasing the work by having to iterate over the cached tokens in the list in Lucene. Thus, you have the cost of the analysis in your application plus the storage of both token lists (one large, one small, likely), then in Lucene you have the cost of iterating over two lists. In my approach, I think, you have the cost of analysis plus the cost of storage of one list of tokens (small) and the cost of iterating that list.

          Yonik Seeley added a comment -

          To some extent, I was thinking that this could help optimize Solr's copyField mechanism.

          Maybe... it would take quite a bit of work to automate it, though, I think.

          As far as pre-analysis costs go, iteration is pretty much free in comparison to everything else. Memory is the big factor.

          Things like entity extraction are normally not done by Lucene analyzers AFAIK... but if one wanted a framework to do that, the problem is more generic. You really want to be able to add to multiple fields from multiple other fields.

          Grant Ingersoll added a comment -

          Any objection to me committing the CachedAnalyzer and CachedTokenizer pieces of this patch, as I don't think they are affected by the other parts of this and they solve the pre-analysis portion of this discussion. In the meantime, I will think some more about the generic field case, as I do think it is useful. I am also trying out some basic benchmarking on this.

          Things like entity extraction are normally not done by lucene analyzers AFAIK

          Consider yourself in the "know" now, as I have done this on a few occasions, but, yes, I do agree a one to many approach is probably better if it can be done in a generic way.

          Yonik Seeley added a comment -

          I dunno... it feels like we should have the right generic solution (many-to-many) before committing anything in this case, simply because this is all user-level code (the absence of this patch doesn't prohibit the user from doing anything... no package protected access rights are needed, etc).

          Michael Busch added a comment -

          I think the ideas here make sense, e.g. to have a buffering
          TokenFilter that doesn't buffer all tokens but enables the
          user to control which tokens to buffer.

          What is still not clear to me is why we have to introduce a
          new API for this and a new kind of analyzer? To allow creating
          a no-value field seems strange. Can't we achieve all this
          by using the Field(String, TokenStream) API without the
          analyzer indirection?

          The javadocs should make clear that the IndexWriter processes
          fields in the same order the user added them. So if a user
          adds TokenStream ts1 and thereafter ts2, they can be sure
          that ts1 is processed first. With that knowledge, ts1 can
          buffer certain tokens that ts2 then uses. Adding even more
          fields that use the same tokens is straightforward.

          Grant Ingersoll added a comment -

          OK, I am trying not to be fixated on the Analyzer. I guess I haven't fully synthesized the new TokenStream use in DocsWriter.

          I agree, I don't like the no-value Field, and am open to suggestions.

          So, I guess I am going to push back and ask: how would you solve the case where you have two fields and the analysis given by:
          source field:
          StandardTokenizer
          Proper Noun TF
          LowerCaseTF
          StopTF

          buffered1 Field:
          Proper Noun Cache TF (cache of all terms found to be proper nouns by the Proper Noun TF)

          buffered2 Field:
          All terms lower cased

          And the requirement is that you only do the Analysis phase once (i.e. for the source field) and the other two fields are from memory.

          I am just not seeing it yet, so I appreciate the explanation, as it will better cement my understanding of the new TokenStream stuff and DocsWriter.

          Michael Busch added a comment -

          We need to change the CachingTokenFilter a bit (untested code):

          public class CachingTokenFilter extends TokenFilter {
            private List cache;
            private Iterator iterator;
            
            public CachingTokenFilter(TokenStream input) {
              super(input);
              this.cache = new LinkedList();
            }
            
            public Token next() throws IOException {
              if (iterator != null) {
                if (!iterator.hasNext()) {
                  // the cache is exhausted, return null
                  return null;
                }   
                return (Token) iterator.next();
              } else {
                Token token = input.next();
                addTokenToCache(token);
                return token;
              }
            }
            
            public void reset() throws IOException {
              if(cache != null) {
              	iterator = cache.iterator();
              }
            }
            
            protected void addTokenToCache(Token token) {
              if (token != null) {
                cache.add(token);
              }
            }
          }
          

          Then you can implement the ProperNounTF:

          class ProperNounTF extends CachingTokenFilter {
            protected void addTokenToCache(Token token) {
              if (token != null && isProperNoun(token)) {
                cache.add(token);
              }
            }
            
            private boolean isProperNoun(Token token) {...}
          }  
          

          And then you add everything to Document:

          Document d = new Document();
          TokenStream properNounTf = new ProperNounTF(new StandardTokenizer(reader));
          TokenStream stdTf = new CachingTokenFilter(new StopTokenFilter(properNounTf));
          TokenStream lowerCaseTf = new LowerCaseTokenFilter(stdTf);
          
          
          d.add(new Field("std", stdTf));
          d.add(new Field("nouns", properNounTf));
          d.add(new Field("lowerCase", lowerCaseTf));
          

          Again, this is untested, but I believe it should work?

          Yonik Seeley added a comment -

          Very similar to what I came up with I think... (all untested, etc)

          class ListTokenizer extends Tokenizer {
            protected List<Token> lst = new ArrayList<Token>();
            protected Iterator<Token> iter;
          
            public ListTokenizer(List<Token> input) {
              this.lst = input;
              if (this.lst==null) this.lst = new ArrayList<Token>();
            }
          
            /** only valid if tokens have not been consumed,
             * i.e. if this tokenizer is not part of another tokenstream
             */
            public List<Token> getTokens() {
              return lst;
            }
          
            public Token next(Token result) throws IOException {
              if (iter==null) iter = lst.iterator();
              return iter.hasNext() ? iter.next() : null;
            }
          
            /** Override this method to cache only certain tokens, or new tokens based
             * on the old tokens.
             */
            public void add(Token t) {
              if (t==null) return;
              lst.add((Token)t.clone());
            }
          
            public void reset() throws IOException {
              iter = lst.iterator();
            }
          }
          
          class TeeTokenFilter extends TokenFilter {
            ListTokenizer sink;
          
            protected TeeTokenFilter(TokenStream input, ListTokenizer sink) {
              super(input);
              this.sink = sink;
            }
          
            public Token next(Token result) throws IOException {
              Token t = input.next(result);
              sink.add(t);
              return t;
            }
          }
          
          Yonik Seeley added a comment -

          I think having the "tee" solves the many-to-many case... you can have many fields contribute tokens to a new field.

          ListTokenizer sink1 = new ListTokenizer(null);
          ListTokenizer sink2 = new ListTokenizer(null);
          
          TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
          TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);    
          
          // now sink1 and sink2 will both get tokens from both reader1 and reader2 after whitespace tokenizer
          // now we can further wrap any of these in extra analysis, and more "tees" can be inserted if desired.
          
          TokenStream final1 = new LowerCaseFilter(source1);
          TokenStream final2 = source2;
          TokenStream final3 = new EntityDetect(sink1);
          TokenStream final4 = new URLDetect(sink2);
              
          d.add(new Field("f1", final1));
          d.add(new Field("f2", final2));
          d.add(new Field("f3", final3));
          d.add(new Field("f4", final4));
          
          Michael Busch added a comment -

          I like the TeeTokenFilter! +1

          Grant Ingersoll added a comment -

          OK, looks good to me and is much simpler. The only thing that gets complicated is the constructors, but that should be manageable. Thanks for bearing w/ me.

          One of you want to whip up a patch w/ tests or do you want me to do it?

          Michael Busch added a comment -

          I'm quite busy currently with other stuff. Feel free to go ahead

          Grant Ingersoll added a comment -

          Will do. Patch to follow shortly

          Grant Ingersoll added a comment -

          Whew. I think we are there and I like it!

          I renamed Yonik's suggestions to be SinkTokenizer and SourceTokenFilter to model the whole source/sink notion. Hopefully people won't think the SourceTokenFilter is for processing code.

          I will commit tomorrow if there are no objections.

          Yonik Seeley added a comment -

          The SinkTokenizer name could make sense, but I think TeeTokenFilter makes more sense than SourceTokenFilter (it is a tee, it splits a single token stream into two, just like the UNIX tee command).

          Grant Ingersoll added a comment -

          Tee it is. And here I just thought you liked golf! I guess I have never used the tee command in UNIX.

          Grant Ingersoll added a comment -

          Committed revision 599478.
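
          For reference, the committed classes can be wired up for the use case from the description (one field tokenized as-is, one lowercased, with a single pass of analysis) roughly as follows; this is an untested sketch, and details may differ slightly from the committed code:

          SinkTokenizer sink = new SinkTokenizer(new ArrayList());   // buffers every token it is fed
          TokenStream source = new TeeTokenFilter(new WhitespaceTokenizer(reader), sink);

          Document d = new Document();
          // The source field must be added first so the sink is filled before it is read.
          d.add(new Field("original", source));
          d.add(new Field("lowercased", new LowerCaseFilter(sink)));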

          Michael Busch added a comment -

          Sorry, this review is a bit late. Only a simple remark:

          In SinkTokenizer you could initialize the iterator in the constructor;
          then you can avoid the check if (iter==null) in next()?

          Yonik Seeley added a comment -

          > In SinkTokenizer you could initialize the iterator in the constructor

          Iterators are generally fail-fast, hence it would throw an exception when you tried to use it later after adding some elements.

          Michael Busch added a comment -

          I see. Then we should set iter=null in add() in case after reset() more tokens
          are added to the list, right?
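
          Something like this, just sketching the idea being discussed:

          public void add(Token t) {
            if (t == null) return;
            lst.add((Token) t.clone());
            iter = null;   // invalidate any iterator previously handed out by reset()
          }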

          Grant Ingersoll added a comment -

          In looking again, I also wonder whether getTokens() shouldn't return an immutable list? Or should we allow adding tokens outside of the Tee process?

          Yonik Seeley added a comment -

          > I see. Then we should set iter=null in add() in case after reset() more tokens are added to the list, right?
          Depends on what we consider valid use cases... reset could also just null out iter.

          > Or should we allow adding tokens outside of the Tee process?
          Why not? Seems more flexible, and this is an expert level API.

          Grant Ingersoll added a comment -

          And, of course, if we are calling add() outside of the Tee process, we probably don't need to clone the token, either.

          Grant Ingersoll added a comment -

          Why not? Seems more flexible, and this is an expert level API.

          Then we should document that they must call reset() before calling next(), right? Same could go for the add() method.

          Grant Ingersoll added a comment -

          In SinkTokenizer, I also overrode the close() method from Tokenizer so that it doesn't try to close the Reader, which would throw a NullPointerException.
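
          That override amounts to (sketch):

          public void close() throws IOException {
            // No underlying Reader to close; overriding avoids the NPE from Tokenizer.close().
          }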

          Grant Ingersoll added a comment -

          http://www.gossamer-threads.com/lists/lucene/java-dev/56255

          Changed to override next() instead of next(Token)


            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes: 0
              Watchers: 1

              Dates

              • Created:
                Updated:
                Resolved:

                Development