Lucene - Core
  1. Lucene - Core
  2. LUCENE-4642

Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      All tokenizer implementations have a constructor that takes a given AttributeSource as parameter (LUCENE-1826). These should be removed.

      TokenizerFactory does not provide an API to create tokenizers with a given AttributeFactory, but quite a few tokenizers have constructors that take an AttributeFactory. TokenizerFactory should add a create(AttributeFactory) method, as should subclasses for tokenizers with AttributeFactory accepting ctors.

      1. LUCENE-4642.patch
        35 kB
        Steve Rowe
      2. LUCENE-4642.patch
        34 kB
        Renaud Delbru
      3. LUCENE-4642.patch
        18 kB
        Steve Rowe
      4. LUCENE-4642.patch
        25 kB
        Renaud Delbru
      5. LUCENE-4642-single-create-method-on-TokenizerFactory-subclasses.patch
        32 kB
        Steve Rowe
      6. LUCENE-4642-single-create-method-on-TokenizerFactory-subclasses.patch
        33 kB
        Steve Rowe
      7. TrieTokenizerFactory.java.patch
        2 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.
        Hide
        Renaud Delbru added a comment -

        Thanks for committing this, Steve and Robert. That's great.

        Show
        Renaud Delbru added a comment - Thanks for committing this, Steve and Robert. That's great.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1456771

        LUCENE-4642:

        • TokenizerFactory.create(Reader) is made final, and calls the AttributeFactory-accepting version with AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY
        • TokenizerFactory.create(AttributeFactory, Reader) is made abstract
        • Added AttributeFactory-accepting constructors to all Tokenizer's with existing TokenizerFactory subclasses that didn't already have them
        • Removed create(Reader) from all TokenizerFactory subclasses
          (merged trunk r1456768)
        Show
        Commit Tag Bot added a comment - [branch_4x commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1456771 LUCENE-4642 : TokenizerFactory.create(Reader) is made final, and calls the AttributeFactory-accepting version with AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY TokenizerFactory.create(AttributeFactory, Reader) is made abstract Added AttributeFactory-accepting constructors to all Tokenizer's with existing TokenizerFactory subclasses that didn't already have them Removed create(Reader) from all TokenizerFactory subclasses (merged trunk r1456768)
        Hide
        Steve Rowe added a comment -

        Committed the second patch to trunk and branch_4x.

        Thanks for your help, Robert.

        Show
        Steve Rowe added a comment - Committed the second patch to trunk and branch_4x. Thanks for your help, Robert.
        Hide
        Robert Muir added a comment -

        Thanks a lot.. this is very nice IMO.

        Show
        Robert Muir added a comment - Thanks a lot.. this is very nice IMO.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1456768

        LUCENE-4642:

        • TokenizerFactory.create(Reader) is made final, and calls the AttributeFactory-accepting version with AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY
        • TokenizerFactory.create(AttributeFactory, Reader) is made abstract
        • Added AttributeFactory-accepting constructors to all Tokenizer's with existing TokenizerFactory subclasses that didn't already have them
        • Removed create(Reader) from all TokenizerFactory subclasses
        Show
        Commit Tag Bot added a comment - [trunk commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1456768 LUCENE-4642 : TokenizerFactory.create(Reader) is made final, and calls the AttributeFactory-accepting version with AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY TokenizerFactory.create(AttributeFactory, Reader) is made abstract Added AttributeFactory-accepting constructors to all Tokenizer's with existing TokenizerFactory subclasses that didn't already have them Removed create(Reader) from all TokenizerFactory subclasses
        Hide
        Steve Rowe added a comment -

        In this patch there is a new even more horrible hack in TrieTokenizer(Factory) - the AttributeFactory argument to the TrieTokenizer constructor is ignored!!! Surely there a better way???:

        Just at a quick glance, I think it should be passed to TrieTokenizer.getNumericTokenStream(int), which should instead take (AttFactory, int), and call: public NumericTokenStream(AttributeFactory factory, final int precisionStep) ?
        This way the numerictokenstream is created with the requested attfactory. TrieTokenizer shouldnt take factory at all since its just this 'layer on top' that pretends to be a tokenizer over it?

        Cool, that works, thanks; patch attached that does this.

        Committing shortly.

        Show
        Steve Rowe added a comment - In this patch there is a new even more horrible hack in TrieTokenizer(Factory) - the AttributeFactory argument to the TrieTokenizer constructor is ignored!!! Surely there a better way???: Just at a quick glance, I think it should be passed to TrieTokenizer.getNumericTokenStream(int), which should instead take (AttFactory, int), and call: public NumericTokenStream(AttributeFactory factory, final int precisionStep) ? This way the numerictokenstream is created with the requested attfactory. TrieTokenizer shouldnt take factory at all since its just this 'layer on top' that pretends to be a tokenizer over it? Cool, that works, thanks; patch attached that does this. Committing shortly.
        Hide
        Robert Muir added a comment -

        In this patch there is a new even more horrible hack in TrieTokenizer(Factory) - the AttributeFactory argument to the TrieTokenizer constructor is ignored!!! Surely there a better way???:

        Just at a quick glance, I think it should be passed to TrieTokenizer.getNumericTokenStream(int), which should instead take (AttFactory, int), and call: public NumericTokenStream(AttributeFactory factory, final int precisionStep) ?

        This way the numerictokenstream is created with the requested attfactory. TrieTokenizer shouldnt take factory at all since its just this 'layer on top' that pretends to be a tokenizer over it?

        Show
        Robert Muir added a comment - In this patch there is a new even more horrible hack in TrieTokenizer(Factory) - the AttributeFactory argument to the TrieTokenizer constructor is ignored!!! Surely there a better way???: Just at a quick glance, I think it should be passed to TrieTokenizer.getNumericTokenStream(int), which should instead take (AttFactory, int), and call: public NumericTokenStream(AttributeFactory factory, final int precisionStep) ? This way the numerictokenstream is created with the requested attfactory. TrieTokenizer shouldnt take factory at all since its just this 'layer on top' that pretends to be a tokenizer over it?
        Hide
        Steve Rowe added a comment - - edited

        Patch:

        • TokenizerFactory.create(Reader) is made final, and calls the AttributeFactory-accepting version with AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY
        • TokenizerFactory.create(AttributeFactory, Reader) is made abstract
        • Added AttributeFactory-accepting constructors to all Tokenizer's with existing TokenizerFactory subclasses that didn't already have them
        • Removed create(Reader) from all TokenizerFactory subclasses.

        In this patch there is a new even more horrible hack in TrieTokenizer(Factory) - the AttributeFactory argument to the TrieTokenizer constructor is ignored!!! Surely there a better way???:

        public class TrieTokenizerFactory extends TokenizerFactory {
        ...
          @Override
          public TrieTokenizer create(AttributeFactory factory, Reader input) {
            return new TrieTokenizer(factory, input, type, TrieTokenizer.getNumericTokenStream(precisionStep));
          }
        }
        
        final class TrieTokenizer extends Tokenizer {
        ...
          public TrieTokenizer(Reader input, TrieTypes type, final NumericTokenStream ts) {
            this(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, input, type, ts);
          }
        
          public TrieTokenizer(AttributeFactory factory, Reader input, TrieTypes type, final NumericTokenStream ts) {
            // Hack #0: factory param is ignored
            // Häckidy-Hick-Hack #1: must share the attributes with the NumericTokenStream we delegate to, so we create a fake factory:
            super(new AttributeFactory() {
              @Override
              public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass) {
                return (AttributeImpl) ts.addAttribute(attClass);
              }
            }, input);
            // add all attributes:
            for (Iterator<Class<? extends Attribute>> it = ts.getAttributeClassesIterator(); it.hasNext();) {
              addAttribute(it.next());
            }
            this.type = type;
            this.ts = ts;
            // dates tend to be longer, especially when math is involved
            termAtt.resizeBuffer( type == TrieTypes.DATE ? 128 : 32 );
          }
        
        Show
        Steve Rowe added a comment - - edited Patch: TokenizerFactory.create(Reader) is made final, and calls the AttributeFactory -accepting version with AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY TokenizerFactory.create(AttributeFactory, Reader) is made abstract Added AttributeFactory -accepting constructors to all Tokenizer 's with existing TokenizerFactory subclasses that didn't already have them Removed create(Reader) from all TokenizerFactory subclasses. In this patch there is a new even more horrible hack in TrieTokenizer(Factory) - the AttributeFactory argument to the TrieTokenizer constructor is ignored !!! Surely there a better way???: public class TrieTokenizerFactory extends TokenizerFactory { ... @Override public TrieTokenizer create(AttributeFactory factory, Reader input) { return new TrieTokenizer(factory, input, type, TrieTokenizer.getNumericTokenStream(precisionStep)); } } final class TrieTokenizer extends Tokenizer { ... public TrieTokenizer(Reader input, TrieTypes type, final NumericTokenStream ts) { this (AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, input, type, ts); } public TrieTokenizer(AttributeFactory factory, Reader input, TrieTypes type, final NumericTokenStream ts) { // Hack #0: factory param is ignored // Häckidy-Hick-Hack #1: must share the attributes with the NumericTokenStream we delegate to, so we create a fake factory: super ( new AttributeFactory() { @Override public AttributeImpl createAttributeInstance( Class <? extends Attribute> attClass) { return (AttributeImpl) ts.addAttribute(attClass); } }, input); // add all attributes: for (Iterator< Class <? extends Attribute>> it = ts.getAttributeClassesIterator(); it.hasNext();) { addAttribute(it.next()); } this .type = type; this .ts = ts; // dates tend to be longer, especially when math is involved termAtt.resizeBuffer( type == TrieTypes.DATE ? 128 : 32 ); }
        Hide
        Steve Rowe added a comment -

        and in the base tokenizerfactory have

          /** just for ease of use... maybe @deprecated, maybe don't even have this method */
          public final Tokenizer create(Reader input) {
             return create(DEFAULT_ATTRIBUTE_FACTORY, input);
          }
        

        Okay, I understand, I'll make a patch - I plan on including the no-attr-factory create() variant, and not deprecating it - I assume this will be used by most consumers.

        Show
        Steve Rowe added a comment - and in the base tokenizerfactory have /** just for ease of use... maybe @deprecated, maybe don't even have this method */ public final Tokenizer create(Reader input) { return create(DEFAULT_ATTRIBUTE_FACTORY, input); } Okay, I understand, I'll make a patch - I plan on including the no-attr-factory create() variant, and not deprecating it - I assume this will be used by most consumers.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1455670

        LUCENE-4642: Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource.

        Show
        Commit Tag Bot added a comment - [trunk commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1455670 LUCENE-4642 : Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1455698

        LUCENE-4642: Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource. (merge trunk r1455670)

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1455698 LUCENE-4642 : Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource. (merge trunk r1455670)
        Hide
        Robert Muir added a comment -

        Its not a constructor explosion argument, because we aren't dealing with a constructor.

        Its just not a good API currently: it means every factory has to implement create() twice.

        Whats the sense in:

          /** Creates the {@link TokenStream} of n-grams from the given {@link Reader}. */
          @Override
          public NGramTokenizer create(Reader input) {
            return new NGramTokenizer(input, minGramSize, maxGramSize);
          }
        
          @Override
          public NGramTokenizer create(AttributeFactory factory, Reader input) {
            return new NGramTokenizer(factory, input, minGramSize, maxGramSize);
          }
        

        when you can just have

          @Override
          public NGramTokenizer create(AttributeFactory factory, Reader input) {
            return new NGramTokenizer(factory, input, minGramSize, maxGramSize);
          }
        

        and in the base tokenizerfactory have

          /** just for ease of use... maybe @deprecated, maybe don't even have this method */
          public final Tokenizer create(Reader input) {
             return create(DEFAULT_ATTRIBUTE_FACTORY, input);
          }
        
        Show
        Robert Muir added a comment - Its not a constructor explosion argument, because we aren't dealing with a constructor. Its just not a good API currently: it means every factory has to implement create() twice. Whats the sense in: /** Creates the {@link TokenStream} of n-grams from the given {@link Reader}. */ @Override public NGramTokenizer create(Reader input) { return new NGramTokenizer(input, minGramSize, maxGramSize); } @Override public NGramTokenizer create(AttributeFactory factory, Reader input) { return new NGramTokenizer(factory, input, minGramSize, maxGramSize); } when you can just have @Override public NGramTokenizer create(AttributeFactory factory, Reader input) { return new NGramTokenizer(factory, input, minGramSize, maxGramSize); } and in the base tokenizerfactory have /** just for ease of use... maybe @deprecated, maybe don't even have this method */ public final Tokenizer create(Reader input) { return create(DEFAULT_ATTRIBUTE_FACTORY, input); }
        Hide
        Steve Rowe added a comment -

        Somewhere else? You mean, everywhere it's used?

        I really don't like separate methods. This will cause bugs and confusion.

        Can you expand on this thought? Is this the constructor explosion argument? If not, what sort of bugs and confusion do you see happening?

        Show
        Steve Rowe added a comment - Somewhere else? You mean, everywhere it's used? I really don't like separate methods. This will cause bugs and confusion. Can you expand on this thought? Is this the constructor explosion argument? If not, what sort of bugs and confusion do you see happening?
        Hide
        Robert Muir added a comment -

        I think the way forward is to have one create() method for tokenizers. if it should take attributefactory, then thats what it takes (and somewhere else, DEFAULT_ATTRIBUTE_FACTORY is filled in).

        I really don't like separate methods. This will cause bugs and confusion.

        Show
        Robert Muir added a comment - I think the way forward is to have one create() method for tokenizers. if it should take attributefactory, then thats what it takes (and somewhere else, DEFAULT_ATTRIBUTE_FACTORY is filled in). I really don't like separate methods. This will cause bugs and confusion.
        Hide
        Steve Rowe added a comment - - edited

        I haven't resolved this issue because of Robert's response to Renaud:

        in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract.

        I don't agree with this for trunk. we should add deprecations or whatever in 4.x, but trunk should be clean without any UOEs.

        TokenizerFactory.create(AttributeFactory,Reader) throws UOE, for those subclasses that don't override because the corresponding Tokenizer doesn't have a constructor that takes an AttributeFactory.

        I guess the way forward (trunk and branch_4x, I think) is to make this method abstract and add implementations to all subclasses that don't already have one, after first adding AttributeFactory-accepting constructors to all Tokenizers that don't already have them?

        Show
        Steve Rowe added a comment - - edited I haven't resolved this issue because of Robert's response to Renaud: in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract. I don't agree with this for trunk. we should add deprecations or whatever in 4.x, but trunk should be clean without any UOEs. TokenizerFactory.create(AttributeFactory,Reader) throws UOE, for those subclasses that don't override because the corresponding Tokenizer doesn't have a constructor that takes an AttributeFactory. I guess the way forward (trunk and branch_4x, I think) is to make this method abstract and add implementations to all subclasses that don't already have one, after first adding AttributeFactory-accepting constructors to all Tokenizers that don't already have them?
        Hide
        Steve Rowe added a comment -

        Thanks Renaud for being persistent

        Show
        Steve Rowe added a comment - Thanks Renaud for being persistent
        Hide
        Steve Rowe added a comment -

        Committed to trunk r1455670 and branch_4x r1455698

        Show
        Steve Rowe added a comment - Committed to trunk r1455670 and branch_4x r1455698
        Hide
        Steve Rowe added a comment -

        Patch, narrows one or two more create(AttributeFactory) return types, minor cosmetic mods, removed unused imports.

        Committing shortly.

        Show
        Steve Rowe added a comment - Patch, narrows one or two more create(AttributeFactory) return types, minor cosmetic mods, removed unused imports. Committing shortly.
        Hide
        Renaud Delbru added a comment -

        Hi Steve, I imagine things were busy these past days with the 4.2 release. Would you need help to finalise this patch ? thanks.

        Show
        Renaud Delbru added a comment - Hi Steve, I imagine things were busy these past days with the 4.2 release. Would you need help to finalise this patch ? thanks.
        Hide
        Steve Rowe added a comment -

        Hi Renaud,

        I skimmed your patch, looks good, I'll take a closer look in the next couple days for completeness and testing.

        Show
        Steve Rowe added a comment - Hi Renaud, I skimmed your patch, looks good, I'll take a closer look in the next couple days for completeness and testing.
        Hide
        Renaud Delbru added a comment -

        Hi, any updates about the patch ? thanks.

        Show
        Renaud Delbru added a comment - Hi, any updates about the patch ? thanks.
        Hide
        Renaud Delbru added a comment -

        Hi, would this patch be considered for inclusion at some point in time ? Thanks.

        Show
        Renaud Delbru added a comment - Hi, would this patch be considered for inclusion at some point in time ? Thanks.
        Hide
        Renaud Delbru added a comment -

        Hi,

        I have submitted a patch which integrates:

        • the patch from Uwe
        • the removal of the Tokenizer(AttributeSource) constructor
        • the addition of a TokenizerFactory.create(AttributeFactory) method
        • some of the changes from the previous patch from Steve (e.g., TokenizerFactory.create method throw UOE by default)

        All test suites are passing.

        Show
        Renaud Delbru added a comment - Hi, I have submitted a patch which integrates: the patch from Uwe the removal of the Tokenizer(AttributeSource) constructor the addition of a TokenizerFactory.create(AttributeFactory) method some of the changes from the previous patch from Steve (e.g., TokenizerFactory.create method throw UOE by default) All test suites are passing.
        Hide
        Renaud Delbru added a comment -

        Great, I think that AttributeFactory hack could work for us. Would you agree to add a TokenizerFactory.create(AttributeFactory) method ? I could prepare a patch for that.

        Show
        Renaud Delbru added a comment - Great, I think that AttributeFactory hack could work for us. Would you agree to add a TokenizerFactory.create(AttributeFactory) method ? I could prepare a patch for that.
        Hide
        Robert Muir added a comment -

        +1 for Uwe's patch. I think this constructor is dangerous, i dont want it on every tokenizer.

        Show
        Robert Muir added a comment - +1 for Uwe's patch. I think this constructor is dangerous, i dont want it on every tokenizer.
        Hide
        Uwe Schindler added a comment -

        Here the patch that changes the hack so we can remove the incorrect AttributeSource from TokenStream/Tokenizer ctors. AttributeFactory is the way to go and if somebody relied on this odd expert behaviour, he can port the code to use the same hack.

        Show
        Uwe Schindler added a comment - Here the patch that changes the hack so we can remove the incorrect AttributeSource from TokenStream/Tokenizer ctors. AttributeFactory is the way to go and if somebody relied on this odd expert behaviour, he can port the code to use the same hack.
        Hide
        Uwe Schindler added a comment - - edited

        And I guess I was secretly hoping we could remove Tokenizer(AttributeSource) if we fixed the solr hack.

        This is my opinion, too!

        To remove the hack I have an idea (but it is also a hack). The main problem is Solr, which cannot work with plain TokenStreams, it always needs a Tokenizer (which is a serious limitation for special field types like numerics). The better hack I have is to write a fake AttributeFactory, that simply returns the attribute implementations of the underlying NumericTokenStream. I will attach a patch. Then we can remove new Tokenizer(AttributeSource), which is horrible and incorrect.

        Show
        Uwe Schindler added a comment - - edited And I guess I was secretly hoping we could remove Tokenizer(AttributeSource) if we fixed the solr hack. This is my opinion, too! To remove the hack I have an idea (but it is also a hack). The main problem is Solr, which cannot work with plain TokenStreams, it always needs a Tokenizer (which is a serious limitation for special field types like numerics). The better hack I have is to write a fake AttributeFactory, that simply returns the attribute implementations of the underlying NumericTokenStream. I will attach a patch. Then we can remove new Tokenizer(AttributeSource), which is horrible and incorrect.
        Hide
        Uwe Schindler added a comment -

        TokenStreams are final and their settings should not be modifiable (the ones which still have setters are there for backwards compatibility in Lucene 3.x, in 4.0 all settings should be unmodifiable). It is also impossible to change the AttributeFactory or AttributeSource after construction because the attributes are created during construction (addAttribute in the implicit field initialization constructor), so changing the AttributeSource/Factory afterwards will not work.

        Show
        Uwe Schindler added a comment - TokenStreams are final and their settings should not be modifiable (the ones which still have setters are there for backwards compatibility in Lucene 3.x, in 4.0 all settings should be unmodifiable). It is also impossible to change the AttributeFactory or AttributeSource after construction because the attributes are created during construction (addAttribute in the implicit field initialization constructor), so changing the AttributeSource/Factory afterwards will not work.
        Hide
        Renaud Delbru added a comment -

        Hi Robert,

        I understand your point of view. One possible alternative for simplifying the API would be to refactor constructors with AttributeSource/AttributeFactory into setters. After a quick look, this looks compatible with the existing tokenizers and tokenizer factories.
        The setting of AttributeSource/AttributeFactory for a tokenizer will be transparent (i.e., they do not have to explicitly create a constructor), and specific extension can be still implemented by subclasses (e.g., NumericTokenStream can overwrite the setAttributeFactory method to wrap a given factory with NumericAttributeFactory).
        For the tokenizer factories, we can then implement a create method with an AttributeSource/AttributeFactory parameter, which will call the abstract method create and then call the setAttributeSource/setAttributeFactory on the newly created tokenizer.

        What do you think ? Did I miss something in my reasoning which could break something ?

        Show
        Renaud Delbru added a comment - Hi Robert, I understand your point of view. One possible alternative for simplifying the API would be to refactor constructors with AttributeSource/AttributeFactory into setters. After a quick look, this looks compatible with the existing tokenizers and tokenizer factories. The setting of AttributeSource/AttributeFactory for a tokenizer will be transparent (i.e., they do not have to explicitly create a constructor), and specific extension can be still implemented by subclasses (e.g., NumericTokenStream can overwrite the setAttributeFactory method to wrap a given factory with NumericAttributeFactory). For the tokenizer factories, we can then implement a create method with an AttributeSource/AttributeFactory parameter, which will call the abstract method create and then call the setAttributeSource/setAttributeFactory on the newly created tokenizer. What do you think ? Did I miss something in my reasoning which could break something ?
        Hide
        Robert Muir added a comment -

        My problem i guess with AttributeSource/AttributeFactory is that they invade on every single custom tokenizer: the API is not good.

        I realize its useful for expert users to be able to plug in their own, but why in the world must every tokenizer have ctor explosion (minimum 3) to support this?

        And I guess I was secretly hoping we could remove Tokenizer(AttributeSource) if we fixed the solr hack.

        Again my main problem is not about what you want to do, its instead related to the existing APIs (Tokenizer.java) and where we are heading if we perpetuate this to the analysis factories (TokenizerFactory) too.

        Show
        Robert Muir added a comment - My problem i guess with AttributeSource/AttributeFactory is that they invade on every single custom tokenizer: the API is not good. I realize its useful for expert users to be able to plug in their own, but why in the world must every tokenizer have ctor explosion (minimum 3) to support this? And I guess I was secretly hoping we could remove Tokenizer(AttributeSource) if we fixed the solr hack. Again my main problem is not about what you want to do, its instead related to the existing APIs (Tokenizer.java) and where we are heading if we perpetuate this to the analysis factories (TokenizerFactory) too.
        Hide
        Renaud Delbru added a comment -

        @steve:

        have you looked at TeeSinkTokenFilter

        Yes, and from my current understanding, it is similar to our current implementation. The problem with this approach is that the exchange of attributes is performed using the AttributeSource.State API with AttributeSource#captureState and AttributeSource#restoreState, which copies the values of all attribute implementations that the state contains, and this is very inefficient as it has to copies arrays and other objects (e.g., char term arrays, etc.) for every single token.

        @robert:

        Concerning the problem of UOEs, the new patch of Steve reduces the number of UOEs to one only, which is much more reasonable than my first approach. I have looked at the current state of the Lucene trunk, and there are already a lot of UOEs in many places. So, I would suggest that this problem may not be a blocking one (but I might be wrong).

        Concerning the problem of constructor explosion, maybe we can find a consensus. Your proposition of removing Tokenizer(AttributeSource) cannot work for us, as we need it to share a same AttributeSource across multiple streams. However, as I proposed, removing the Tokenizer(AttributeFactory) could work as it could be emulated by using Tokenizer(AttributeSource).

        Show
        Renaud Delbru added a comment - @steve: have you looked at TeeSinkTokenFilter Yes, and from my current understanding, it is similar to our current implementation. The problem with this approach is that the exchange of attributes is performed using the AttributeSource.State API with AttributeSource#captureState and AttributeSource#restoreState, which copies the values of all attribute implementations that the state contains, and this is very inefficient as it has to copies arrays and other objects (e.g., char term arrays, etc.) for every single token. @robert: Concerning the problem of UOEs, the new patch of Steve reduces the number of UOEs to one only, which is much more reasonable than my first approach. I have looked at the current state of the Lucene trunk, and there are already a lot of UOEs in many places. So, I would suggest that this problem may not be a blocking one (but I might be wrong). Concerning the problem of constructor explosion, maybe we can find a consensus. Your proposition of removing Tokenizer(AttributeSource) cannot work for us, as we need it to share a same AttributeSource across multiple streams. However, as I proposed, removing the Tokenizer(AttributeFactory) could work as it could be emulated by using Tokenizer(AttributeSource).
        Hide
        Steve Rowe added a comment -

        Renaud, have you looked at TeeSinkTokenFilter? Sounds to me like a good fit for the use case you mentioned.

        Show
        Steve Rowe added a comment - Renaud, have you looked at TeeSinkTokenFilter ? Sounds to me like a good fit for the use case you mentioned.
        Hide
        Robert Muir added a comment -

        I raised a lot of questions. I think they are valid concerns.

        Show
        Robert Muir added a comment - I raised a lot of questions. I think they are valid concerns.
        Hide
        Renaud Delbru added a comment -

        Hi,

        are there still some open questions on this issue that block the patch of being committed ?

        Show
        Renaud Delbru added a comment - Hi, are there still some open questions on this issue that block the patch of being committed ?
        Hide
        Renaud Delbru added a comment -

        Because its totally unrelated.

        Well, I think the user could simply create a new AttributeSource with a given AttributeFactory to emulate the Tokenizer(AttributeFactory) ? But that might add some burden on the user side.

        Show
        Renaud Delbru added a comment - Because its totally unrelated. Well, I think the user could simply create a new AttributeSource with a given AttributeFactory to emulate the Tokenizer(AttributeFactory) ? But that might add some burden on the user side.
        Hide
        Robert Muir added a comment -

        Why not the contrary instead ? I.e., remove Tokenizer(AttributeFactory) and leave Tokenizer(AttributeSource) since AttributeFactory is an enclosed class of AttributeSource ? Limiting the API to only AttributeFactory will restrict it unnecessarily imho.

        Because its totally unrelated.

        AttributeFactory lets you customize the attribute implementations.

        But the AttributeSource ctor is a even crazier thing: its sharing actual attributes objects with another attributesource.

        Show
        Robert Muir added a comment - Why not the contrary instead ? I.e., remove Tokenizer(AttributeFactory) and leave Tokenizer(AttributeSource) since AttributeFactory is an enclosed class of AttributeSource ? Limiting the API to only AttributeFactory will restrict it unnecessarily imho. Because its totally unrelated. AttributeFactory lets you customize the attribute implementations. But the AttributeSource ctor is a even crazier thing: its sharing actual attributes objects with another attributesource.
        Hide
        Renaud Delbru added a comment -

        Personally: I think we should remove Tokenizer(AttributeSource): it bloats the APIs and causes ctor explosion.

        Why not the contrary instead ? I.e., remove Tokenizer(AttributeFactory) and leave Tokenizer(AttributeSource) since AttributeFactory is an enclosed class of AttributeSource ? Limiting the API to only AttributeFactory will restrict it unnecessarily imho.

        Our use case is to be able to create "advanced token streams", where one "parent token stream" can have multiple "child token streams", the parent token stream will share their attribute sources with the child token streams for performance reasons. Emulating this behaviour by doing copies of the attributes from stream to stream is really ineffective (our throughput is divided by at least 3).
        A more concrete use case is the ability to create "specific token streams" for a particular "token type". For example, our parent tokenizer tokenizes a string into a list of tokens, each one having a specific type. Then, each token is processed downstream by "child token streams". The child token stream that will process the token depends on the token type attribute.

        Show
        Renaud Delbru added a comment - Personally: I think we should remove Tokenizer(AttributeSource): it bloats the APIs and causes ctor explosion. Why not the contrary instead ? I.e., remove Tokenizer(AttributeFactory) and leave Tokenizer(AttributeSource) since AttributeFactory is an enclosed class of AttributeSource ? Limiting the API to only AttributeFactory will restrict it unnecessarily imho. Our use case is to be able to create "advanced token streams", where one "parent token stream" can have multiple "child token streams", the parent token stream will share their attribute sources with the child token streams for performance reasons. Emulating this behaviour by doing copies of the attributes from stream to stream is really ineffective (our throughput is divided by at least 3). A more concrete use case is the ability to create "specific token streams" for a particular "token type". For example, our parent tokenizer tokenizes a string into a list of tokens, each one having a specific type. Then, each token is processed downstream by "child token streams". The child token stream that will process the token depends on the token type attribute.
        Hide
        Robert Muir added a comment -

        I'd also like to know the use case here.

        Personally: I think we should remove Tokenizer(AttributeSource): it bloats the APIs and causes ctor explosion.
        There are no real use-cases in the lucene/solr codebase: its only used by a HACK (TrieTokenizerFactory in Solr), which should instead be fixed.

        AttributeFactory on the other hand is different (e.g. real use cases like numerics and collation)

        Show
        Robert Muir added a comment - I'd also like to know the use case here. Personally: I think we should remove Tokenizer(AttributeSource): it bloats the APIs and causes ctor explosion. There are no real use-cases in the lucene/solr codebase: its only used by a HACK (TrieTokenizerFactory in Solr), which should instead be fixed. AttributeFactory on the other hand is different (e.g. real use cases like numerics and collation)
        Hide
        Robert Muir added a comment -

        don't you think adding the AttributeFactory ctor would be more useful? I think its much more esoteric to provide an AttributeSource to a tokenizer.

        in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract.

        I don't agree with this for trunk. we should add deprecations or whatever in 4.x, but trunk should be clean without any UOEs.

        Show
        Robert Muir added a comment - don't you think adding the AttributeFactory ctor would be more useful? I think its much more esoteric to provide an AttributeSource to a tokenizer. in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract. I don't agree with this for trunk. we should add deprecations or whatever in 4.x, but trunk should be clean without any UOEs.
        Hide
        Steve Rowe added a comment -

        I think this is a good addition.

        Here is a patch against trunk. Four tokenizers and their factories have been removed, and so aren't present in this patch: ChineseTokenizer, ArabicLetterTokenizer, RussianLetterTokenizer, and CJKTokenizer.

        I've also taken the opportunity to narrow the return type to the class being instantiated from all analysis factory create() methods where possible.

        in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract.

        I agree - I did this in the patch.

        Show
        Steve Rowe added a comment - I think this is a good addition. Here is a patch against trunk. Four tokenizers and their factories have been removed, and so aren't present in this patch: ChineseTokenizer , ArabicLetterTokenizer , RussianLetterTokenizer , and CJKTokenizer . I've also taken the opportunity to narrow the return type to the class being instantiated from all analysis factory create() methods where possible. in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract. I agree - I did this in the patch.
        Hide
        Adrien Grand added a comment -

        I'm not familiar enough with Lucene anlysis to know whether it should be exposed in the factories, but in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract.

        Show
        Adrien Grand added a comment - I'm not familiar enough with Lucene anlysis to know whether it should be exposed in the factories, but in order not to break existing factories, I think it would be better to make this new method throw UOE by default instead of being abstract.
        Hide
        Steve Rowe added a comment -

        Hi Renaud, I'll review some time in the next week.

        Show
        Steve Rowe added a comment - Hi Renaud, I'll review some time in the next week.
        Hide
        Renaud Delbru added a comment -

        Could someone from the team tell us if this patch may be considered for inclusion at some point ? We currently need it in our project, and therefore it is kind of blocking us in our development. Thanks.

        Show
        Renaud Delbru added a comment - Could someone from the team tell us if this patch may be considered for inclusion at some point ? We currently need it in our project, and therefore it is kind of blocking us in our development. Thanks.
        Hide
        Renaud Delbru added a comment -

        Hi,

        Any plan to commit this patch ? Or is there additional work to do before ?

        thanks

        Show
        Renaud Delbru added a comment - Hi, Any plan to commit this patch ? Or is there additional work to do before ? thanks
        Hide
        Renaud Delbru added a comment -

        Patch adding #create(AttributeSource source, Reader reader) to the TokenizerFactory class and to all its subclasses.

        Given there are a lot of tokenizers that do not have constructors that take a given AttributeSource, I have implemented the new create method for their respective factory which throws a UnsupportedOperationException.

        Show
        Renaud Delbru added a comment - Patch adding #create(AttributeSource source, Reader reader) to the TokenizerFactory class and to all its subclasses. Given there are a lot of tokenizers that do not have constructors that take a given AttributeSource, I have implemented the new create method for their respective factory which throws a UnsupportedOperationException.

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Renaud Delbru
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development