Lucene - Core
  1. Lucene - Core
  2. LUCENE-1826

All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I have a TokenStream implementation that joins together multiple sub TokenStreams (i then do additional filtering on top of this, so i can't just have the indexer do the merging)

      in 2.4, this worked fine.
      once one sub stream was exhausted, i just started using the next stream

      however, in 2.9, this is very difficult, and requires copying Term buffers for every token being aggregated

      however, if all the sub TokenStreams share the same AttributeSource, and my "concat" TokenStream shares the same AttributeSource, this goes back to being very simple (and very efficient)

      So for example, i would like to see the following constructor added to StandardTokenizer:

        public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
          super(source);
          ...
        }
      

      would likewise want similar constructors added to all Tokenizer sub classes provided by lucene

      1. lucene-1826.patch
        26 kB
        Michael Busch

        Activity

        Tim Smith created issue -
        Hide
        Mark Miller added a comment -

        To make sure these issues are considered for 2.9 (and there is not much time for this - but we will certainly take advantage of your experiences upgrading here), you want to set the fix to 2.9 rather than the affects.

        Show
        Mark Miller added a comment - To make sure these issues are considered for 2.9 (and there is not much time for this - but we will certainly take advantage of your experiences upgrading here), you want to set the fix to 2.9 rather than the affects.
        Hide
        Tim Smith added a comment -

        i'll do that from now on (feel free to boot them if you feel necessary (didn't want to overstep my bounds suggesting fix in 2.9))

        Show
        Tim Smith added a comment - i'll do that from now on (feel free to boot them if you feel necessary (didn't want to overstep my bounds suggesting fix in 2.9))
        Tim Smith made changes -
        Field Original Value New Value
        Fix Version/s 2.9 [ 12312682 ]
        Hide
        Tim Smith added a comment -

        NOTE: for me, this is just a "nice to have"

        I currently only use my "concat" TokenStream on my own TokenStream implementations right now (so i can do this manually on my own TokenStream Impls)

        however i would like to be able to directly use lucene "Tokenizers" under my "concat" TokenStream under some situations in the future

        Show
        Tim Smith added a comment - NOTE: for me, this is just a "nice to have" I currently only use my "concat" TokenStream on my own TokenStream implementations right now (so i can do this manually on my own TokenStream Impls) however i would like to be able to directly use lucene "Tokenizers" under my "concat" TokenStream under some situations in the future
        Hide
        Mark Miller added a comment -

        didn't want to overstep my bounds suggesting fix in 2.9

        No worries - I think its good because it will force us to consider them before deciding to do something or move them out.

        Show
        Mark Miller added a comment - didn't want to overstep my bounds suggesting fix in 2.9 No worries - I think its good because it will force us to consider them before deciding to do something or move them out.
        Hide
        Tim Smith added a comment -

        This is further complicated by the fact that Tokenizers are often "held onto" in a thread local

        so, Tokenizer.reset(Reader) should also take an AttributeSource in order to really reset things properly

        also, then all TokenFilters/TokenStreams would be required to reinit their held onto "attributes" at reset() time, not at constructor time otherwise they could be holding onto stale attributes

        Show
        Tim Smith added a comment - This is further complicated by the fact that Tokenizers are often "held onto" in a thread local so, Tokenizer.reset(Reader) should also take an AttributeSource in order to really reset things properly also, then all TokenFilters/TokenStreams would be required to reinit their held onto "attributes" at reset() time, not at constructor time otherwise they could be holding onto stale attributes
        Hide
        Uwe Schindler added a comment -

        I was thinking about that, too. But it was too much work in my opinion – But for the same reason like LUCENE-1804, they should have this ctor (and the AttributeFactory one!).

        Show
        Uwe Schindler added a comment - I was thinking about that, too. But it was too much work in my opinion – But for the same reason like LUCENE-1804 , they should have this ctor (and the AttributeFactory one!).
        Uwe Schindler made changes -
        Summary All Tokenizer implementations should have constructor that takes an AttributeSource All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
        Hide
        Tim Smith added a comment -

        without the Tokenizer.reset(Reader, AttributeSource), i won't be able to reuse Tokenizer instances (will have to create a fresh one each time)
        this can get costly if each Tokenizer is layered with a bunch of TokenFilters
        obviously, adding this method would be nasty (and impose additional requirements on TokenFilters/Streams) but it would allow reusing the token streams to the utmost

        Question:
        Is the reflection penalty on the new TokenStream stuff incurred per root AttributeSource?, or per TokenFilter/TokenStream?
        that is, if i pass the same AttributeSource to 10 TokenStreams, is the reflection cost the same as if i passed it to just one?

        Show
        Tim Smith added a comment - without the Tokenizer.reset(Reader, AttributeSource), i won't be able to reuse Tokenizer instances (will have to create a fresh one each time) this can get costly if each Tokenizer is layered with a bunch of TokenFilters obviously, adding this method would be nasty (and impose additional requirements on TokenFilters/Streams) but it would allow reusing the token streams to the utmost Question: Is the reflection penalty on the new TokenStream stuff incurred per root AttributeSource?, or per TokenFilter/TokenStream? that is, if i pass the same AttributeSource to 10 TokenStreams, is the reflection cost the same as if i passed it to just one?
        Hide
        Uwe Schindler added a comment - - edited

        without the Tokenizer.reset(Reader, AttributeSource), i won't be able to reuse Tokenizer instances (will have to create a fresh one each time)

        This is not possible per design. The AttributeSource cannot be changed. It is created during creation of the classes (this is why it is in the ctor and nowhere else). For filters, the attributes come from the input token stream.

        EDIT: The TokenStream itsself is the AttributeSource, the ctor parameter AttributeSource only tells the ctor not to create new Attribute instances and reuse the maps from the given source. But Each TokenStream/Filter is always an AttributeSource itsself.

        Is the reflection penalty on the new TokenStream stuff incurred per root AttributeSource?, or per TokenFilter/TokenStream?

        The reflection penalty is one-time per class (because of static cache of "known" classes), so all attributeimpl are inspected one time when a new AttributeSouce like TokenStream is created. There is an additional reflection cost, when new attributes are added, but also one time per AttributeImpl class.
        Since the last changes in TokenStream the reflection is therefore no longer a penalty. The only problem is more work to construct an TokenStream (filling the LinkedHashMaps), because of that you should reuse TokenStream-chains.

        that is, if i pass the same AttributeSource to 10 TokenStreams, is the reflection cost the same as if i passed it to just one?

        No change!

        Show
        Uwe Schindler added a comment - - edited without the Tokenizer.reset(Reader, AttributeSource), i won't be able to reuse Tokenizer instances (will have to create a fresh one each time) This is not possible per design. The AttributeSource cannot be changed. It is created during creation of the classes (this is why it is in the ctor and nowhere else). For filters, the attributes come from the input token stream. EDIT: The TokenStream itsself is the AttributeSource, the ctor parameter AttributeSource only tells the ctor not to create new Attribute instances and reuse the maps from the given source. But Each TokenStream/Filter is always an AttributeSource itsself. Is the reflection penalty on the new TokenStream stuff incurred per root AttributeSource?, or per TokenFilter/TokenStream? The reflection penalty is one-time per class (because of static cache of "known" classes), so all attributeimpl are inspected one time when a new AttributeSouce like TokenStream is created. There is an additional reflection cost, when new attributes are added, but also one time per AttributeImpl class. Since the last changes in TokenStream the reflection is therefore no longer a penalty. The only problem is more work to construct an TokenStream (filling the LinkedHashMaps), because of that you should reuse TokenStream-chains. that is, if i pass the same AttributeSource to 10 TokenStreams, is the reflection cost the same as if i passed it to just one? No change!
        Michael Busch made changes -
        Assignee Michael Busch [ michaelbusch ]
        Hide
        Tim Smith added a comment -

        This is not possible per design. The AttributeSource cannot be changed.

        I fully understand why

        but...
        it should be rather easy to add a reset(AttributeSource input) to AttributeSource

        public void reset(AttributeSource input) {
            if (input == null) {
              throw new IllegalArgumentException("input AttributeSource must not be null");
            }
            this.attributes = input.attributes;
            this.attributeImpls = input.attributeImpls;
            this.factory = input.factory;
        }
        

        This would require making attributes and attributeImpls non-final (potentially reducing some jvm caching capabilities)

        However, this then provides the ability to do even more Attribute reuse
        For example, if this method existed, the Indexer could use a ThreadLocal of raw AttributeSources (one AttributeSource per thread)
        then, prior to calling TokenStream.reset(), it could call TokenStream.reset(ThreadLocal AttributeSource)

        This would result in all token streams for the same document using the same AttributeSource (reusing TermAttribute, etc)

        This would require that the no TokenStreams/Filters/Tokenizers call addAttribute() in the constructor (they would have to do this in reset())

        I totally get that this is a tall order
        If you want i can open a separate ticket for this (AttributeSource.reset(AttributeSource)) for further consideration

        Show
        Tim Smith added a comment - This is not possible per design. The AttributeSource cannot be changed. I fully understand why but... it should be rather easy to add a reset(AttributeSource input) to AttributeSource public void reset(AttributeSource input) { if (input == null ) { throw new IllegalArgumentException( "input AttributeSource must not be null " ); } this .attributes = input.attributes; this .attributeImpls = input.attributeImpls; this .factory = input.factory; } This would require making attributes and attributeImpls non-final (potentially reducing some jvm caching capabilities) However, this then provides the ability to do even more Attribute reuse For example, if this method existed, the Indexer could use a ThreadLocal of raw AttributeSources (one AttributeSource per thread) then, prior to calling TokenStream.reset(), it could call TokenStream.reset(ThreadLocal AttributeSource) This would result in all token streams for the same document using the same AttributeSource (reusing TermAttribute, etc) This would require that the no TokenStreams/Filters/Tokenizers call addAttribute() in the constructor (they would have to do this in reset()) I totally get that this is a tall order If you want i can open a separate ticket for this (AttributeSource.reset(AttributeSource)) for further consideration
        Hide
        Michael Busch added a comment -

        Patch adds additional constructors that take AttributeSource or AttributeFactory to all tokenizers (core + contrib).

        It doesn't add additional reset() methods. We have to discuss that more and I think it's too late for 2.9.

        I'll commit this tomorrow if nobody objects.

        Show
        Michael Busch added a comment - Patch adds additional constructors that take AttributeSource or AttributeFactory to all tokenizers (core + contrib). It doesn't add additional reset() methods. We have to discuss that more and I think it's too late for 2.9. I'll commit this tomorrow if nobody objects.
        Michael Busch made changes -
        Attachment lucene-1826.patch [ 12417343 ]
        Hide
        Tim Smith added a comment -

        i'll fork off another ticket for the reset(AttributeSource) method

        Show
        Tim Smith added a comment - i'll fork off another ticket for the reset(AttributeSource) method
        Hide
        Tim Smith added a comment -

        forked off the reset(AttributeSource) to LUCENE-1842

        Show
        Tim Smith added a comment - forked off the reset(AttributeSource) to LUCENE-1842
        Hide
        Michael Busch added a comment -

        Committed revision 806942.

        Show
        Michael Busch added a comment - Committed revision 806942.
        Michael Busch made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Mark Miller made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12473728 ] Default workflow, editable Closed status [ 12563848 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563848 ] jira [ 12584484 ]

          People

          • Assignee:
            Michael Busch
            Reporter:
            Tim Smith
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development