Solr
  1. Solr
  2. SOLR-1377

Force TokenizerFactory to create a Tokenizer rather then TokenStream

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      The new token reuse classes require that they are created with a Tokenizer. The solr TokenizerFactory interface currently makes a TokenStream.

      Although this is an API breaking change, the alternative is to just document that it needs to be a Tokenizer instance and throw an error when it is not.

      For more discussion, see:
      http://www.lucidimagination.com/search/document/272b8c4e6198d887/trunk_classcastexception_with_basetokenizerfactory

      1. SOLR-1377.patch
        10 kB
        Yonik Seeley
      2. SOLR-1377-Tokenizer.patch
        9 kB
        Ryan McKinley

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Resolved Resolved
        3d 1h 26m 1 Ryan McKinley 24/Aug/09 19:59
        Resolved Resolved Closed Closed
        77d 20h 52m 1 Grant Ingersoll 10/Nov/09 15:52
        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4
        Ryan McKinley made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Ryan McKinley added a comment -

        Without objection... I will commit this soon....

        Show
        Ryan McKinley added a comment - Without objection... I will commit this soon....
        Hide
        Yonik Seeley added a comment -

        For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change.

        And it does now... I moved the creation of the Token<List> to init() so it's recreated with every reset.

        Any reason not to implement reset on: TrieTokenizerFactory?

        TrieTokenizer (right below the factory) already implements reset(Reader).

        Show
        Yonik Seeley added a comment - For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change. And it does now... I moved the creation of the Token<List> to init() so it's recreated with every reset. Any reason not to implement reset on: TrieTokenizerFactory? TrieTokenizer (right below the factory) already implements reset(Reader).
        Hide
        Ryan McKinley added a comment -

        Is reset gaurenteed to be called on the same Reader? For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change.

        + public void reset(Reader input) throws IOException

        { + super.reset(input); + init(); + }

        Any reason not to implement reset on: TrieTokenizerFactory?

        Show
        Ryan McKinley added a comment - Is reset gaurenteed to be called on the same Reader? For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change. + public void reset(Reader input) throws IOException { + super.reset(input); + init(); + } Any reason not to implement reset on: TrieTokenizerFactory?
        Yonik Seeley made changes -
        Attachment SOLR-1377.patch [ 12417321 ]
        Hide
        Yonik Seeley added a comment -

        Uploading another patch based on yours that implements reuse (reset(Reader)) for the Tokenizers.

        +1

        Show
        Yonik Seeley added a comment - Uploading another patch based on yours that implements reuse (reset(Reader)) for the Tokenizers. +1
        Ryan McKinley made changes -
        Field Original Value New Value
        Attachment SOLR-1377-Tokenizer.patch [ 12417304 ]
        Hide
        Ryan McKinley added a comment -

        Here is a patch that:

        1. Changes the TokenizerFactory to return a Tokenizer
        2. Updates all TokenizerFactory classes to explicitly return a Tokenizer
        3. Changes the PatternTokenizerFactory to return a Tokenizer
        4. adds a test that calls PatternTokenizer

        • - -

        Since this is an API breaking change, I added this to the "Upgrading from Solr 1.3" section of CHANGES.txt:

        The TokenizerFactory API has changed to explicitly return a Tokenizer rather then
        a TokenStream (that may be or may not be a Tokenizer). This change is required
        to take advantage of the Token reuse improvements in lucene 2.9. For more
        information, see SOLR-1377.

        I'll wait for two +1 votes on this, since it does break back compatibility

        Show
        Ryan McKinley added a comment - Here is a patch that: 1. Changes the TokenizerFactory to return a Tokenizer 2. Updates all TokenizerFactory classes to explicitly return a Tokenizer 3. Changes the PatternTokenizerFactory to return a Tokenizer 4. adds a test that calls PatternTokenizer - - Since this is an API breaking change, I added this to the "Upgrading from Solr 1.3" section of CHANGES.txt: The TokenizerFactory API has changed to explicitly return a Tokenizer rather then a TokenStream (that may be or may not be a Tokenizer). This change is required to take advantage of the Token reuse improvements in lucene 2.9. For more information, see SOLR-1377 . I'll wait for two +1 votes on this, since it does break back compatibility
        Ryan McKinley created issue -

          People

          • Assignee:
            Ryan McKinley
            Reporter:
            Ryan McKinley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development