Solr
  1. Solr
  2. SOLR-1377

Force TokenizerFactory to create a Tokenizer rather then TokenStream

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      The new token reuse classes require that they are created with a Tokenizer. The solr TokenizerFactory interface currently makes a TokenStream.

      Although this is an API breaking change, the alternative is to just document that it needs to be a Tokenizer instance and throw an error when it is not.

      For more discussion, see:
      http://www.lucidimagination.com/search/document/272b8c4e6198d887/trunk_classcastexception_with_basetokenizerfactory

      1. SOLR-1377.patch
        10 kB
        Yonik Seeley
      2. SOLR-1377-Tokenizer.patch
        9 kB
        Ryan McKinley

        Activity

        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4
        Hide
        Ryan McKinley added a comment -

        Without objection... I will commit this soon....

        Show
        Ryan McKinley added a comment - Without objection... I will commit this soon....
        Hide
        Yonik Seeley added a comment -

        For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change.

        And it does now... I moved the creation of the Token<List> to init() so it's recreated with every reset.

        Any reason not to implement reset on: TrieTokenizerFactory?

        TrieTokenizer (right below the factory) already implements reset(Reader).

        Show
        Yonik Seeley added a comment - For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change. And it does now... I moved the creation of the Token<List> to init() so it's recreated with every reset. Any reason not to implement reset on: TrieTokenizerFactory? TrieTokenizer (right below the factory) already implements reset(Reader).
        Hide
        Ryan McKinley added a comment -

        Is reset gaurenteed to be called on the same Reader? For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change.

        + public void reset(Reader input) throws IOException

        { + super.reset(input); + init(); + }

        Any reason not to implement reset on: TrieTokenizerFactory?

        Show
        Ryan McKinley added a comment - Is reset gaurenteed to be called on the same Reader? For the Pattern implementation, all the tokens are created beforehand and are just passed off with iter.next(), so if the input changes, the whole thing would need to change. + public void reset(Reader input) throws IOException { + super.reset(input); + init(); + } Any reason not to implement reset on: TrieTokenizerFactory?
        Hide
        Yonik Seeley added a comment -

        Uploading another patch based on yours that implements reuse (reset(Reader)) for the Tokenizers.

        +1

        Show
        Yonik Seeley added a comment - Uploading another patch based on yours that implements reuse (reset(Reader)) for the Tokenizers. +1
        Hide
        Ryan McKinley added a comment -

        Here is a patch that:

        1. Changes the TokenizerFactory to return a Tokenizer
        2. Updates all TokenizerFactory classes to explicitly return a Tokenizer
        3. Changes the PatternTokenizerFactory to return a Tokenizer
        4. adds a test that calls PatternTokenizer

        • - -

        Since this is an API breaking change, I added this to the "Upgrading from Solr 1.3" section of CHANGES.txt:

        The TokenizerFactory API has changed to explicitly return a Tokenizer rather then
        a TokenStream (that may be or may not be a Tokenizer). This change is required
        to take advantage of the Token reuse improvements in lucene 2.9. For more
        information, see SOLR-1377.

        I'll wait for two +1 votes on this, since it does break back compatibility

        Show
        Ryan McKinley added a comment - Here is a patch that: 1. Changes the TokenizerFactory to return a Tokenizer 2. Updates all TokenizerFactory classes to explicitly return a Tokenizer 3. Changes the PatternTokenizerFactory to return a Tokenizer 4. adds a test that calls PatternTokenizer - - Since this is an API breaking change, I added this to the "Upgrading from Solr 1.3" section of CHANGES.txt: The TokenizerFactory API has changed to explicitly return a Tokenizer rather then a TokenStream (that may be or may not be a Tokenizer). This change is required to take advantage of the Token reuse improvements in lucene 2.9. For more information, see SOLR-1377 . I'll wait for two +1 votes on this, since it does break back compatibility

          People

          • Assignee:
            Ryan McKinley
            Reporter:
            Ryan McKinley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development