Lucene - Core
  1. Lucene - Core
  2. LUCENE-5240

additional safety in Tokenizer state machine

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

         * <b>NOTE:</b> 
         * The default implementation closes the input Reader, so
         * be sure to call <code>super.close()</code> when overriding this method.
         */
        @Override
        public void close() throws IOException {
      

      We can add a simple check for this easily now in setReader. I found a few bugs, and fixed all except TrieTokenizer in solr (I am lost here... somewhere i have a patch to remove this thing).

      1. LUCENE-5240.patch
        22 kB
        Robert Muir
      2. LUCENE-5240.patch
        21 kB
        Robert Muir
      3. LUCENE-5240.patch
        8 kB
        Robert Muir

        Activity

        Hide
        Uwe Schindler added a comment -

        This is a good idea!

        Show
        Uwe Schindler added a comment - This is a good idea!
        Hide
        Michael McCandless added a comment -

        +1, this is great. Fun to see all the tests that were violating the TS API ...

        Show
        Michael McCandless added a comment - +1, this is great. Fun to see all the tests that were violating the TS API ...
        Hide
        Uwe Schindler added a comment -

        I don't know the problem of TrieTokenizer. To me, this one looks good, but it may have a problem with the crazy "hasValue". This boolean seems to be used to emulate an empty TokenStream if the input is empty (which NumericTokenStream cannot do).

        The TrieFields no longer need a Tokenizer in Solr, so we should remove this one (it now uses NumericField internally).

        Show
        Uwe Schindler added a comment - I don't know the problem of TrieTokenizer. To me, this one looks good, but it may have a problem with the crazy "hasValue". This boolean seems to be used to emulate an empty TokenStream if the input is empty (which NumericTokenStream cannot do). The TrieFields no longer need a Tokenizer in Solr, so we should remove this one (it now uses NumericField internally).
        Hide
        Robert Muir added a comment -

        updated patch including removal of TrieTokenizer... testing it now.

        Show
        Robert Muir added a comment - updated patch including removal of TrieTokenizer... testing it now.
        Hide
        Robert Muir added a comment -

        Updated patch: fixes a bug in AnalysisRequestHandlerBase found by a test.

        Show
        Robert Muir added a comment - Updated patch: fixes a bug in AnalysisRequestHandlerBase found by a test.
        Hide
        Uwe Schindler added a comment - - edited

        OK! Thanks for removing TrieTokencorrumpter! Just one question: Is the TrieTokenizer really not needed for highlighting? I know, the TrieField is now "tokenized=false", but I have no idea how this affects the highlighter. On the other hand, highlighting a trie field is completely useless

        I know, ElasticSearch has cloned TrieTokenizer and it uses it for highlighting... Adrien knows more, he did that porting.

        For 4.x, we have to add TrieTokenizerFactory to the backwards breaks, because somebody might have used it... We should suggest to use a real TrieField in that case!

        Show
        Uwe Schindler added a comment - - edited OK! Thanks for removing TrieTokencorrumpter! Just one question: Is the TrieTokenizer really not needed for highlighting? I know, the TrieField is now "tokenized=false", but I have no idea how this affects the highlighter. On the other hand, highlighting a trie field is completely useless I know, ElasticSearch has cloned TrieTokenizer and it uses it for highlighting... Adrien knows more, he did that porting. For 4.x, we have to add TrieTokenizerFactory to the backwards breaks, because somebody might have used it... We should suggest to use a real TrieField in that case!
        Hide
        Robert Muir added a comment -

        solr doesnt support highlighting trie fields... so its really not used at all today.

        In my opinion this tokenizer just causes us to scratch our heads all the time and thats it, no benefits just wasting of committers time

        I am happy to also only remove it in trunk, i dont care. for 4.x i could just svn revert the changes to Tokenizer.java but still backport all the fixes in the patch.

        Show
        Robert Muir added a comment - solr doesnt support highlighting trie fields... so its really not used at all today. In my opinion this tokenizer just causes us to scratch our heads all the time and thats it, no benefits just wasting of committers time I am happy to also only remove it in trunk, i dont care. for 4.x i could just svn revert the changes to Tokenizer.java but still backport all the fixes in the patch.
        Hide
        ASF subversion and git services added a comment -

        Commit 1529482 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1529482 ]

        LUCENE-5240: additional safety in Tokenizer state machine

        Show
        ASF subversion and git services added a comment - Commit 1529482 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1529482 ] LUCENE-5240 : additional safety in Tokenizer state machine
        Hide
        ASF subversion and git services added a comment -

        Commit 1529486 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1529486 ]

        LUCENE-5240: additional safety in Tokenizer state machine

        Show
        ASF subversion and git services added a comment - Commit 1529486 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1529486 ] LUCENE-5240 : additional safety in Tokenizer state machine

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development