Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I was working on a new Tokenizer... and I accidentally forgot to call super(input) (and super.reset(input) from my reset method)... which then meant my correctOffset() calls were silently a no-op; this is very trappy.

      Fortunately the awesome BaseTokenStreamTestCase caught this (I hit failures because the offsets were not in fact being corrected).

      One minimal thing we can do (but it sounds like from Robert there may be reasons why we can't) is add assert input != null in Tokenizer.correctOffset:

      Index: lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java
      ===================================================================
      --- lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java	(revision 1242316)
      +++ lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java	(working copy)
      @@ -82,6 +82,7 @@
          * @see CharStream#correctOffset
          */
         protected final int correctOffset(int currentOff) {
      +    assert input != null: "subclass failed to call super(Reader) or super.reset(Reader)";
           return (input instanceof CharStream) ? ((CharStream) input).correctOffset(currentOff) : currentOff;
         }
      

      But best would be to remove the default ctor that leaves input null...

      1. LUCENE-3766.patch
        38 kB
        Robert Muir
      2. LUCENE-3766.patch
        19 kB
        Michael McCandless

        Activity

        Hide
        Robert Muir added a comment -

        I think its strange that tokenizer allows a null reader, but we can forget about that temporarily and nuke this default ctor.

        Then, we can put assert reader != null in the 'real' ctor and see which analyzers fail.
        Maybe its only PatternAnalyzer and it could be implemented differently/properly (e.g. make a TokenStream).

        Then we could also keep our assert.

        Show
        Robert Muir added a comment - I think its strange that tokenizer allows a null reader, but we can forget about that temporarily and nuke this default ctor. Then, we can put assert reader != null in the 'real' ctor and see which analyzers fail. Maybe its only PatternAnalyzer and it could be implemented differently/properly (e.g. make a TokenStream). Then we could also keep our assert.
        Hide
        Uwe Schindler added a comment -

        As far as I remember this was because of Solr. Some tokenizers in Solr were originally TokenStreams. After we restructured all of them, there is no reason to keep default ctor or allow null at all.

        Show
        Uwe Schindler added a comment - As far as I remember this was because of Solr. Some tokenizers in Solr were originally TokenStreams. After we restructured all of them, there is no reason to keep default ctor or allow null at all.
        Hide
        Michael McCandless added a comment -

        Here's an initial patch. Tests pass, but, it's hacky to pass new StringReader("") to all those test tokenizers... really they should be TokenStreams instead.

        Show
        Michael McCandless added a comment - Here's an initial patch. Tests pass, but, it's hacky to pass new StringReader("") to all those test tokenizers... really they should be TokenStreams instead.
        Hide
        Robert Muir added a comment -

        updated patch: i tried to clean up these tests some. I think its good to go now.

        Show
        Robert Muir added a comment - updated patch: i tried to clean up these tests some. I think its good to go now.
        Hide
        Michael McCandless added a comment -

        +1

        I like CannedTokenStream

        Show
        Michael McCandless added a comment - +1 I like CannedTokenStream

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development