Lucene - Core
  1. Lucene - Core
  2. LUCENE-4343

Clear up more Tokenizer.setReader/TokenStream.reset issues

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      spinoff from user-list thread.

      I think the rename helps, but the javadocs still have problems: they seem to only describe a totally wacky case (CachingTokenFilter) and not the normal case.

      Ideally setReader would be final I think, but there are a few crazy tokenstreams to fix before I could make that work. Would also need something hackish so MockTokenizer's state machine is still functional.

      But i worked on fixing up the mess in our various tokenstreams, which is easy for the most part.

      As part of this I found it was really useful in flushing out test bugs (ones that dont use MockTokenizer, which they really should), if we can do some best-effort exceptions when the consumer is broken and it costs nothing.

      For example:

      -  private int offset = 0, bufferIndex = 0, dataLen = 0, finalOffset = 0;
      +  // note: bufferIndex is -1 here to best-effort AIOOBE consumers that don't call reset()
      +  private int offset = 0, bufferIndex = -1, dataLen = 0, finalOffset = 0;
      

      I think this is worth exploring more... this was really effective at finding broken tests etc. We should see if we can be more thorough/ideally throw better exceptions when consumers are broken and its free.

      1. LUCENE-4343.patch
        23 kB
        Robert Muir
      2. LUCENE-4343.patch
        23 kB
        Robert Muir
      3. LUCENE-4343.patch
        19 kB
        Robert Muir
      4. LUCENE-4343.patch
        17 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        Here's a first stab.

        Show
        Robert Muir added a comment - Here's a first stab.
        Hide
        Michael McCandless added a comment -

        +1

        We had a lot of tokenizers abusing setReader for stuff they should be doing in reset!

        It would be really nice to make setReader final but that sounds like a challenge...

        Show
        Michael McCandless added a comment - +1 We had a lot of tokenizers abusing setReader for stuff they should be doing in reset! It would be really nice to make setReader final but that sounds like a challenge...
        Hide
        Robert Muir added a comment -

        I think we can still make it final: we just have 2 or 3 bad apples to figure out

        Show
        Robert Muir added a comment - I think we can still make it final: we just have 2 or 3 bad apples to figure out
        Hide
        Chris Male added a comment -

        +1 to the improvements and pursuing making it final.

        Show
        Chris Male added a comment - +1 to the improvements and pursuing making it final.
        Hide
        Robert Muir added a comment -

        updated patch:

        • MockTokenizer's state machine works via an assert-only testpoint.
        • fixed PatternTokenizer to not consume in its ctor

        two more tokenstreams in solr to fix and we are good: PreAnalyzedField and TrieTokenizerFactory.

        Show
        Robert Muir added a comment - updated patch: MockTokenizer's state machine works via an assert-only testpoint. fixed PatternTokenizer to not consume in its ctor two more tokenstreams in solr to fix and we are good: PreAnalyzedField and TrieTokenizerFactory.
        Hide
        Robert Muir added a comment -

        Updated patch: with setReader as final

        Show
        Robert Muir added a comment - Updated patch: with setReader as final
        Hide
        Michael McCandless added a comment -

        Updated patch: with setReader as final

        Yay! Patch looks great. +1

        Show
        Michael McCandless added a comment - Updated patch: with setReader as final Yay! Patch looks great. +1
        Hide
        Robert Muir added a comment -

        Just a tiny update to make PreAnalyzedTokenizer more correct.

        Show
        Robert Muir added a comment - Just a tiny update to make PreAnalyzedTokenizer more correct.
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development