Lucene - Core / LUCENE-1906

Backwards problems with CharStream and Tokenizers with custom reset(Reader) method

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When reviewing the new CharStream code added to Tokenizers, I found a
      serious backwards-compatibility problem affecting Tokenizers that do
      not override reset(CharStream).

      The problem is that, for example, CharTokenizer only overrides reset(Reader):

        public void reset(Reader input) throws IOException {
          super.reset(input);
          bufferIndex = 0;
          offset = 0;
          dataLen = 0;
        }
      

      If you reset such a Tokenizer with another CharStream (rather than a plain
      Reader), reset(CharStream) is invoked instead, so this method is never
      called. The Tokenizer's internal state is left stale, which breaks the
      whole Tokenizer.
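
      The dispatch problem can be reproduced in isolation. The following is a
      minimal sketch, not the actual Lucene source: the Sketch* classes are
      hypothetical stand-ins for CharStream, CharReader, Tokenizer and
      CharTokenizer, and the leftover field values are invented for
      illustration. It relies only on the fact that Java picks an overload at
      compile time from the static type of the argument.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for CharStream: a Reader that can correct offsets.
abstract class SketchCharStream extends Reader {
    abstract int correctOffset(int currentOff);
}

// Hypothetical stand-in for CharReader: wraps a plain Reader as a CharStream.
class SketchCharReader extends SketchCharStream {
    private final Reader in;
    SketchCharReader(Reader in) { this.in = in; }
    @Override int correctOffset(int currentOff) { return currentOff; }
    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        return in.read(cbuf, off, len);
    }
    @Override public void close() throws IOException { in.close(); }
}

// Hypothetical stand-in for Tokenizer with both reset overloads.
class SketchTokenizer {
    protected Reader input;
    public void reset(Reader input) throws IOException { this.input = input; }
    public void reset(SketchCharStream input) throws IOException { this.input = input; }
}

// Hypothetical stand-in for CharTokenizer: overrides reset(Reader) only.
class SketchCharTokenizer extends SketchTokenizer {
    // Invented leftover state, as if a previous document had been tokenized.
    int bufferIndex = 5, offset = 42, dataLen = 7;

    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
    }
}

class ResetDispatchDemo {
    // Argument has static type SketchCharStream, so the compiler selects
    // reset(SketchCharStream); the subclass override is bypassed.
    static int offsetAfterCharStreamReset() {
        SketchCharTokenizer t = new SketchCharTokenizer();
        SketchCharStream cs = new SketchCharReader(new StringReader("abc"));
        try {
            t.reset(cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return t.offset;
    }

    // Argument has static type Reader, so the overridden reset(Reader) runs.
    static int offsetAfterReaderReset() {
        SketchCharTokenizer t = new SketchCharTokenizer();
        Reader r = new StringReader("abc");
        try {
            t.reset(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return t.offset;
    }

    public static void main(String[] args) {
        System.out.println(offsetAfterCharStreamReset()); // prints 42 (stale state)
        System.out.println(offsetAfterReaderReset());     // prints 0
    }
}
```

      In this sketch the stale offset survives the CharStream reset, which is
      the same kind of breakage described above for CharTokenizer.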

      As CharStream extends Reader, I propose to remove the reset(CharStream)
      method and simply do an instanceof check to detect whether the supplied
      Reader is not a CharStream, and wrap it if so. We could also remove the
      extra ctor (because most Tokenizers have no support for passing
      CharStreams). If the ctor also checks with instanceof and wraps as needed,
      the code is backwards compatible and we do not need to add additional
      ctors in subclasses.

      As this instanceof check is always done in CharReader.get(), why not remove
      ctor(CharStream) and reset(CharStream) completely?
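
      The proposal could look roughly like the sketch below. It is an
      illustration under assumptions, not the committed patch: the Fix*
      classes are hypothetical stand-ins, FixCharReader.get(Reader) mimics the
      instanceof-and-wrap behaviour attributed to CharReader.get() above, and
      the leftover field values are invented.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for CharStream.
abstract class FixCharStream extends Reader {
    abstract int correctOffset(int currentOff);
}

// Hypothetical stand-in for CharReader with an instanceof-based factory.
class FixCharReader extends FixCharStream {
    private final Reader in;
    private FixCharReader(Reader in) { this.in = in; }

    // Wrap only plain Readers; pass an existing CharStream through unchanged.
    static FixCharStream get(Reader input) {
        return (input instanceof FixCharStream)
                ? (FixCharStream) input
                : new FixCharReader(input);
    }

    @Override int correctOffset(int currentOff) { return currentOff; }
    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        return in.read(cbuf, off, len);
    }
    @Override public void close() throws IOException { in.close(); }
}

// Hypothetical Tokenizer with a single ctor and a single reset entry point.
class FixTokenizer {
    protected FixCharStream input;

    FixTokenizer(Reader input) { this.input = FixCharReader.get(input); }

    public void reset(Reader input) throws IOException {
        this.input = FixCharReader.get(input);
    }
}

// Subclass is unchanged: it still overrides reset(Reader) only.
class FixCharTokenizer extends FixTokenizer {
    // Invented leftover state from a previous document.
    int bufferIndex = 5, offset = 42, dataLen = 7;

    FixCharTokenizer(Reader input) { super(input); }

    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
    }
}

class ResetFixDemo {
    // Resets a tokenizer holding stale state and reports the resulting offset.
    static int offsetAfterReset(Reader next) {
        FixCharTokenizer t = new FixCharTokenizer(new StringReader("old"));
        try {
            t.reset(next);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return t.offset;
    }

    public static void main(String[] args) {
        FixCharStream cs = FixCharReader.get(new StringReader("abc"));
        System.out.println(offsetAfterReset(cs));                      // prints 0
        System.out.println(offsetAfterReset(new StringReader("abc"))); // prints 0
        System.out.println(FixCharReader.get(cs) == cs);               // prints true
    }
}
```

      With only one reset(Reader) overload, every caller reaches the subclass
      override regardless of whether it passes a plain Reader or a CharStream,
      and subclasses need no additional ctors.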

      Any thoughts?

      I would like to fix this somehow before RC4, I'm sorry.

        Attachments

        1. backwards-break.patch
          4 kB
          Uwe Schindler
        2. LUCENE-1906_contrib.patch
          6 kB
          Robert Muir
        3. LUCENE-1906.patch
          22 kB
          Uwe Schindler
        4. LUCENE-1906.patch
          16 kB
          Uwe Schindler
        5. LUCENE-1906.patch
          8 kB
          Uwe Schindler
        6. LUCENE-1906.patch
          2 kB
          Uwe Schindler
        7. LUCENE-1906-bw.patch
          6 kB
          Uwe Schindler

              People

              • Assignee: Uwe Schindler (thetaphi)
              • Reporter: Uwe Schindler (thetaphi)