Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8651

Tokenizer implementations can't be reset

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • modules/analysis
    • None
    • New, Patch Available

    Description

      The fine print here is that they can't be reset without calling setReader() every time before reset() is called. The reason for this is that Tokenizer violates the contract put forth by TokenStream.reset() which is the following:

      "Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh."

      Tokenizer implementation's reset function can't reset in that manner because their Tokenizer.close() removes the reference to the underlying Reader because of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep around a Reader (memory leak) but we would like to be able to reset() if necessary.

      The patches include an integration test that attempts to use a ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer TokenStream. This test fails with an IllegalStateException thrown by Tokenizer.ILLEGAL_STATE_READER.

       

      Attachments

        1. LUCENE-8651.patch
          10 kB
          Alan Woodward
        2. LUCENE-8651.patch
          2 kB
          Dan Meehl
        3. LUCENE-8650-2.patch
          11 kB
          Dan Meehl

        Issue Links

          Activity

            People

              Unassigned Unassigned
              dmeehl Dan Meehl
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: