Lucene - Core / LUCENE-2016

replace invalid U+FFFF character during indexing

Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      If the invalid U+FFFF character is embedded in a token, indexing silently corrupts the index by writing duplicate terms into the terms dict. CheckIndex will catch the error, and merging will hit exceptions (I think).

      We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
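      As a rough illustration of the replacement described above, here is a minimal sketch in plain Java. It is not the attached patch: the class name InvalidCharSanitizer and the method sanitize are hypothetical, and the sketch works on a whole String rather than inside the indexing chain. It only shows the idea of mapping U+FFFF and unpaired surrogates to U+FFFD while leaving valid surrogate pairs untouched.

```java
// Hypothetical helper, not part of Lucene: illustrates the replacement idea only.
final class InvalidCharSanitizer {

    // Unicode replacement character U+FFFD.
    static final char REPLACEMENT = '\uFFFD';

    // Noncharacter U+FFFF, which this issue proposes to replace.
    static final char INVALID = '\uFFFF';

    private InvalidCharSanitizer() {}

    // Returns a copy of the token in which U+FFFF and any unpaired
    // surrogate are replaced by U+FFFD; valid surrogate pairs are kept.
    static String sanitize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < token.length()
                        && Character.isLowSurrogate(token.charAt(i + 1));
                if (paired) {
                    // Valid pair: copy both code units and skip the low half.
                    out.append(c).append(token.charAt(i + 1));
                    i++;
                } else {
                    // High surrogate with no low surrogate after it.
                    out.append(REPLACEMENT);
                }
            } else if (Character.isLowSurrogate(c) || c == INVALID) {
                // Stray low surrogate, or the U+FFFF noncharacter.
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

      With this sketch, a token containing U+FFFF between "ab" and "cd" comes back with U+FFFD in that position, so two tokens that differ only there would map to the same sanitized term.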

      Attachments

        1. LUCENE-2016.patch (2 kB, Michael McCandless)

          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: Michael McCandless (mikemccand)
