Lucene - Core / LUCENE-2016

replace invalid U+FFFF character during indexing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      If the invalid U+FFFF character is embedded in a token, indexing silently corrupts the index by writing duplicate terms into the terms dictionary. CheckIndex will catch the error, and merging will hit exceptions (I think).

      We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
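
      To illustrate the kind of replacement described above, here is a minimal sketch in plain Java. It is not the attached patch: the class name InvalidCharSanitizer and the method sanitize are hypothetical, and the sketch works on a String rather than on Lucene's internal term buffers. It assumes the intended behavior is to map U+FFFF and any unpaired surrogate to U+FFFD while leaving valid surrogate pairs untouched.

```java
// Hypothetical helper, for illustration only; not part of Lucene's API.
final class InvalidCharSanitizer {
    private static final char REPLACEMENT = '\uFFFD'; // Unicode replacement character
    private static final char INVALID = '\uFFFF';     // noncharacter named in this issue

    // Returns a copy of the token with U+FFFF and unpaired surrogates
    // replaced by U+FFFD. Valid high+low surrogate pairs are kept as-is.
    static String sanitize(String token) {
        char[] out = token.toCharArray();
        int i = 0;
        while (i < out.length) {
            char c = out[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < out.length && Character.isLowSurrogate(out[i + 1])) {
                    i += 2; // well-formed pair: skip both code units
                    continue;
                }
                out[i] = REPLACEMENT; // high surrogate with no low surrogate after it
            } else if (Character.isLowSurrogate(c) || c == INVALID) {
                out[i] = REPLACEMENT; // stray low surrogate, or U+FFFF
            }
            i++;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String cleaned = sanitize("foo\uFFFFbar");
        // Print each UTF-16 code unit in hex so the substitution is visible.
        for (char c : cleaned.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

      Replacing in place rather than dropping the character keeps the token length unchanged, so any offsets computed from the original text would still line up. Whether the actual fix preserves length in the same way is an assumption here.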

        Attachments

        1. LUCENE-2016.patch
          2 kB
          Michael McCandless

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Michael McCandless (mikemccand)