Lucene - Core / LUCENE-2016

replace invalid U+FFFF character during indexing


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      If the invalid U+FFFF character is embedded in a token, indexing silently corrupts the index by writing duplicate terms into the terms dictionary. CheckIndex will catch the error, and merging will hit exceptions (I think).

      We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
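
      To illustrate the replacement described above, here is a minimal sketch in plain Java. It is not the code from the actual patch: the class name TokenSanitizer and the method sanitize are hypothetical, and it works on a String, whereas the real fix may live in a different layer of the indexing chain. It only shows the idea of mapping U+FFFF and unpaired surrogates to U+FFFD.

```java
// Hypothetical helper, not part of Lucene: shows the replacement idea only.
public final class TokenSanitizer {
    // U+FFFD is the Unicode replacement character.
    private static final char REPLACEMENT = '\uFFFD';
    // U+FFFF is a Unicode noncharacter; the issue reports it as invalid in tokens.
    private static final char INVALID = '\uFFFF';

    private TokenSanitizer() {}

    // Returns a copy of token in which U+FFFF and any unpaired surrogate
    // code unit are replaced with U+FFFD. Valid surrogate pairs are kept.
    public static String sanitize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        int i = 0;
        while (i < token.length()) {
            char c = token.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < token.length()
                        && Character.isLowSurrogate(token.charAt(i + 1));
                if (paired) {
                    // Well-formed pair: copy both code units unchanged.
                    out.append(c).append(token.charAt(i + 1));
                    i += 2;
                    continue;
                }
                // High surrogate with no low surrogate after it.
                out.append(REPLACEMENT);
            } else if (Character.isLowSurrogate(c) || c == INVALID) {
                // Low surrogate with no preceding high surrogate, or U+FFFF.
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

      With this sketch, a token consisting of "foo", U+FFFF, "bar" comes back as "foo", U+FFFD, "bar", while a token without invalid code units is returned unchanged.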

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Michael McCandless (mikemccand)
