[LUCENE-2016] replace invalid U+FFFF character during indexing

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4, 2.4.1, 2.9
Fix Version/s: 2.9.1, 3.0
Component/s: None
Labels: None
Lucene Fields: New

Description

If the invalid U+FFFF character is embedded in a token, indexing silently corrupts the index by writing duplicate terms into the terms dictionary. CheckIndex will catch the error, and merging will hit exceptions (I think).

We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
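As a rough illustration of the approach described above, the following is a minimal sketch and not the attached patch. It assumes term text arrives as a Java String; the class name TermTextSanitizer and the method sanitize are hypothetical and do not exist in Lucene. It replaces unpaired surrogates and U+FFFF with U+FFFD and leaves valid surrogate pairs untouched.

```java
/**
 * Hypothetical helper, not part of Lucene: shows one way to replace
 * U+FFFF and unpaired surrogates with the replacement character U+FFFD.
 */
final class TermTextSanitizer {
    private static final char INVALID_CHAR = '\uFFFF';
    private static final char REPLACEMENT_CHAR = '\uFFFD';

    private TermTextSanitizer() {}

    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    // Valid surrogate pair: copy both code units unchanged.
                    out.append(c).append(text.charAt(i + 1));
                    i++;
                } else {
                    // High surrogate with no low surrogate after it.
                    out.append(REPLACEMENT_CHAR);
                }
            } else if (Character.isLowSurrogate(c) || c == INVALID_CHAR) {
                // Stray low surrogate, or the U+FFFF noncharacter.
                out.append(REPLACEMENT_CHAR);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, calling TermTextSanitizer.sanitize on a token that contains U+FFFF between "ab" and "cd" would yield the same token with U+FFFD in that position. The real fix presumably operates on the indexer's internal character buffers rather than on String values, so treat this only as a picture of the replacement rule.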

Attachments

LUCENE-2016.patch
29/Oct/09 17:10
2 kB
Michael McCandless

People

Assignee: Michael McCandless

Reporter: Michael McCandless

Votes: 0

Watchers: 0

Dates

Created: 29/Oct/09 17:05

Updated: 28/Aug/22 12:12

Resolved: 07/Nov/09 14:53