[LUCY-179] Tighten UTF-8 validity checks. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.3.0 (incubating)
Component/s: Util
Labels: None

Description

Lucy currently outsources UTF-8 validity checking to the Perl C API function
is_utf8_string(). This suffices for sanity checking of basic byte sequences
and for detecting non-shortest-form encodings, but since is_utf8_string() only
validates to the loose Perl internal "utf8" format[1], it lets through certain
constructs we should probably reject: UTF-8-encoded UTF-16 surrogates (both
paired and isolated), and code points above 0x10FFFF.
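
To make the stricter rules concrete, here is a minimal sketch of a validator
that rejects non-shortest forms, surrogates, and code points above 0x10FFFF.
The function name strict_utf8_valid() is hypothetical and is not taken from
the attached patch; it only illustrates the checks described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not Lucy's actual API): returns true
 * only if buf[0..len) is well-formed UTF-8 in the strict Unicode sense. */
bool
strict_utf8_valid(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t  b0 = buf[i];
        size_t   need;  /* number of continuation bytes expected */
        uint32_t cp;    /* decoded code point */
        uint32_t min;   /* smallest code point allowed at this length */

        if (b0 < 0x80) { i++; continue; }  /* ASCII */
        else if ((b0 & 0xE0) == 0xC0) { need = 1; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { need = 2; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { need = 3; cp = b0 & 0x07; min = 0x10000; }
        else { return false; }  /* stray continuation byte or 0xF8..0xFF */

        if (len - i <= need) { return false; }  /* truncated sequence */

        for (size_t k = 1; k <= need; k++) {
            uint8_t b = buf[i + k];
            if ((b & 0xC0) != 0x80) { return false; }  /* not a continuation */
            cp = (cp << 6) | (uint32_t)(b & 0x3F);
        }

        if (cp < min) { return false; }                       /* non-shortest form */
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; }   /* UTF-16 surrogate */
        if (cp > 0x10FFFF) { return false; }                  /* beyond Unicode range */

        i += need + 1;
    }
    return true;
}
```

Under this sketch, the three bytes ED A0 80 (an isolated surrogate, U+D800)
and the four bytes F4 90 80 80 (U+110000) are rejected, while EF BF BF
(U+FFFF, a noncharacter) is accepted.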

Since Lucy is not an application but rather a library, we should continue to
pass through "noncharacter" code points such as U+FFFF, which are discouraged
for "public exchange"[2] but are allowed for internal application use. (Such
code points may be useful as, e.g., sentinels or separators.) These code
points will be allowed to end up in indexes; it will be the responsibility of
the application to filter them at input or output.
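
An application that does want to filter noncharacters at its own boundary
could use a check along the following lines. This is only a sketch: the
function name is_noncharacter() is hypothetical and nothing like it is
implied to exist in Lucy. It relies on the Unicode definition of the 66
noncharacters: U+FDD0 through U+FDEF, plus the last two code points of each
of the 17 planes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-side helper (an assumption, not part of Lucy):
 * returns true if cp is one of the 66 Unicode noncharacters. */
bool
is_noncharacter(uint32_t cp) {
    if (cp > 0x10FFFF) { return false; }  /* not a Unicode code point at all */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) { return true; }
    /* U+nFFFE and U+nFFFF for every plane n in 0..16 */
    return (cp & 0xFFFE) == 0xFFFE;
}
```

A caller would decode each code point from its input and drop or replace
those for which this returns true, leaving Lucy itself to accept them.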

[1] http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

[2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf section 3.2, clause C2

Attachments

utf8_validation.patch (14 kB), attached by Marvin Humphrey, 06/Sep/11 04:58

People

Assignee: Marvin Humphrey

Reporter: Marvin Humphrey

Votes: 0

Watchers: 0

Dates

Created: 06/Sep/11 04:48

Updated: 06/Sep/11 22:21

Resolved: 06/Sep/11 22:21