Lucy / LUCY-179

Tighten UTF-8 validity checks.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0 (incubating)
    • Component/s: Util
    • Labels:
      None

      Description

      Lucy currently outsources UTF-8 validity checking to the Perl C API function
      is_utf8_string(). This suffices for sanity checking of basic byte sequences
      and for detecting non-shortest-form encodings. However, is_utf8_string()
      only validates against Perl's loose internal "utf8" format[1], so it lets
      through certain constructs that we should reject: UTF-8 encoded UTF-16
      surrogates (both paired and isolated) and code points above 0x10FFFF.
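
      A minimal sketch of the stricter check described above, for illustration
      only. The function name demo_utf8_valid() is hypothetical and is not part
      of Lucy or the Perl C API; it assumes the input is a byte buffer with an
      explicit length. It rejects non-shortest forms, surrogates and code
      points above 0x10FFFF, and it deliberately accepts noncharacters such as
      U+FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Lucy's actual API): returns true if buf[0..len)
 * is well-formed UTF-8 under the stricter rules proposed in this issue. */
bool
demo_utf8_valid(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint8_t  b0 = buf[i];
        uint32_t cp;     /* decoded code point */
        uint32_t min;    /* smallest code point allowed for this length */
        size_t   extra;  /* number of continuation bytes expected */

        if (b0 < 0x80) { i++; continue; }  /* ASCII */
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min = 0x10000; }
        else { return false; }  /* stray continuation byte or invalid lead */

        /* Reject sequences truncated by the end of the buffer. */
        if (len - i - 1 < extra) { return false; }

        for (size_t k = 1; k <= extra; k++) {
            uint8_t b = buf[i + k];
            if ((b & 0xC0) != 0x80) { return false; }
            cp = (cp << 6) | (uint32_t)(b & 0x3F);
        }

        if (cp < min) { return false; }                       /* non-shortest form */
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; }   /* UTF-16 surrogate */
        if (cp > 0x10FFFF) { return false; }                  /* beyond Unicode range */

        i += extra + 1;
    }
    return true;
}
```

      Decoding the full code point before range-checking keeps the three
      rejection rules visible as separate tests; a production validator might
      instead check byte ranges directly.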

      Since Lucy is a library rather than an application, we should continue to
      pass through "noncharacter" code points such as U+FFFF, which are
      discouraged for "public exchange"[2] but allowed for internal application
      use (for example, as sentinels or separators). These code points will be
      allowed to end up in indexes; it will be the responsibility of the
      application to filter them at input or output.
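
      An application that wants to filter noncharacters itself could use a
      predicate along these lines. This is a sketch only: the name
      demo_is_noncharacter() is hypothetical, and the ranges used are the 66
      noncharacters defined by the Unicode Standard (U+FDD0..U+FDEF plus the
      last two code points of each plane), not something this issue specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-side helper (not part of Lucy): returns true if
 * cp is one of the 66 Unicode noncharacters. The caller is assumed to pass
 * a code point no greater than 0x10FFFF. */
bool
demo_is_noncharacter(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    /* Contiguous block of 32 noncharacters in the BMP. */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) { return true; }
    /* Last two code points of every plane: U+nFFFE and U+nFFFF. */
    return (cp & 0xFFFE) == 0xFFFE;
}
```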

      [1] http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

      [2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf section 3.2, clause C2

            People

            • Assignee:
              marvin Marvin Humphrey
              Reporter:
              marvin Marvin Humphrey