  1. Lucy
  2. LUCY-179

Tighten UTF-8 validity checks.

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0 (incubating)
    • Component/s: Util
    • Labels: None

    Description

      Lucy currently outsources UTF-8 validity checking to the Perl C API function
      is_utf8_string(). This suffices for sanity-checking basic byte sequences and
      for detecting non-shortest-form encodings. However, is_utf8_string() only
      validates against Perl's loose internal "utf8" format[1], so it lets through
      certain constructs we should probably reject: UTF-8-encoded UTF-16 surrogates
      (both paired and isolated) and code points above 0x10FFFF.
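
      The checks described above can be sketched in C roughly as follows. This
      is only an illustration, not the contents of the attached patch:
      strict_utf8_valid() is a hypothetical name, and the decode-then-check
      structure is an assumption made for readability rather than speed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Lucy's actual API): returns true if the `len`
 * bytes at `buf` are strictly valid UTF-8.  Rejects non-shortest forms,
 * UTF-16 surrogates, and code points above 0x10FFFF.  Noncharacters such
 * as U+FFFF are deliberately accepted. */
bool
strict_utf8_valid(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t  b = buf[i];
        uint32_t cp;     /* decoded code point */
        uint32_t min;    /* smallest code point legal for this length */
        size_t   extra;  /* number of continuation bytes expected */

        if (b < 0x80)                { cp = b;        min = 0;       extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; min = 0x80;    extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; min = 0x800;   extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; min = 0x10000; extra = 3; }
        else { return false; }  /* stray continuation or invalid lead byte */

        if (extra > len - i - 1) { return false; }  /* truncated sequence */

        for (size_t j = 1; j <= extra; j++) {
            uint8_t c = buf[i + j];
            if ((c & 0xC0) != 0x80) { return false; }  /* not 10xxxxxx */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min)                     { return false; }  /* non-shortest form */
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; }  /* surrogate */
        if (cp > 0x10FFFF)                { return false; }  /* beyond Unicode */

        i += extra + 1;
    }
    return true;
}
```

      Because each surrogate is rejected as soon as it is decoded, paired and
      isolated surrogates fail in the same way; no pairing logic is needed.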

      Since Lucy is a library rather than an application, we should continue to
      pass through "noncharacter" code points such as U+FFFF, which are discouraged
      for "public exchange"[2] but allowed for internal application use (they may
      be useful as, e.g., sentinels or separators). These code points will be
      allowed to end up in indexes; it will be the responsibility of the
      application to filter them at input or output.
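
      An application that wants to filter noncharacters itself could use a
      predicate along these lines. This is a sketch only: is_noncharacter() is
      a hypothetical, application-side name and not something Lucy provides.
      It relies on the Unicode definition of the 66 noncharacters, namely
      U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-side helper: returns true if `cp` is one of
 * the 66 Unicode noncharacters. */
bool
is_noncharacter(uint32_t cp) {
    if (cp > 0x10FFFF)                { return false; }  /* not a code point */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) { return true;  }  /* contiguous block */
    return (cp & 0xFFFE) == 0xFFFE;   /* U+nFFFE and U+nFFFF in every plane */
}
```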

      [1] http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

      [2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf section 3.2, clause C2

      Attachments

        1. utf8_validation.patch
          14 kB
          Marvin Humphrey

          People

            Assignee: Marvin Humphrey
            Reporter: Marvin Humphrey