  1. Daffodil
  2. DAFFODIL-1979

UTF-8 decoder doesn't handle 3-byte and 4-byte characters correctly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: Back End
    • Labels: None

    Description

      The UTF-8 decoder is classifying some valid 3-byte and 4-byte characters as "overlong" and erroring out.

      The PNG schema in the DFDLSchemas GitHub organization has 1 test that runs into this bug on 3-byte Devanagari script characters.

      These bytes are 6 Devanagari characters: e0 a4 b6 e0 a5 80 e0 a4 b0 e0 a5 8d e0 a4 b7 e0 a4 95
      They should decode to: शीर्षक

      Instead, they are coming out as all substitution characters.
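
As a cross-check of the expected result, the bytes quoted above can be run through any reference UTF-8 decoder. The following is a minimal sketch using Python's built-in codec (not Daffodil's decoder); in strict mode it should accept all six sequences and report six code points in the Devanagari block.

```python
# Bytes quoted in the report: six 3-byte UTF-8 sequences.
data = bytes.fromhex("e0 a4 b6 e0 a5 80 e0 a4 b0 e0 a5 8d e0 a4 b7 e0 a4 95")

# Strict decoding raises UnicodeDecodeError on malformed or overlong input,
# so a successful decode means a reference decoder treats these bytes as valid.
text = data.decode("utf-8", errors="strict")

# List the decoded code points for comparison with the expected string.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(text, code_points)
```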

      In 3-byte UTF-8, at least one of the bits marked M below must be non-zero; otherwise the encoding is overlong. Note that one of these bits is in the second byte. This second byte wasn't being tested.

      1110MMMM 10Mxxxxx 10xxxxxx
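
The 3-byte bit pattern above can be turned into a small check. This is a sketch in Python, not Daffodil's actual code: the function name `is_overlong_3byte` is hypothetical, and it assumes the caller has already confirmed that the first byte is a 3-byte lead byte and the second is a continuation byte.

```python
def is_overlong_3byte(b0: int, b1: int) -> bool:
    """Return True if a 3-byte sequence starting with b0, b1 is overlong.

    Layout: 1110MMMM 10Mxxxxx 10xxxxxx. The sequence is overlong only
    when all five M bits are zero (the code point would be below U+0800).
    """
    m_first = b0 & 0x0F   # MMMM: low four bits of the lead byte
    m_second = b1 & 0x20  # M: top payload bit of the second byte
    return m_first == 0 and m_second == 0


# First character from the report: e0 a4 b6. The lead byte's M bits are all
# zero, but the M bit in the second byte is set, so the sequence is valid.
print(is_overlong_3byte(0xE0, 0xA4))  # False

# e0 9f .. would encode a code point below U+0800, so it is overlong.
print(is_overlong_3byte(0xE0, 0x9F))  # True
```

A check that looks only at the first byte would reject every sequence beginning with e0, which is consistent with the behavior described in this report.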

      In 4-byte UTF-8, at least one of the bits marked M below must be non-zero:

      11110MMM 10MMxxxx 10xxxxxx 10xxxxxx
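
The 4-byte case follows the same idea with five M bits split across the first two bytes. This is a sketch in Python, not Daffodil's actual code: the function name `is_overlong_4byte` is hypothetical, and it assumes the lead and continuation bytes have already been recognized as such.

```python
def is_overlong_4byte(b0: int, b1: int) -> bool:
    """Return True if a 4-byte sequence starting with b0, b1 is overlong.

    Layout: 11110MMM 10MMxxxx 10xxxxxx 10xxxxxx. The sequence is overlong
    only when all five M bits are zero (code point below U+10000).
    """
    m_first = b0 & 0x07   # MMM: low three bits of the lead byte
    m_second = b1 & 0x30  # MM: top two payload bits of the second byte
    return m_first == 0 and m_second == 0


# U+1F600 encodes as f0 9f 98 80: the lead byte's M bits are zero, but the
# second byte carries a non-zero M bit, so the sequence is valid.
print(is_overlong_4byte(0xF0, 0x9F))  # False

# f0 8f .. would encode a code point below U+10000, so it is overlong.
print(is_overlong_4byte(0xF0, 0x8F))  # True
```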

          People

            dfthompson Dave Thompson
            mbeckerle Mike Beckerle