  1. Daffodil
  2. DAFFODIL-931

Variable-width charset with 'replace' can result in wrong length calculations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: s12
    • Fix Version/s: 2.2.0
    • Component/s: Back End, General
    • Labels:
      None

      Description

      Consider a UTF-8 string with a single non-decodable byte in the middle.

      When we parse this string, the non-decodable byte contributes a Unicode replacement character (U+FFFD) to the decoded string.

      If you then call getBytes("utf-8") on the decoded string, you do not get the original length. The error contributes 3 bytes instead of 1, because U+FFFD takes 3 bytes in UTF-8.
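
      The mismatch can be reproduced outside Daffodil with plain Java. This is a minimal sketch, not Daffodil code: the class name ReplacementLengthDemo and the helper reencodedLength are hypothetical, and it assumes the JDK's default behavior of substituting U+FFFD for malformed UTF-8 input in new String(bytes, charset).

```java
import java.nio.charset.StandardCharsets;

class ReplacementLengthDemo {
    // Decode raw bytes as UTF-8 (malformed input becomes U+FFFD),
    // then re-encode the result and return the re-encoded byte count.
    static int reencodedLength(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', one byte that can never appear in UTF-8, 'b'
        byte[] raw = { 'a', (byte) 0xFF, 'b' };
        System.out.println("original bytes:   " + raw.length);           // 3
        System.out.println("re-encoded bytes: " + reencodedLength(raw)); // 5
    }
}
```

      The two extra bytes come entirely from the replacement character: the single bad byte 0xFF is re-encoded as EF BF BD.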

      Right now, for a variable-width encoding like UTF-8, we measure how far to move ahead in bytes by doing exactly the above: we call getBytes on the decoded string to find how long it was.

      This will cause us to move too far ahead into the data.
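
      One possible way to avoid the re-encoding step is to ask the decoder how many input bytes it actually consumed. The sketch below assumes java.nio.charset.CharsetDecoder is driven directly with the REPLACE error action; it is not necessarily how the fix in Daffodil works, and the class name ConsumedBytesSketch and the method decodeAndCount are hypothetical. Note that maxChars counts Java chars (UTF-16 code units), so a character outside the Basic Multilingual Plane would need two slots.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ConsumedBytesSketch {
    // Decodes at most maxChars chars from `in`, appends them to `out`,
    // and returns how many input bytes the decoder consumed.
    static int decodeAndCount(ByteBuffer in, int maxChars, StringBuilder out) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        int start = in.position();
        CharBuffer chars = CharBuffer.allocate(maxChars);
        // Stops when the char buffer is full or the input is exhausted;
        // the input position then marks the first unconsumed byte.
        decoder.decode(in, chars, true);
        chars.flip();
        out.append(chars);
        return in.position() - start;
    }

    public static void main(String[] args) {
        // 'a', bad byte, 'b', followed by a big-endian int with value 42
        ByteBuffer in = ByteBuffer.wrap(
            new byte[] { 'a', (byte) 0xFF, 'b', 0, 0, 0, 42 });
        StringBuilder text = new StringBuilder();
        int consumed = decodeAndCount(in, 3, text);
        System.out.println("bytes consumed: " + consumed);    // 3
        System.out.println("next int:       " + in.getInt()); // 42
    }
}
```

      Because the byte count comes from the input buffer's position rather than from the decoded text, a replaced byte counts as the one byte it occupied in the data.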

      A test case to illustrate this is TBD, but it isn't hard to put together: take a string as described above, with its length coming from an expression, and put it between two binary int fields. The binary int field after the string will not be parsed properly, because we will advance too far on the string.
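
      The data for such a test might look like the following sketch. It is an illustration in plain Java rather than a DFDL schema or TDML test; the class name Daffodil931LayoutSketch, the field values 1 and 42, the 4-byte big-endian int layout, and the 3-byte string are all assumptions chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Daffodil931LayoutSketch {
    // Layout: int32 (1) | 'a', 0xFF, 'b' | int32 (42), all big-endian.
    static byte[] buildData() {
        return ByteBuffer.allocate(11)
            .putInt(1)
            .put(new byte[] { 'a', (byte) 0xFF, 'b' })
            .putInt(42)
            .array();
    }

    // Offset of the trailing int if the string's byte length is measured
    // by re-encoding the decoded text, as the description says is done today.
    // The sketch knows the string occupies bytes 4..6 because it built the data.
    static int offsetByReencoding(byte[] data) {
        String decoded = new String(data, 4, 3, StandardCharsets.UTF_8);
        return 4 + decoded.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] data = buildData();
        int actual = 4 + 3;
        int measured = offsetByReencoding(data);
        System.out.println("actual offset of trailing int:  " + actual);   // 7
        System.out.println("offset measured by re-encoding: " + measured); // 9
        System.out.println("trailing int at actual offset:  "
            + ByteBuffer.wrap(data).getInt(actual));                       // 42
        System.out.println("bytes left at measured offset:  "
            + (data.length - measured));                                   // 2
    }
}
```

      With the measured offset, only 2 bytes remain for a 4-byte field, so the trailing int cannot be read correctly.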

            People

            • Assignee:
              slawrence Steve Lawrence
            • Reporter:
              mbeckerle Michael Beckerle