Xerces-C++ / XERCESC-1984

TranscodeToStr::transcode throws an exception when transcoding to UTF-8

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 4.0.0
    • Fix Version/s: 3.2.0
    • Component/s: Utilities
    • Environment:
      Reproducible on a Red Hat 5 based platform. The bug does not appear to be platform-specific, though.

      Description

      This issue relates to the bug fix for issue XERCESC-1947. There are still cases where the method throws an exception instead of returning a transcoded version of the string. See the attached "transtest2.cpp" to reproduce the issue.

      The cause seems to be the condition "if((allocSize - fBytesWritten) < (len - charsDone))" that was added to "TranscodeToStr::transcode". My test case uses a string of 6 Japanese characters ("絞り込み検索"). After the first call to "XMLUTF8Transcoder::transcodeTo", "charsRead" reports that 5 XMLCh were read. Since the initial buffer allocated for this string is 16 bytes and 15 of them have been written, the condition evaluates to "if((16 - 15) < (6 - 5))", i.e. "1 < 1", which is false, so the buffer is not reallocated to a larger size for the UTF-8 encoded version of the string.
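
      The arithmetic above can be checked with a small standalone sketch. The helper names (utf8Bytes, utf8Length, originalCheck) are hypothetical and not part of Xerces-C++; only the condition itself is taken from the report, and surrogate pairs are deliberately out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 bytes needed for one BMP code unit (surrogates are out of scope here).
std::size_t utf8Bytes(char16_t c) {
    return c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);
}

// Total UTF-8 bytes needed for a BMP-only UTF-16 string.
std::size_t utf8Length(const std::u16string& s) {
    std::size_t total = 0;
    for (char16_t c : s) total += utf8Bytes(c);
    return total;
}

// The check quoted in the report: true means "reallocate the buffer".
bool originalCheck(std::size_t allocSize, std::size_t bytesWritten,
                   std::size_t len, std::size_t charsDone) {
    assert(bytesWritten <= allocSize && charsDone <= len);
    return (allocSize - bytesWritten) < (len - charsDone);
}
```

      For the six-character test string, utf8Length gives 18 bytes (3 per character), which does not fit in 16, yet originalCheck(16, 15, 6, 5) is false because it compares remaining bytes against remaining characters as if each character needed one byte.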

      Since the reallocation doesn't take place, the code calls "XMLUTF8Transcoder::transcodeTo" again, but this time "charsRead" is set to 0 because there is insufficient space in the buffer, and this triggers an exception of type "Trans_BadSrcSeq".

      I suppose the goal of the added condition was to avoid unnecessary buffer reallocations, but unfortunately it doesn't work when transcoding to a variable-length encoding like UTF-8. The solution is probably to simply replace the condition with "if(charsDone < len)".
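
      To make the failure and the proposed fix concrete, here is a simplified simulation of the loop. It is a sketch under stated assumptions, not the Xerces-C++ source: encodeSome is a hypothetical stand-in for "XMLUTF8Transcoder::transcodeTo" that handles BMP characters only, transcodeToUtf8 is a hypothetical reduction of "TranscodeToStr::transcode", and the doubling growth policy is taken from the later comment in this issue.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for transcodeTo: encodes whole BMP characters
// while the next one still fits into the `cap` bytes that are left.
std::size_t encodeSome(const char16_t* src, std::size_t srcLen,
                       unsigned char* dst, std::size_t cap,
                       std::size_t& charsRead) {
    std::size_t out = 0;
    charsRead = 0;
    while (charsRead < srcLen) {
        const char16_t c = src[charsRead];
        const std::size_t need = c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);
        if (cap - out < need) break;  // next character does not fit
        if (need == 1) {
            dst[out++] = static_cast<unsigned char>(c);
        } else if (need == 2) {
            dst[out++] = static_cast<unsigned char>(0xC0 | (c >> 6));
            dst[out++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        } else {
            dst[out++] = static_cast<unsigned char>(0xE0 | (c >> 12));
            dst[out++] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            dst[out++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
        ++charsRead;
    }
    return out;
}

// Simplified loop. useOriginalCheck selects the condition quoted in the
// report; otherwise the proposed "charsDone < len" is used.
std::string transcodeToUtf8(const std::u16string& in, std::size_t allocSize,
                            bool useOriginalCheck) {
    std::vector<unsigned char> buf(allocSize);
    std::size_t bytesWritten = 0, charsDone = 0;
    const std::size_t len = in.size();
    while (charsDone < len) {
        std::size_t charsRead = 0;
        bytesWritten += encodeSome(in.data() + charsDone, len - charsDone,
                                   buf.data() + bytesWritten,
                                   allocSize - bytesWritten, charsRead);
        if (charsRead == 0) throw std::runtime_error("Trans_BadSrcSeq");
        charsDone += charsRead;
        const bool grow = useOriginalCheck
            ? (allocSize - bytesWritten) < (len - charsDone)
            : charsDone < len;
        if (grow) {
            allocSize *= 2;  // assumed growth policy: double the buffer
            buf.resize(allocSize);
        }
    }
    return std::string(buf.begin(), buf.begin() + bytesWritten);
}
```

      With the six-character string and a 16-byte buffer, the original check leaves the buffer at 16 bytes after the first pass and the second pass reads nothing, so the simulation throws; with "charsDone < len" the buffer is doubled and all 18 bytes are produced.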

      Regards,
      Dan

          Activity

          Alberto Massari added a comment -

          A fix is in SVN. Please verify.

          Lee Doron added a comment -

          The problem with "if(charsDone < len)" is that it increases (doubles) allocSize even if the transcoder exited for reasons other than running out of available output buffer space; that can happen if the input buffer ends with the leading character of a surrogate pair. Why increase it if there might be plenty of space?

          I suggest changing the conditional to:

          if(charsDone < len && (allocSize - fBytesWritten) < 4)

          This ensures that there are at least 4 bytes available, which is always enough to hold at least one more multi-byte character, so charsRead won't be 0 for lack of space. (I don't believe any encodings use more than 4 bytes for a character, right?)
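
          The difference between the two conditions can be written as a pair of predicates. This is only an illustration: the function names are hypothetical, and the 4-byte headroom is the assumption stated in this comment, not a property verified for every encoding.

```cpp
#include <cassert>
#include <cstddef>

// Grow whenever input remains, regardless of how much space is left.
bool growIfInputRemains(std::size_t charsDone, std::size_t len) {
    return charsDone < len;
}

// Grow only when input remains AND fewer than 4 output bytes are free.
bool growIfLowOnSpace(std::size_t allocSize, std::size_t bytesWritten,
                      std::size_t charsDone, std::size_t len) {
    assert(bytesWritten <= allocSize);
    return charsDone < len && (allocSize - bytesWritten) < 4;
}
```

          For the reported case (allocSize 16, 15 bytes written, 5 of 6 characters done) both predicates ask for a larger buffer. With a 64-byte buffer and the same progress, growIfInputRemains still doubles the allocation, while growIfLowOnSpace leaves it alone because 49 bytes are free.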

          Likewise, I'd change the corresponding conditional in TranscodeFromStr::transcode() from:

          if(((allocSize - fCharsWritten)*sizeof(XMLCh)) < (length - bytesDone))

          to:

          if(bytesDone < length && (allocSize - fCharsWritten) < 2)

          There's no reason to multiply by sizeof(XMLCh) here. However, we do need to make sure there's enough room for the largest representation we might get from a transcoder, which is 2 XMLCh entries (a surrogate pair).

          These could be simplified slightly by moving the entire "if" blocks to the very beginning of each loop. At that point, we know that "charsDone < len" (or, respectively, "bytesDone < length"), so we can leave out the first clause of each conditional. The reallocation will always be skipped the first time through the loop.
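
          A sketch of the restructured loop for the decoding direction, with the space check moved to the top. Everything here is hypothetical scaffolding: decodeSome stands in for a transcoder's transcodeFrom call and assumes well-formed UTF-8, transcodeFromUtf8 is a reduction of "TranscodeFromStr::transcode", and the initial allocation is assumed to be at least 2 XMLCh so that doubling always leaves room for a surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for transcodeFrom: decodes well-formed UTF-8 into
// UTF-16 code units while the next character fits into `cap` units.
std::size_t decodeSome(const unsigned char* src, std::size_t srcLen,
                       char16_t* dst, std::size_t cap,
                       std::size_t& bytesEaten) {
    std::size_t out = 0;
    bytesEaten = 0;
    while (bytesEaten < srcLen) {
        const unsigned char b = src[bytesEaten];
        const std::size_t n = b < 0x80 ? 1 : (b < 0xE0 ? 2 : (b < 0xF0 ? 3 : 4));
        const std::size_t units = (n == 4) ? 2 : 1;  // 4-byte forms need a pair
        if (srcLen - bytesEaten < n || cap - out < units) break;
        char32_t cp = (n == 1) ? b : (b & (0xFF >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (src[bytesEaten + i] & 0x3F);
        if (units == 2) {
            cp -= 0x10000;
            dst[out++] = static_cast<char16_t>(0xD800 + (cp >> 10));
            dst[out++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            dst[out++] = static_cast<char16_t>(cp);
        }
        bytesEaten += n;
    }
    return out;
}

std::u16string transcodeFromUtf8(const std::string& in, std::size_t allocSize) {
    assert(allocSize >= 2);  // assumption: room for one surrogate pair
    std::vector<char16_t> buf(allocSize);
    const unsigned char* src = reinterpret_cast<const unsigned char*>(in.data());
    const std::size_t length = in.size();
    std::size_t charsWritten = 0, bytesDone = 0;
    while (bytesDone < length) {
        // Check at the top: "bytesDone < length" is already known here.
        if (allocSize - charsWritten < 2) {
            allocSize *= 2;
            buf.resize(allocSize);
        }
        std::size_t bytesEaten = 0;
        charsWritten += decodeSome(src + bytesDone, length - bytesDone,
                                   buf.data() + charsWritten,
                                   allocSize - charsWritten, bytesEaten);
        if (bytesEaten == 0) throw std::runtime_error("Trans_BadSrcSeq");
        bytesDone += bytesEaten;
    }
    return std::u16string(buf.begin(), buf.begin() + charsWritten);
}
```

          Because the check runs before each call, the free space is measured in XMLCh units directly, and no multiplication by sizeof(XMLCh) is involved.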

          Dan PV added a comment -

          Test case.


            People

            • Assignee:
              Alberto Massari
              Reporter:
              Dan PV
            • Votes:
              1
              Watchers:
              4
