Xerces-C++
  1. Xerces-C++
  2. XERCESC-1947

XMLUTF8Transcoder::transcodeTo fails with an exception when transcoding single characters that require 3 or more bytes as UTF8.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.1.1
    • Fix Version/s: 3.2.0
    • Component/s: Utilities
    • Labels:
      None
    • Environment:
      Tested on mac os and debian linux. The failure is only manifest on v3.1.x

      Description

      This can be demonstrated with the following 2 lines of code.

      const XMLCh uval [] =

      { 0x254B, 0x0000}

      ; //BOX DRAWINGS HEAVY VERTICAL AND HORIZONTAL (needs 3 bytes for utf-8)
      char* uc = (char*)TranscodeToStr(uval,"UTF-8").adopt(); cout << uc << endl << flush; XMLString::release(&uc); //faulty exception;

      The error is: "terminate called after throwing an instance of 'xercesc_3_1::TranscodingException'"

      1. TransService.cpp.patch
        5 kB
        Lee Doron
      2. TransService.patch
        1 kB
        Ben Griffin
      3. transtest.cpp
        0.7 kB
        Ben Griffin

        Issue Links

          Activity

          Hide
          Ben Griffin added a comment -

          The sample file used to demonstrate / find the bug.

          Show
          Ben Griffin added a comment - The sample file used to demonstrate / find the bug.
          Hide
          Ben Griffin added a comment -

          XMLUTF8Transcoder::transcodeTo() names it's fifth character 'charsEaten', and returns the value of just how many characters were successfully transcoded before hitting the end-buffer.
          However, TranscodeToStr::transcode() calls transcodeTo with the same parameter named as 'charsRead', and expects a value greater than zero. This is clearly a mistake, as a single character that requires more than one byte will not be eaten, even though it was 'read'.

          As I see it, the Throw at line 624 of TransService.cpp is unnecessary. There are obviously cases where there are no characters 'read' because there isn't enough memory to read them yet. The Transservice should be able to rely upon the transcodeTo() method to handle exceptions, rather than just be 'surprised' at what it gets back.

          Therefore, there is no error when no bytes are 'read' - instead the memory should be increased and the transcoding should continue.

          OTOH, it seems reasonable for transcode() to test for a zero length string before callint transcodeTo().

          Just my humble 2¢.

          Show
          Ben Griffin added a comment - XMLUTF8Transcoder::transcodeTo() names it's fifth character 'charsEaten', and returns the value of just how many characters were successfully transcoded before hitting the end-buffer. However, TranscodeToStr::transcode() calls transcodeTo with the same parameter named as 'charsRead', and expects a value greater than zero. This is clearly a mistake, as a single character that requires more than one byte will not be eaten, even though it was 'read'. As I see it, the Throw at line 624 of TransService.cpp is unnecessary. There are obviously cases where there are no characters 'read' because there isn't enough memory to read them yet. The Transservice should be able to rely upon the transcodeTo() method to handle exceptions, rather than just be 'surprised' at what it gets back. Therefore, there is no error when no bytes are 'read' - instead the memory should be increased and the transcoding should continue. OTOH, it seems reasonable for transcode() to test for a zero length string before callint transcodeTo(). Just my humble 2¢.
          Hide
          Ben Griffin added a comment -

          This fix appears to work fine for me.

          Show
          Ben Griffin added a comment - This fix appears to work fine for me.
          Hide
          Boris Kolpackov added a comment -

          Ben, is this only a problem in TranscodeToStr or can you also get this by parsing/serializing XML? It is my understanding that this only affects TranscodeToStr.

          Show
          Boris Kolpackov added a comment - Ben, is this only a problem in TranscodeToStr or can you also get this by parsing/serializing XML? It is my understanding that this only affects TranscodeToStr.
          Hide
          Ben Griffin added a comment - - edited

          Hi Boris,

          I'm pretty sure that any serializer that uses TranscodeToStr::transcode(const XMLCh in, XMLSize_t len, XMLTranscoder trans) will have this problem when the nature of the encoding that the transcoder is for is such that characters have variable sizes, most especially when the number of bytes needed to transcode a character is greater than the number of bytes used by the existing encoding. The problem is most easily exposed by the patch. Essentially, the failure happens because there isn't enough memory given to return any bytes eaten - even though there is a need to eat them.

          So when using UCS2 -> UTF-8, there is no problem until you get to 3-byte or more UTF-8 encodings: characters larger than U+0x0800. When there is a single character to be transcoded then the initial allocSize is not going to be large enough to hold that one character, so the transcoder will return 0 'charsRead'.

          This error was exposed to me when querying attributes that were set with single character Unicode values from around U+2500.

          My code was doing something like...
          DOMAttr* enoda = enod->getAttributeNode(a_name);
          const XMLCh* x_attrval = enoda->getNodeValue();
          if (x_attrval != NULL && x_attrval[0] != 0 )

          { std::string attrval; char* value = (char*)TranscodeToStr(x_attrval,"UTF-8").adopt(); }

          I am not sure whether or not the supplied serializer uses TranscodeToStr in that sort of way - you are probably better informed than me about that.
          Maybe the component that I put the bug under shouldn't be 'Utilities' ? I'm not sure that I understand why you are interested in whether it affects parsing/serializing? It certainly affects being able to use TranscodeToStr::transcode(). I don't believe that the error is in XMLUTF8Transcoder::transcodeTo(), because AFAIK it doesn't have storage for semi-consumed characters. I believe that the error is with TranscodeToStr::transcode().

          Show
          Ben Griffin added a comment - - edited Hi Boris, I'm pretty sure that any serializer that uses TranscodeToStr::transcode(const XMLCh in, XMLSize_t len, XMLTranscoder trans) will have this problem when the nature of the encoding that the transcoder is for is such that characters have variable sizes, most especially when the number of bytes needed to transcode a character is greater than the number of bytes used by the existing encoding. The problem is most easily exposed by the patch. Essentially, the failure happens because there isn't enough memory given to return any bytes eaten - even though there is a need to eat them. So when using UCS2 - > UTF-8, there is no problem until you get to 3-byte or more UTF-8 encodings: characters larger than U+0x0800. When there is a single character to be transcoded then the initial allocSize is not going to be large enough to hold that one character, so the transcoder will return 0 'charsRead'. This error was exposed to me when querying attributes that were set with single character Unicode values from around U+2500. My code was doing something like... DOMAttr* enoda = enod->getAttributeNode(a_name); const XMLCh* x_attrval = enoda->getNodeValue(); if (x_attrval != NULL && x_attrval [0] != 0 ) { std::string attrval; char* value = (char*)TranscodeToStr(x_attrval,"UTF-8").adopt(); } I am not sure whether or not the supplied serializer uses TranscodeToStr in that sort of way - you are probably better informed than me about that. Maybe the component that I put the bug under shouldn't be 'Utilities' ? I'm not sure that I understand why you are interested in whether it affects parsing/serializing? It certainly affects being able to use TranscodeToStr::transcode(). I don't believe that the error is in XMLUTF8Transcoder::transcodeTo(), because AFAIK it doesn't have storage for semi-consumed characters. I believe that the error is with TranscodeToStr::transcode().
          Hide
          Boris Kolpackov added a comment -

          Ben, the reason I am interested in whether this affects parsing/serialization is because if it does then we have a critical issue and a bug-fix release is probably needed. If it only affects TranscodeToStr, which is only used in a few net accessors, then this is not critical (it is highly unlikely a URI will contain a three-byte UTF-8 sequence) and can wait until the normal release cycle. I think we both agree this only affects TranscodeToStr. Scheduling this bug for 3.1.2/3.2.0.

          Show
          Boris Kolpackov added a comment - Ben, the reason I am interested in whether this affects parsing/serialization is because if it does then we have a critical issue and a bug-fix release is probably needed. If it only affects TranscodeToStr, which is only used in a few net accessors, then this is not critical (it is highly unlikely a URI will contain a three-byte UTF-8 sequence) and can wait until the normal release cycle. I think we both agree this only affects TranscodeToStr. Scheduling this bug for 3.1.2/3.2.0.
          Hide
          Ben Griffin added a comment -

          Demoted priority. Boris, thanks for your explanation. Forgive me for misprioritising.

          Show
          Ben Griffin added a comment - Demoted priority. Boris, thanks for your explanation. Forgive me for misprioritising.
          Hide
          Lee Doron added a comment -

          More comprehensive fix.

          Show
          Lee Doron added a comment - More comprehensive fix.
          Hide
          Lee Doron added a comment -

          I've attached a patch that addresses this bug and another I discovered. First, with regard to the issue at hand, it seems to me that an empty string (len == 0) should be "transcoded", with the result being another zero-terminated empty string. Otherwise the caller has an undue burden to examine the string before attempting to transcode it. Also, the Throw at line 624 is warranted, in case the input XMLCh string is malformed (in my book, that includes having a premature zero before len characters). So, I avoid an early exit. Instead, I add enough space to allocSize for the 4 terminating zeroes, which has two beneficial effects – in some cases it avoids a reallocation, and it also guarantees enough space for at least one UTF-8 transcoded character, so we can safely keep the Throw. However, if the input string is empty, we just skip calling transcodeTo().

          I applied a similar fix to TranscodeFromStr::transcode(), and that's where I found an entirely different bug. When it needs to reallocate, it does a memcpy(newBuf, fString, fCharsWritten) to copy the existing partial string to the new, larger buffer. However, memcpy() takes a count in units of bytes, while fCharsWritten is a count of XMLCh! The call should be memcpy(newBuf, fString, fCharsWritten * sizeof(XMLCh)).

          I made a couple of other minor changes to improve readability and optimize.

          Show
          Lee Doron added a comment - I've attached a patch that addresses this bug and another I discovered. First, with regard to the issue at hand, it seems to me that an empty string (len == 0) should be "transcoded", with the result being another zero-terminated empty string. Otherwise the caller has an undue burden to examine the string before attempting to transcode it. Also, the Throw at line 624 is warranted, in case the input XMLCh string is malformed (in my book, that includes having a premature zero before len characters). So, I avoid an early exit. Instead, I add enough space to allocSize for the 4 terminating zeroes, which has two beneficial effects – in some cases it avoids a reallocation, and it also guarantees enough space for at least one UTF-8 transcoded character, so we can safely keep the Throw. However, if the input string is empty, we just skip calling transcodeTo(). I applied a similar fix to TranscodeFromStr::transcode(), and that's where I found an entirely different bug. When it needs to reallocate, it does a memcpy(newBuf, fString, fCharsWritten) to copy the existing partial string to the new, larger buffer. However, memcpy() takes a count in units of bytes, while fCharsWritten is a count of XMLCh! The call should be memcpy(newBuf, fString, fCharsWritten * sizeof(XMLCh)). I made a couple of other minor changes to improve readability and optimize.
          Hide
          Alberto Massari added a comment -

          A fix is in SVN; please verify

          Show
          Alberto Massari added a comment - A fix is in SVN; please verify
          Hide
          Ben Griffin added a comment -

          That certainly fixed the problem that I saw.
          I (for one) am happy that this is resolved.

          Thanks very much,

          Show
          Ben Griffin added a comment - That certainly fixed the problem that I saw. I (for one) am happy that this is resolved. Thanks very much,

            People

            • Assignee:
              Alberto Massari
              Reporter:
              Ben Griffin
            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development