Uploaded image for project: 'ActiveMQ C++ Client'
  1. ActiveMQ C++ Client
  2. AMQCPP-235

UTF8 length marshalling bug in openwire readString and writeString.

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 2.2.6, 3.0
    • Openwire
    • None
    • Windows XP / Visual Studio 2005

    • Patch Available

    Description

      In investigating a bug for the check "if( str->size() > 65536 )" which should be "if( str->size() > 65535 )" in writeString() , I found a couple of other problems:
      (1) The OpenwireStringSupport::readString method should read the utf8 length as an unsigned short rather than short. The problem is that utf8 encoded strings (using writeString) longer than 32768 will become truncated when read back using readString().
      (2) The writeString() method should also check the value of utflen after determining the UTF8 length of the encoded string, since with the support of characters greater than value 127, encodings of 2 UTF8 bytes per byte can exist.

      Attachments

        1. OpenwireStringSupport.cpp.patch
          2 kB
          Martin Schlapfer
        2. OpenwireStringSupportTest.cpp.patch
          2 kB
          Martin Schlapfer
        3. OpenwireStringSupportTest.h.patch
          0.7 kB
          Martin Schlapfer

        Activity

          Patch files for OpenwireStringSupport.cpp and associated test case OpenwireStringSupportTest.cpp and OpenwireStringSupportTest.h.

          mschlapf Martin Schlapfer added a comment - Patch files for OpenwireStringSupport.cpp and associated test case OpenwireStringSupportTest.cpp and OpenwireStringSupportTest.h.

          Having looked through some more of the code I still think there is a larger issue here that will need to be addressed. I'm going to play around with some other changes and test cases.

          Have you tried sending a TextMessage with a string that has ASCII values greater than 127, I'm betting that will fail to unmarshal correctly as well. This code in the String support class is currently only being used for properties.

          tabish Timothy A. Bish added a comment - Having looked through some more of the code I still think there is a larger issue here that will need to be addressed. I'm going to play around with some other changes and test cases. Have you tried sending a TextMessage with a string that has ASCII values greater than 127, I'm betting that will fail to unmarshal correctly as well. This code in the String support class is currently only being used for properties.

          With respect to a TextMessage, it looks like longer strings (>65535) are supported. The length is encoded with 4 bytes rather than 2 bytes.

          For character values greater than 127 contained in a text message:
          (1) C++ to C++ Client: the (un)marshalling works (except in the case of a null character), because there is no UTF8 encoding/decoding performed.
          (2) C++ to Java Client: the Java client throws a java.io.UTFDataFormatException when getText is called on the text message (at this point it is "unmarshalled" using UTF8 decoding, see java.org.apache.activemq.util.MarshallingSupport.readUTF8).

          So it looks like an implementation for UTF8 encoding/decoding for the TextMessage is necessary to be interoperable with the Java client with character values larger than 127.

          mschlapf Martin Schlapfer added a comment - With respect to a TextMessage, it looks like longer strings (>65535) are supported. The length is encoded with 4 bytes rather than 2 bytes. For character values greater than 127 contained in a text message: (1) C++ to C++ Client: the (un)marshalling works (except in the case of a null character), because there is no UTF8 encoding/decoding performed. (2) C++ to Java Client: the Java client throws a java.io.UTFDataFormatException when getText is called on the text message (at this point it is "unmarshalled" using UTF8 decoding, see java.org.apache.activemq.util.MarshallingSupport.readUTF8). So it looks like an implementation for UTF8 encoding/decoding for the TextMessage is necessary to be interoperable with the Java client with character values larger than 127.

          Yes, that's pretty much what I thought would happen. I'm reworking some of the code now, I should have this fixed in a few days. I will of course make sure to pick up the fixes from your patches as well.

          BTW - Thanks for adding more unit testing, I really appreciate it when folks take the time to add them, keeps regressions from popping up.

          tabish Timothy A. Bish added a comment - Yes, that's pretty much what I thought would happen. I'm reworking some of the code now, I should have this fixed in a few days. I will of course make sure to pick up the fixes from your patches as well. BTW - Thanks for adding more unit testing, I really appreciate it when folks take the time to add them, keeps regressions from popping up.

          Fixed in trunk and in the 2.x branch.

          The DataInput/OutputStream's have been modified to correctly write and read modified UTF-8 using the normal Java standard which encodes the length as an unsigned short value limiting the length to 65535 encoded bytes. The OpenWireStringSupport class has been modified to correctly read and write modified UTF-8 that encode the length as a signed integer which removes limitations previously set on the length of properties in the Messages to 65535 encoded bytes. The ActiveMQTextMessage has been modified to also use the OpenWireString support Method instead of its own version of the code so now all text should properly encode ASCII values from 0-255.

          Enhanced unit tests have also been added.

          You are welcome to review the code and comment.

          tabish Timothy A. Bish added a comment - Fixed in trunk and in the 2.x branch. The DataInput/OutputStream's have been modified to correctly write and read modified UTF-8 using the normal Java standard which encodes the length as an unsigned short value limiting the length to 65535 encoded bytes. The OpenWireStringSupport class has been modified to correctly read and write modified UTF-8 that encode the length as a signed integer which removes limitations previously set on the length of properties in the Messages to 65535 encoded bytes. The ActiveMQTextMessage has been modified to also use the OpenWireString support Method instead of its own version of the code so now all text should properly encode ASCII values from 0-255. Enhanced unit tests have also been added. You are welcome to review the code and comment.

          Hi Tim,
          I'm out this week on vacation (checking email periodically). Next week, I'll take a close look at the changes and provide any feedback. Thanks!

          • Martin.
          mschlapf Martin Schlapfer added a comment - Hi Tim, I'm out this week on vacation (checking email periodically). Next week, I'll take a close look at the changes and provide any feedback. Thanks! Martin.

          Tim,

          Great job. I like that you refactored the string utf8 marshalling into the Data[Input|Output]Stream class to align with Java and used the OpenwireStringSupport class to handle the large string utf8 marshalling. You also used modified utf8 encoding, very nice.

          Ideally the utf8 marshalling code could be contained in one spot regardless of integer (large string) or unsigned short length. This would avoid code duplication in the two locations and thus less chance for a bug to be introduced at some later point. One could do this, for example, by writing or reading the integer or unsigned short length accordingly, and then using a common piece of code to write / read the utf8 encoding. That's my only comment.

          Looks great!

          Cheers,
          Martin.

          mschlapf Martin Schlapfer added a comment - Tim, Great job. I like that you refactored the string utf8 marshalling into the Data [Input|Output] Stream class to align with Java and used the OpenwireStringSupport class to handle the large string utf8 marshalling. You also used modified utf8 encoding, very nice. Ideally the utf8 marshalling code could be contained in one spot regardless of integer (large string) or unsigned short length. This would avoid code duplication in the two locations and thus less chance for a bug to be introduced at some later point. One could do this, for example, by writing or reading the integer or unsigned short length accordingly, and then using a common piece of code to write / read the utf8 encoding. That's my only comment. Looks great! Cheers, Martin.

          Thanks for the input. Combining the code would have been ideal but I'm planning on splitting the decaf library out at some point into its own lib for use on some other projects so the AMQCPP client can't depend on that code so I kept them separate.

          tabish Timothy A. Bish added a comment - Thanks for the input. Combining the code would have been ideal but I'm planning on splitting the decaf library out at some point into its own lib for use on some other projects so the AMQCPP client can't depend on that code so I kept them separate.

          People

            tabish Timothy A. Bish
            mschlapf Martin Schlapfer
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: