Qpid / QPID-2452

Inconsistent handling on strings between C++ and Python messaging APIs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7
    • Component/s: C++ Client, Python Client
    • Labels:
      None

      Description

      Description of problem:

      This bug is in reference to the new messaging APIs.

      The handling of strings is different between the C++ and Python messaging APIs.
      The Python API assumes strings on-the-wire are UTF-8 encoded. The C++ API
      apparently uses raw, unencoded octet arrays.

      If a binary string of octets (with some octets > 0x7F) is encoded by a C++
      client and received by a Python client, the Python client will throw an
      exception.

      Version-Release number of selected component (if applicable):

      SVN revision 924529 and prior.

      How reproducible:

      100%

      Steps to Reproduce:
      1. Use the C++ API (qpid::messaging) to produce a map-message (using
      MapContent). One of the map entries should have a string value and should
      contain a sequence such as "!E\xf9\xf5\xdf\x89d\x011\xc0\xc8$7H\x99T"
      2. Use the python client to receive the message.

      Actual results:

      The Python client will throw an exception when it tries to UTF8-decode the
      string.
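The decoding failure can be reproduced without a broker. The following is a minimal sketch, assuming the receiver does nothing more than UTF-8-decode the payload; the names `payload` and `try_utf8` are illustrative and not part of either messaging API.

```python
# Octet sequence from the reproduction steps. It contains bytes such as
# 0xf9 that can never start a UTF-8 sequence.
payload = b"!E\xf9\xf5\xdf\x89d\x011\xc0\xc8$7H\x99T"


def try_utf8(data):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Hypothetical check: the reported payload cannot be decoded as UTF-8.
decoded = try_utf8(payload)
```

Here `decoded` ends up as `None`, which corresponds to the exception the Python client raises on receipt.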

      Expected results:

      Either Python should use raw-octet encoding or C++ should use UTF8. I expect
      the string seen by the Python receiver to be identical to that sent by the C++
      producer.

        Activity

        Rafael H. Schloming added a comment -

        The Python client doesn't assume anything about the strings on the wire; it just looks at the typecode. If the typecode says the data is UTF-8, it decodes it as UTF-8. If the typecode says it is raw binary, it decodes it as raw binary. From what you're describing, it sounds to me like we're somehow putting raw octets onto the wire but incorrectly tagging them as UTF-8.
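The typecode-driven behaviour described in this comment can be sketched roughly as follows. This is not the Python client's actual code: `decode_value` is a hypothetical helper, the value 0x90 for vbin16 is an assumption (only 0x95 for str16 is stated in this thread), and the 16-bit big-endian length prefix is assumed from the type names.

```python
import struct

STR16 = 0x95   # str16: UTF-8 text, per the discussion in this issue
VBIN16 = 0x90  # assumed type code for vbin16 (raw octets)


def decode_value(typecode, body):
    """Decode a value with a 16-bit length prefix according to its type code."""
    (length,) = struct.unpack(">H", body[:2])
    octets = body[2:2 + length]
    if typecode == STR16:
        # Raises UnicodeDecodeError if the sender tagged raw octets as str16.
        return octets.decode("utf-8")
    if typecode == VBIN16:
        # Raw binary is returned untouched.
        return octets
    raise ValueError("unsupported type code: 0x%02x" % typecode)
```

With this dispatch, the same octets succeed or fail purely depending on the tag the sender chose, which is the mismatch reported above.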

        Ted Ross added a comment -

        You are correct. The tagging on the wire is 0x95 (which is called str16 and is, according to the spec, utf8 encoded).

        The C++ API does not distinguish between str16 and vbin16. It only supplies a VAR_STRING type. So, if it is going to encode strings into str16 values, it must utf8 encode. Alternatively, it could use the vbin16 tag.

        -Ted

        Gordon Sim added a comment -

        The AMQP 0-10 codec now correctly looks at the encoding of a string-valued variant when deciding which type to encode it as. By default it sends the string as vbin16 (or vbin32 if it is too large for that). By setting the encoding to "utf8", "utf16" or "iso-8859-15", the user can alter that if desired. However, the user is responsible for ensuring that the string is indeed valid for the specified encoding. No conversion is done at present.
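The selection rule described in this comment might look roughly like the sketch below. It is not the C++ codec: `encode_string` is a hypothetical function, the type codes 0x90 (vbin16) and 0xA0 (vbin32) are assumptions, only the "utf8" case is modelled ("utf16" and "iso-8859-15" are omitted), and raising an error for an oversized "utf8" string is a choice made here, not documented behaviour.

```python
import struct

STR16 = 0x95   # str16, per the discussion in this issue
VBIN16 = 0x90  # assumed type code for vbin16
VBIN32 = 0xA0  # assumed type code for vbin32


def encode_string(octets, encoding=None):
    """Return (typecode, wire_bytes) for a string-valued variant.

    No validation or conversion is performed: the caller is responsible
    for making sure the octets match the declared encoding.
    """
    if encoding == "utf8":
        if len(octets) > 0xFFFF:
            raise ValueError("too long for str16")  # assumption of this sketch
        return STR16, struct.pack(">H", len(octets)) + octets
    if len(octets) <= 0xFFFF:
        # Default: raw octets with a 16-bit length prefix.
        return VBIN16, struct.pack(">H", len(octets)) + octets
    # Too large for vbin16: fall back to a 32-bit length prefix.
    return VBIN32, struct.pack(">I", len(octets)) + octets
```

Under this rule, an untagged binary string such as the one in the reproduction steps would travel as vbin16, so a receiver that dispatches on the type code would no longer attempt to UTF-8-decode it.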


          People

          • Assignee:
            Gordon Sim
            Reporter:
            Ted Ross