  1. Apache Avro
  2. AVRO-1190

C++ json parser fails to decode multibyte unicode code points


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.9.0
    • Component/s: c++
    • Labels: None

    Description

      The JSON parser in JsonIO.cc does not decode \u escapes for multibyte Unicode characters into any valid character encoding for a C++ std::string. The following snippet from JsonParser::tryString() has several flaws:

      1. sv is a std::string used as a vector, so each element is a single char
      2. a single Unicode hex quad encoded in JSON represents a 16-bit value, which does not fit in one char
      3. a Unicode hex quad can represent a "high surrogate", meaning that it must be combined with the following quad (the low surrogate) to derive the full Unicode code point
      4. \U is not a valid Unicode escape in JSON; only lowercase \u is (see http://www.ietf.org/rfc/rfc4627.txt)

      JsonIO.cc
                  case 'u':
                  case 'U':
                      {
                          unsigned int n = 0;
                          char e[4];
                          in_.readBytes(reinterpret_cast<uint8_t*>(e), 4);
                          for (int i = 0; i < 4; i++) {
                              n *= 16;
                              char c = e[i];
                              if (isdigit(c)) {
                                  n += c - '0';
                              } else if (c >= 'a' && c <= 'f') {
                                  n += c - 'a' + 10;
                              } else if (c >= 'A' && c <= 'F') {
                                  n += c - 'A' + 10;
                              } else {
                                  throw unexpected(c);
                              }
                          }
                          sv.push_back(n);
                      }
      

      The loop decodes the quad into a temporary unsigned int and then simply pushes that value (which may need 16 bits) onto the std::string, where it is truncated to a single char. In effect, the JSON parser cannot correctly decode any escaped code point above \u00FF. For example, this JSON string:

      "Dress up if you dare! Free cover all night! \uD83C\uDF83\uD83D\uDC7B"
      

      results in the following decoded byte sequence for the last four \u escapes (plus the terminating NUL):

      3C 83 3D 7B 00
      

      where you can see that the high-order byte of each quad is simply dropped. In this particular example, \uD83C is a high surrogate, which requires additional handling. I am not sure what encoding users of the C++ library expect, but given that we are working with JSON and that Avro C++ uses char rather than wchar_t, I would assume they expect a UTF-8 encoded string. However, I could be wrong. Many decoders handle this string properly; I found this one helpful while implementing a fix: http://rishida.net/tools/conversion/
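
The truncation can be reproduced in isolation. The following is a minimal sketch, not code from Avro: pushTruncated is a hypothetical helper that mimics only the sv.push_back(n) step of the snippet, narrowing each 16-bit value to one char.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Avro): mimics sv.push_back(n) from the
// snippet, where n holds a decoded 16-bit hex quad.
std::string pushTruncated(const std::vector<unsigned int>& quads) {
    std::string sv;
    for (unsigned int n : quads) {
        // Narrowing to char keeps only the low-order 8 bits of n.
        sv.push_back(static_cast<char>(n));
    }
    return sv;
}
```

Feeding it the four quads from the example (0xD83C, 0xDF83, 0xD83D, 0xDC7B) should leave only the low-order bytes 3C 83 3D 7B in the string, matching the sequence shown above.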

      For the basics of UTF-8, see http://www.utf-8.com/
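
Assuming UTF-8 is the desired output encoding, one possible shape for the missing handling is sketched below. This is not the fix that was committed for this issue; combineSurrogates and appendUtf8 are hypothetical names introduced only for illustration. The sketch combines a high and low surrogate into one code point and then appends that code point to a std::string as UTF-8.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helpers for illustration; none of these are Avro APIs.
inline bool isHighSurrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(uint32_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a UTF-16 surrogate pair (two decoded hex quads) into a code point.
uint32_t combineSurrogates(uint32_t hi, uint32_t lo) {
    if (!isHighSurrogate(hi) || !isLowSurrogate(lo)) {
        throw std::invalid_argument("invalid surrogate pair");
    }
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

// Append one Unicode code point to out, encoded as 1 to 4 UTF-8 bytes.
void appendUtf8(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (isHighSurrogate(cp) || isLowSurrogate(cp)) {
            // A lone surrogate is not a valid code point to encode.
            throw std::invalid_argument("unpaired surrogate");
        }
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        throw std::invalid_argument("code point out of range");
    }
}
```

With these helpers, the pair \uD83C\uDF83 from the example combines to U+1F383, and appendUtf8 emits the bytes F0 9F 8E 83. In a parser, the caller would read a second \u quad whenever the first one is a high surrogate, and reject a quad that is a low surrogate on its own.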

            People

              Assignee: thiru_mg Thiruvalluvan M. G.
              Reporter: kehli Keh-Li Sheng