  Apache Cassandra / CASSANDRA-19537

Unicode Code Points incorrectly sized in protocol response


Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: Normal
    • Resolution: Unresolved
    • Component/s: CQL/Interpreter
    • Platform: All

    Description

      Within a query, we sent the character U+10FFFF, the highest permissible Unicode code point. It is encoded in UTF-8 as 4 bytes. When the query produces a warning in the response (such as a tombstone warning, which echoes the query text), the warning string in the protocol is specified as a [short] length followed by the string bytes.
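
      To make the framing concrete, here is a minimal sketch (my own illustration, not Cassandra's CBUtil code; the names StringFraming and writeString are made up for the example) of writing a [short]-length-prefixed string. The point is simply that the prefix must equal the actual UTF-8 byte count:

      import java.nio.ByteBuffer;
      import java.nio.charset.StandardCharsets;

      // Illustration only: frame a protocol [string] as a [short] byte length
      // followed by the UTF-8 bytes themselves.
      public class StringFraming
      {
        static ByteBuffer writeString(String s)
        {
          byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
          ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length);
          buf.putShort((short) utf8.length);  // length prefix = actual UTF-8 byte count
          buf.put(utf8);                      // string body
          buf.flip();
          return buf;
        }

        public static void main(String[] args)
        {
          // U+10FFFF encodes to 4 UTF-8 bytes, so the prefix here is 4.
          ByteBuffer framed = writeString(new String(Character.toChars(0x10FFFF)));
          System.out.println(framed.getShort());  // 4
        }
      }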

       

      CBUtil.writeString gets the length using the following code:

      int length = TypeSizes.encodedUTF8Length(str);

      This in turn computes the length of the string using the following calculation:

      public static int encodedUTF8Length(String st)
      {
        int strlen = st.length();
        int utflen = 0;
        for (int i = 0; i < strlen; i++)
        {
          int c = st.charAt(i);
          if ((c >= 0x0001) && (c <= 0x007F))
            utflen++;
          else if (c > 0x07FF)
            utflen += 3;
          else
            utflen += 2;
        }
        return utflen;
      }

      The use of st.length() within this function causes problems: it treats the string as UTF-16, so the 4-byte UTF-8 code point becomes a surrogate pair of two UTF-16 chars, each of which is a high value counted as 3 bytes, making a total length of 6 bytes instead of 4.
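
      As a sketch of the underlying issue (not the project's actual fix; the method name encodedUTF8LengthByCodePoint is made up for illustration), counting by code point instead of by UTF-16 char gives the correct 4-byte size for supplementary characters:

      // Illustration only: iterate by code point so supplementary characters
      // (above U+FFFF, i.e. surrogate pairs in UTF-16) are counted as 4 bytes.
      public static int encodedUTF8LengthByCodePoint(String st)
      {
        int utflen = 0;
        for (int i = 0; i < st.length(); )
        {
          int cp = st.codePointAt(i);
          if (cp >= 0x0001 && cp <= 0x007F)
            utflen += 1;                  // ASCII
          else if (cp > 0xFFFF)
            utflen += 4;                  // supplementary plane
          else if (cp > 0x07FF)
            utflen += 3;                  // rest of the BMP
          else
            utflen += 2;                  // 2-byte range (U+0000 kept at 2, mirroring the original)
          i += Character.charCount(cp);
        }
        return utflen;
      }

      For strings that do not contain U+0000, st.getBytes(StandardCharsets.UTF_8).length gives the same count, at the cost of materialising the byte array.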
       
      Using some test code (runnable in jshell, for example):

      import java.nio.charset.StandardCharsets;

      byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
      var st = new String(utf8Bytes, StandardCharsets.UTF_8);
      System.out.println(st);
      int strlen = st.length();
      System.out.println(strlen);     // 2: the UTF-16 length of the surrogate pair
      int utflen = 0;
      for (int i = 0; i < strlen; i++)
      {
        int c = st.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F))
          utflen++;
        else if (c > 0x07FF)
          utflen += 3;
        else
          utflen += 2;
      }
      System.out.println(utflen);     // 6: each surrogate char counted as 3 bytes
      byte[] roundTrip = st.getBytes(StandardCharsets.UTF_8);
      for (byte b : roundTrip)        // 244 143 191 191: the actual 4 UTF-8 bytes
        System.out.printf("%d ", b & 0xFF);
      

      The 4-byte UTF-8 sequence is seen by st.length() as 2 chars; charAt() then returns the surrogate values 56319 and 57343 respectively, and since each is above 2047 (0x07FF), 3 is added to the length each time.

      At the byte level the response message does correctly contain the UTF-8 character as 244 143 191 191, but the incorrect length results in a buffer overread, which offsets the following reads. This produces a few different possible errors, all relating to misalignment between the buffer read position and the expected value at that point in the buffer.
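
      To illustrate the misalignment, here is a self-contained sketch using a hypothetical frame layout (a [short]-prefixed string followed by two int fields; the class name OverreadDemo is made up), not the actual response message:

      import java.nio.ByteBuffer;
      import java.nio.charset.StandardCharsets;

      // Declaring 6 bytes while only 4 are written makes the reader consume
      // 2 bytes of the following field, so every later read is offset.
      public class OverreadDemo
      {
        public static void main(String[] args)
        {
          String st = new String(Character.toChars(0x10FFFF));  // U+10FFFF
          byte[] utf8 = st.getBytes(StandardCharsets.UTF_8);    // 4 bytes: 244 143 191 191

          ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length + 8);
          buf.putShort((short) 6);   // length computed per UTF-16 char: wrong
          buf.put(utf8);             // only 4 bytes are actually written
          buf.putInt(42);            // the next field in the frame
          buf.putInt(7);
          buf.flip();

          byte[] body = new byte[buf.getShort()];  // reader trusts the declared length of 6
          buf.get(body);                           // over-reads 2 bytes into the next field
          System.out.println(buf.getInt());        // misaligned: no longer reads 42
        }
      }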

       

      The issue was found specifically in 4.1, but appears to have existed for a while; it is triggered specifically by operating outside the UTF-16 BMP range, in the higher (supplementary) planes.

    People

      Assignee: Unassigned
      Reporter: Andrew Hogg (adhogg)
      Votes: 0
      Watchers: 3


    Time Tracking

      Original Estimate: Not Specified
      Remaining Estimate: 0h
      Time Spent: 10m