Details
- Type: Bug
- Status: Triage Needed
- Priority: Normal
- Resolution: Unresolved
- Platform: All
Description
Within a query, we have sent in a character which is \U0010FFFF - the highest permissible Unicode code point. This is encoded in UTF-8 using 4 bytes and sent. When the query issues a warning in the response (such as a tombstone warning, which includes the query sent), the warning string in the protocol is specified as a short length, followed by the string bytes.
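The wire format described above can be sketched as follows; this is a minimal illustration of the short-length-plus-UTF-8-bytes framing, not Cassandra's actual CBUtil code (the class name StringFrame is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StringFrame {
    // Protocol [string]: a 16-bit big-endian length, then that many UTF-8 bytes.
    public static ByteBuffer writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length);
        buf.putShort((short) utf8.length); // length MUST match the bytes actually written
        buf.put(utf8);
        buf.flip();
        return buf;
    }
}
```

For \U0010FFFF this produces a 6-byte frame: a length of 4 followed by the 4 UTF-8 bytes.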
CBUtil.writeString gets the length using the following code:
int length = TypeSizes.encodedUTF8Length(str);
This in turn computes the length of the string with the following calculation:

public static int encodedUTF8Length(String st)
{
    int strlen = st.length();
    int utflen = 0;
    for (int i = 0; i < strlen; i++)
    {
        int c = st.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F))
            utflen++;
        else if (c > 0x07FF)
            utflen += 3;
        else
            utflen += 2;
    }
    return utflen;
}
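A version that iterates by code point rather than by char would count supplementary-plane characters correctly. This is a sketch of the idea, not a proposed patch; note it counts U+0000 as 1 byte (matching standard UTF-8 as produced by String.getBytes), whereas the original's 2-byte NUL is DataOutputStream-style modified UTF-8:

```java
public class Utf8Len {
    // Correct UTF-8 length: iterate by Unicode code point, not by UTF-16 char,
    // so a surrogate pair is counted once as a 4-byte sequence.
    public static int encodedUTF8Length(String st) {
        int utflen = 0;
        for (int i = 0; i < st.length(); ) {
            int cp = st.codePointAt(i);
            if (cp <= 0x7F)        utflen += 1;
            else if (cp <= 0x7FF)  utflen += 2;
            else if (cp <= 0xFFFF) utflen += 3;
            else                   utflen += 4; // supplementary plane: 2 chars, 4 UTF-8 bytes
            i += Character.charCount(cp);
        }
        return utflen;
    }
}
```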
The use of st.length() within this function causes the problem: it treats the string as UTF-16 code units, so the single character encoded as 4 UTF-8 bytes becomes a surrogate pair of two chars. Both surrogate values are above 0x07FF and are each counted as 3 bytes, giving a computed length of 6 bytes instead of the correct 4.
Using some test code:
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
var st = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(st);

int strlen = st.length();
System.out.println(strlen); // 2

int utflen = 0;
for (int i = 0; i < strlen; i++)
{
    int c = st.charAt(i);
    if ((c >= 0x0001) && (c <= 0x007F))
        utflen++;
    else if (c > 0x07FF)
        utflen += 3;
    else
        utflen += 2;
}
System.out.println(utflen); // 6

byte[] roundTripped = st.getBytes(StandardCharsets.UTF_8); // renamed: utf8Bytes is already declared above
for (byte b : roundTripped)
    System.out.printf("%d ", b & 0xFF); // 244 143 191 191
The 4-byte UTF-8 character is seen by st.length() as 2 chars; the two UTF-16 code units have the values 56319 (0xDBFF) and 57343 (0xDFFF) respectively, and since both are above 2047 (0x07FF), 3 is added to the length for each.
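The surrogate values can be checked directly with the standard library (the class name Surrogates is illustrative only):

```java
public class Surrogates {
    // Returns the UTF-16 code units Java uses to represent the given code point.
    // Supplementary-plane code points (above U+FFFF) decompose into a surrogate pair.
    public static int[] surrogatesOf(int codePoint) {
        char[] chars = Character.toChars(codePoint);
        int[] out = new int[chars.length];
        for (int i = 0; i < chars.length; i++)
            out[i] = chars[i];
        return out;
    }
}
```

For U+10FFFF this yields exactly the two values quoted above: 56319 (0xDBFF) and 57343 (0xDFFF).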
At the byte level, the response message does correctly contain the UTF-8 character as 244 143 191 191, but the incorrect declared length (6 instead of 4) causes the client to overread the buffer, offsetting all subsequent reads. This results in a few different possible errors, all relating to the misalignment between the buffer read position and the value expected at that point in the buffer.
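The misalignment can be reproduced with a small simulation of the framing (a hypothetical buffer layout, not the actual message codec):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OverreadDemo {
    // The writer declares 6 bytes (the buggy computed length) but writes only 4,
    // so a reader that trusts the declared length consumes 2 bytes of the next field.
    public static int remainingAfterBuggyRead() {
        byte[] utf8 = new String(Character.toChars(0x10FFFF)).getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length + 2);
        buf.putShort((short) 6);      // buggy declared length: 3 bytes per surrogate char
        buf.put(utf8);                // only 4 bytes actually written
        buf.putShort((short) 0x1234); // the next field in the message
        buf.flip();

        int len = buf.getShort();     // reader trusts the declared length: 6
        byte[] str = new byte[len];
        buf.get(str);                 // overreads 2 bytes into the next field
        return buf.remaining();       // nothing left where a 2-byte short was expected
    }
}
```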
Issue specifically found in 4.1, but it appears to have existed for a while - it occurs specifically when operating outside the UTF-16 BMP range, i.e. with characters in the supplementary planes.