Details
- Type: Bug
- Status: Triage Needed
- Priority: Normal
- Resolution: Unresolved
- Platform: All
Description
Within a query, we have sent in a character which is \U0010FFFF - the highest permissible Unicode code point. This is encoded in UTF-8 using 4 bytes and sent. When the query issues a warning in the response (such as a tombstone warning, which includes the query sent), the warning string in the protocol is specified as a short length, followed by the string bytes.
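The wire format described above can be sketched as follows; this is a minimal illustration of the short-length-plus-UTF-8-bytes framing, not Cassandra's actual CBUtil code (the class name StringFrame is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StringFrame {
    // Protocol [string]: a 16-bit big-endian length, then that many UTF-8 bytes.
    public static ByteBuffer writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length);
        buf.putShort((short) utf8.length); // length MUST match the bytes actually written
        buf.put(utf8);
        buf.flip();
        return buf;
    }
}
```

For \U0010FFFF this produces a 6-byte frame: a length of 4 followed by the 4 UTF-8 bytes.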
CBUtil.writeString gets the length using the following code:
int length = TypeSizes.encodedUTF8Length(str);
This in turn computes the length of the string with the following calculation:

public static int encodedUTF8Length(String st)
{
    int strlen = st.length();
    int utflen = 0;
    for (int i = 0; i < strlen; i++)
    {
        int c = st.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F))
            utflen++;
        else if (c > 0x07FF)
            utflen += 3;
        else
            utflen += 2;
    }
    return utflen;
}
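A version that iterates by code point rather than by char would count supplementary-plane characters correctly. This is a sketch of the idea, not a proposed patch; note it counts U+0000 as 1 byte (matching standard UTF-8 as produced by String.getBytes), whereas the original's 2-byte NUL is DataOutputStream-style modified UTF-8:

```java
public class Utf8Len {
    // Correct UTF-8 length: iterate by Unicode code point, not by UTF-16 char,
    // so a surrogate pair is counted once as a 4-byte sequence.
    public static int encodedUTF8Length(String st) {
        int utflen = 0;
        for (int i = 0; i < st.length(); ) {
            int cp = st.codePointAt(i);
            if (cp <= 0x7F)        utflen += 1;
            else if (cp <= 0x7FF)  utflen += 2;
            else if (cp <= 0xFFFF) utflen += 3;
            else                   utflen += 4; // supplementary plane: 2 chars, 4 UTF-8 bytes
            i += Character.charCount(cp);
        }
        return utflen;
    }
}
```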
The use of st.length() within this function causes the problem: it treats the string as UTF-16 code units, so the single character encoded as 4 UTF-8 bytes becomes a surrogate pair of two chars. Both surrogate values are above 0x07FF and are each counted as 3 bytes, giving a computed length of 6 bytes instead of the correct 4.
Using some test code:
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
var st = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(st);

int strlen = st.length();
System.out.println(strlen); // 2

int utflen = 0;
for (int i = 0; i < strlen; i++)
{
    int c = st.charAt(i);
    if ((c >= 0x0001) && (c <= 0x007F))
        utflen++;
    else if (c > 0x07FF)
        utflen += 3;
    else
        utflen += 2;
}
System.out.println(utflen); // 6

byte[] roundTripped = st.getBytes(StandardCharsets.UTF_8); // renamed: utf8Bytes is already declared above
for (byte b : roundTripped)
    System.out.printf("%d ", b & 0xFF); // 244 143 191 191
The 4-byte UTF-8 character is seen by st.length() as 2 chars; the two UTF-16 code units have the values 56319 (0xDBFF) and 57343 (0xDFFF) respectively, and since both are above 2047 (0x07FF), 3 is added to the length for each.
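The surrogate values can be checked directly with the standard library (the class name Surrogates is illustrative only):

```java
public class Surrogates {
    // Returns the UTF-16 code units Java uses to represent the given code point.
    // Supplementary-plane code points (above U+FFFF) decompose into a surrogate pair.
    public static int[] surrogatesOf(int codePoint) {
        char[] chars = Character.toChars(codePoint);
        int[] out = new int[chars.length];
        for (int i = 0; i < chars.length; i++)
            out[i] = chars[i];
        return out;
    }
}
```

For U+10FFFF this yields exactly the two values quoted above: 56319 (0xDBFF) and 57343 (0xDFFF).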
At the byte level, the response message does correctly contain the UTF-8 character as 244 143 191 191, but the incorrect declared length (6 instead of 4) causes the client to overread the buffer, offsetting all subsequent reads. This results in a few different possible errors, all relating to the misalignment between the buffer read position and the value expected at that point in the buffer.
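The misalignment can be reproduced with a small simulation of the framing (a hypothetical buffer layout, not the actual message codec):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OverreadDemo {
    // The writer declares 6 bytes (the buggy computed length) but writes only 4,
    // so a reader that trusts the declared length consumes 2 bytes of the next field.
    public static int remainingAfterBuggyRead() {
        byte[] utf8 = new String(Character.toChars(0x10FFFF)).getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length + 2);
        buf.putShort((short) 6);      // buggy declared length: 3 bytes per surrogate char
        buf.put(utf8);                // only 4 bytes actually written
        buf.putShort((short) 0x1234); // the next field in the message
        buf.flip();

        int len = buf.getShort();     // reader trusts the declared length: 6
        byte[] str = new byte[len];
        buf.get(str);                 // overreads 2 bytes into the next field
        return buf.remaining();       // nothing left where a 2-byte short was expected
    }
}
```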
Issue specifically found in 4.1, but it appears to have existed for a while - it occurs specifically when operating outside the UTF-16 BMP range, i.e. with characters in the supplementary planes.