Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Not A Problem
Description
String statistics are collected using Text, which compares values as raw bytes that are assumed to be UTF-8. When the input contains invalid UTF-8 sequences and one of those values is the minimum or maximum, the writer converts it to a java.lang.String, which replaces each invalid sequence with the replacement character (U+FFFD). This conversion happens here:
To work around this issue, the writer should use the `setMinimumBytes` Protocol Buffers API instead.
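As a minimal, self-contained illustration of the lossy conversion described above (this is not the writer's code; the class and method names are hypothetical), the sketch below round-trips an invalid UTF-8 byte through java.lang.String using only the JDK. It also shows why this matters for a maximum: under unsigned byte comparison the invalid value sorts above a valid four-byte sequence, but its converted form sorts below it. It assumes Java 9 or later for Arrays.compareUnsigned.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class InvalidUtf8RoundTrip {

    // Decode bytes as UTF-8 into a String and encode them again.
    // Malformed input is silently replaced with U+FFFD by the JDK decoder.
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', an invalid lone byte 0xFF, then 'b'.
        byte[] raw = {(byte) 0x61, (byte) 0xFF, (byte) 0x62};
        byte[] back = roundTrip(raw);
        System.out.println(Arrays.toString(raw));   // [97, -1, 98]
        System.out.println(Arrays.toString(back));  // [97, -17, -65, -67, 98]

        // A would-be maximum consisting of one invalid byte, and a valid
        // four-byte UTF-8 sequence (U+1F600) to compare it against.
        byte[] invalidMax = {(byte) 0xFF};
        byte[] valid = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};

        // Raw bytes: 0xFF sorts above 0xF0.
        System.out.println(Arrays.compareUnsigned(invalidMax, valid) > 0);  // true
        // After conversion: EF BF BD sorts below F0 9F 98 80.
        System.out.println(
                Arrays.compareUnsigned(roundTrip(invalidMax), valid) < 0);  // true
    }
}
```

Passing the original bytes through unchanged, as a bytes-based setter would allow, avoids the substitution entirely.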
The same issue exists on the read path, where the bytes are round-tripped through java.lang.String. The read code is here: