Details
Type: Sub-task
Status: Closed
Priority: Minor
Resolution: Fixed
Description
As discussed in SOLR-7927, we can reduce the buffer memory allocated by JavaBinCodec while writing large strings.
The maximum Unicode code point (as of Unicode 8 anyway) is U+10FFFF (http://www.unicode.org/glossary/#code_point). This is encoded in UTF-16 as the surrogate pair \uDBFF\uDFFF, which takes up two Java chars, and is represented in UTF-8 as the 4-byte sequence F4 8F BF BF. This is likely where the mistaken 4-bytes-per-Java-char formulation came from: the maximum number of UTF-8 bytes required to represent a Unicode code point is 4, but such a code point occupies two Java chars, not one.
The maximum Java char is \uFFFF, which is represented in UTF-8 as the 3-byte sequence EF BF BF.
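The two byte counts above can be checked with the JDK's own UTF-8 encoder. This is only a minimal sketch: the class name Utf8Width and the helper utf8Length are made up for illustration and are not part of JavaBinCodec.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    /** Hypothetical helper: number of bytes the JDK's UTF-8 encoder produces for s. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+10FFFF, the highest code point; stored in a String as a surrogate pair.
        String maxCodePoint = new String(Character.toChars(0x10FFFF));
        // The highest value a single Java char can hold.
        String maxChar = "\uFFFF";

        // prints "2 chars, 4 bytes"
        System.out.println(maxCodePoint.length() + " chars, " + utf8Length(maxCodePoint) + " bytes");
        // prints "1 chars, 3 bytes"
        System.out.println(maxChar.length() + " chars, " + utf8Length(maxChar) + " bytes");
    }
}
```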
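A surrogate pair spends its 4 UTF-8 bytes across 2 Java chars, i.e. 2 bytes per char, so the worst case per char is the 3 bytes needed for chars up to \uFFFF. So I think it's safe to switch to using 3 bytes per Java char (the unit of measurement returned by String.length()), like CompressingStoredFieldsWriter.writeField() does.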
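As a rough illustration of the proposed bound (not the actual JavaBinCodec or CompressingStoredFieldsWriter code), the worst-case buffer size could be computed as below. The class Utf8Bound and the methods maxUtf8Bytes and fits are hypothetical names introduced only for this sketch.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bound {
    /** Hypothetical helper: worst-case UTF-8 size of s, assuming 3 bytes per Java char. */
    static int maxUtf8Bytes(String s) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(s.length(), 3);
    }

    /** True if the JDK's UTF-8 encoding of s fits within the 3-bytes-per-char bound. */
    static boolean fits(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length <= maxUtf8Bytes(s);
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x10FFFF));
        System.out.println(fits("plain ascii"));
        System.out.println(fits("\uFFFF\uFFFF\uFFFF"));
        System.out.println(fits(supplementary + supplementary));
    }
}
```

Compared with a 4-bytes-per-char estimate, this bound allocates a quarter less buffer for the same String.length().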