Affects Version/s: 2.8.1
Fix Version/s: None
In Apache Hive, VARCHAR fields are much slower than STRING fields when a precision (a cap on string length) is specified. Keep in mind that this precision counts the characters in the UTF-8 encoded string, not its bytes, so enforcing it requires decoding the data.
The general procedure for enforcing the precision is:
- Load an entire byte buffer into a Text object
- Convert it to a String
- Count off N character code points
- Substring the String at the correct place
- Convert the String back into a byte array and populate the Text object
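The steps above can be sketched roughly as follows. This is only an illustration of the round trip, not Hive's actual code: a plain byte array plus a length stands in for the Text object's backing buffer, and the class and method names (RoundTripTruncate, truncate) are hypothetical. It assumes well-formed UTF-8 and a non-negative cap.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating the decode / substring / re-encode round trip.
final class RoundTripTruncate {

    // utf8/length stand in for a Text object's buffer and logical length (assumption).
    static byte[] truncate(byte[] utf8, int length, int maxCodePoints) {
        // Copy 1: decode the whole buffer into a String.
        String s = new String(utf8, 0, length, StandardCharsets.UTF_8);

        // Count code points; nothing to cut if the cap is not exceeded.
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return Arrays.copyOf(utf8, length);
        }

        // Find the char index where the first maxCodePoints code points end.
        int end = s.offsetByCodePoints(0, maxCodePoints);

        // Copies 2 and 3: substring, then re-encode to a fresh byte array.
        return s.substring(0, end).getBytes(StandardCharsets.UTF_8);
    }
}
```

Every value that exceeds the cap pays for a full decode, a substring, and a re-encode, even though only the byte offset of the cut point is actually needed.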
It would be great if the Text object could offer a truncate/substring method based on character count that did not require copying data around. Along the same lines, a "getCharacterLength()" method may also be useful to determine if the precision has been exceeded.
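One possible shape for such methods is sketched below. It is an assumption about how this could work, not an existing Text API: the class Utf8Lengths and both method names are hypothetical, and a raw byte array with start and length stands in for the Text buffer. The sketch relies on the fact that in well-formed UTF-8 every byte of the form 10xxxxxx is a continuation byte, so counting the other bytes counts code points without decoding. It does not validate the input.

```java
// Hypothetical sketch: count and truncate by code point directly on UTF-8 bytes.
final class Utf8Lengths {

    // True for bytes that start a code point (anything except 10xxxxxx).
    private static boolean startsCodePoint(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // Number of code points in utf8[start, start + length), assuming well-formed UTF-8.
    static int characterLength(byte[] utf8, int start, int length) {
        int count = 0;
        for (int i = start; i < start + length; i++) {
            if (startsCodePoint(utf8[i])) {
                count++;
            }
        }
        return count;
    }

    // Byte length of the longest prefix holding at most maxChars code points.
    // Returns length unchanged when the cap is not exceeded.
    static int truncatedByteLength(byte[] utf8, int start, int length, int maxChars) {
        int count = 0;
        for (int i = start; i < start + length; i++) {
            if (startsCodePoint(utf8[i])) {
                if (count == maxChars) {
                    return i - start; // cut just before code point number maxChars + 1
                }
                count++;
            }
        }
        return length;
    }
}
```

With something along these lines, a caller would only need to shrink the logical length of the existing buffer to the returned value, so no String is created and no bytes are copied. The scan can also stop early once the cap is reached.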