[ORC-222] StringStatisticsImpl munges min/max during write - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: None
Component/s: encoding
Labels: None

Description

String statistics are collected using Text, which compares raw bytes that are assumed to be UTF-8. When the input contains invalid UTF-8 sequences and such a value is the minimum or maximum, the writer converts it to a java.lang.String, which replaces each invalid sequence with the replacement character (U+FFFD). This conversion happens here:

https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L611
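
The effect of that conversion can be reproduced with the JDK alone. The following is a minimal sketch, not the ORC code itself: the class name `Utf8RoundTrip` and the helper `roundTrip` are hypothetical, and the input is an arbitrary example containing the byte 0xFF, which is never valid in UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it does not use any ORC or Hadoop types.
class Utf8RoundTrip {
    // Decode the bytes as UTF-8 into a String, then encode back to UTF-8.
    // Malformed input is replaced with U+FFFD during decoding.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 'a', (byte) 0xFF};   // 'a' followed by an invalid byte
        byte[] after = roundTrip(raw);
        System.out.println(Arrays.toString(raw));    // [97, -1]
        System.out.println(Arrays.toString(after));  // [97, -17, -65, -67]
        System.out.println(Arrays.equals(raw, after)); // false
    }
}
```

The invalid byte comes back as the three-byte encoding of U+FFFD (EF BF BD), so the bytes that would be recorded differ from the bytes that were in the data.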

To avoid this conversion, the writer should use the `setMinimumBytes` Protocol Buffers API (and the corresponding setter for the maximum) instead, so that the raw bytes are stored unchanged.

The same issue exists during read, where the bytes are round-tripped through java.lang.String. The read code is here:

https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L528
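
One possible consequence is that a recorded bound no longer bounds the data. The sketch below illustrates this under stated assumptions: it assumes byte values are ordered by unsigned lexicographic comparison (which is how the description's "compares raw bytes" is read here), and the class `StatBoundsSketch` with its helpers `compareBytes` and `viaString` is hypothetical, not part of ORC.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the comparison is an assumed stand-in for
// raw-byte ordering, not the actual ORC or Hadoop implementation.
class StatBoundsSketch {
    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Simulates a statistic passing through java.lang.String.
    static byte[] viaString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] value = {(byte) 0xFF};          // invalid UTF-8, largest single byte
        byte[] recordedMax = viaString(value); // becomes EF BF BD (U+FFFD)
        // A positive result means the value sorts above its own recorded maximum.
        System.out.println(compareBytes(value, recordedMax) > 0); // true
    }
}
```

In this example the value 0xFF sorts above the recorded maximum EF BF BD, so a reader comparing raw bytes against that maximum could draw the wrong conclusion about the column's range.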

People

Assignee: Unassigned

Reporter: Dain Sundstrom

Votes: 0

Watchers: 2

Dates

Created: 08/Aug/17 00:21

Updated: 18/Jan/18 16:48

Resolved: 18/Jan/18 16:48