Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.6.0
- Fix Version/s: None
- Component/s: None
Description
The following test case shows the problem:
byte[] bytes = new byte[] { 49 };
BinaryStatistics reusableStats = new BinaryStatistics();
reusableStats.updateStats(Binary.fromByteArray(bytes));
bytes[0] = 50;
reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));
assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());
I discovered the bug while converting an Avro file to a Parquet file, reading GenericRecords with the DataFileStream.next(D reuse) method. The problem is that the byte array backing Avro's Utf8 object is passed to Parquet, which stores it as part of BinaryStatistics; the same array is then modified in place on the next read, corrupting the recorded statistics.
I am not sure what the right way to fix the problem is (in BinaryStatistics or in AvroWriteSupport).
If the BinaryStatistics implementation is correct as-is (for performance reasons), then this behavior should be documented, and AvroWriteSupport.fromAvroString should be fixed to duplicate the underlying Utf8 array.
I am happy to create a pull request once the desired way to fix the issue has been discussed.
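The underlying issue is reference aliasing: if statistics keep a reference to the caller's array instead of a copy, mutating that array in place later also changes the recorded min/max. A minimal, library-free sketch of the defensive-copy fix (the SimpleBinaryStats class and its methods are hypothetical stand-ins, not the Parquet implementation):

```java
import java.util.Arrays;

// Hypothetical stand-in for BinaryStatistics that stores min/max byte arrays.
class SimpleBinaryStats {
    private byte[] min, max;

    // Defensive copy: clone the input so that later in-place mutation of the
    // caller's array (e.g. Avro reusing a Utf8 buffer) cannot change the
    // statistics already recorded here.
    void update(byte[] value) {
        byte[] copy = Arrays.copyOf(value, value.length);
        if (min == null || compare(copy, min) < 0) min = copy;
        if (max == null || compare(copy, max) > 0) max = copy;
    }

    byte[] getMin() { return min; }
    byte[] getMax() { return max; }

    // Unsigned lexicographic comparison of byte arrays.
    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        SimpleBinaryStats stats = new SimpleBinaryStats();
        byte[] bytes = new byte[] { 49 };
        stats.update(bytes);
        bytes[0] = 50; // reuse the same array, as the Avro reader does
        stats.update(bytes);
        System.out.println(Arrays.toString(stats.getMin())); // [49]
        System.out.println(Arrays.toString(stats.getMax())); // [50]
    }
}
```

With the copy in place, the test case from the description would pass: the first value 49 survives as the minimum even after the shared buffer is overwritten with 50.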
Issue Links
- duplicates: PARQUET-251 "Binary column statistics error when reuse byte[] among rows" (Resolved)