[PARQUET-1407] Data loss on duplicate values with AvroParquetWriter/Reader - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.9.0, 1.10.0, 1.8.3
Fix Version/s: 1.11.0
Component/s: parquet-avro
Labels:
- pull-request-available

Description

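import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class Blah {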

  private static Path parquetFile = new Path("oops");
  private static Schema schema = SchemaBuilder.record("spark_schema")
      .fields().optionalBytes("value").endRecord();

  private static GenericData.Record recordFor(String value) {
    return new GenericRecordBuilder(schema)
        .set("value", value.getBytes()).build();
  }

  public static void main(String ... args) throws IOException {
    try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
          .<GenericData.Record>builder(parquetFile)
          .withSchema(schema)
          .build()) {
      writer.write(recordFor("one"));
      writer.write(recordFor("two"));
      writer.write(recordFor("three"));
      writer.write(recordFor("three"));
      writer.write(recordFor("two"));
      writer.write(recordFor("one"));
      writer.write(recordFor("zero"));
    }

    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(parquetFile)
        .withConf(new Configuration()).build()) {
      GenericRecord rec;
      int i = 0;
      while ((rec = reader.read()) != null) {
        ByteBuffer buf = (ByteBuffer) rec.get("value");
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        System.out.println("rec " + i++ + ": " + new String(bytes));
      }
    }
  }
}

Expected output:

rec 0: one
rec 1: two
rec 2: three
rec 3: three
rec 4: two
rec 5: one
rec 6: zero

Actual:

rec 0: one
rec 1: two
rec 2: three
rec 3: 
rec 4: 
rec 5: 
rec 6: zero

This was found when we unexpectedly started getting empty byte[] values back in Spark (Spark 2.3.1 with Parquet 1.8.3). I have not tried to reproduce it with Parquet 1.9.0, but it's a bad enough bug that I would like a 1.8.4 release that I can drop in as a replacement for 1.8.3 without any binary compatibility issues.

Duplicate byte[] values are lost: in the output above, every repeated value (records 3 to 5) reads back as empty.

A few clues:

If I do not call ByteBuffer.get, the value of ByteBuffer.remaining() does not go to zero. I suspect that a ByteBuffer is being recycled and that the call to ByteBuffer.get mutates it. I wonder if an appropriately placed ByteBuffer.duplicate() would fix it.
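
The suspicion above can be illustrated with plain JDK code, independent of Parquet. This is only a sketch of the hypothesis, not the parquet-avro implementation: the class SharedBufferDemo and its helpers drain and peek are hypothetical names made up for the illustration. It shows that reading a reused ByteBuffer with get(byte[]) consumes it, so a second read of the same instance comes back empty, while reading through duplicate() leaves the original position untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of parquet-avro.
public class SharedBufferDemo {

  // Reads the remaining bytes by advancing the buffer's own position,
  // the same way the reader loop in the report does.
  static String drain(ByteBuffer buf) {
    byte[] bytes = new byte[buf.remaining()];
    buf.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Reads the remaining bytes through a duplicate. The duplicate shares the
  // content but has its own position, so the original is left unchanged.
  static String peek(ByteBuffer buf) {
    return drain(buf.duplicate());
  }

  public static void main(String[] args) {
    // Assumption for illustration: the same buffer instance is handed out
    // for every occurrence of a repeated value.
    ByteBuffer shared = ByteBuffer.wrap("three".getBytes(StandardCharsets.UTF_8));
    System.out.println("drain 1: " + drain(shared)); // prints "drain 1: three"
    System.out.println("drain 2: " + drain(shared)); // prints "drain 2: " (already consumed)

    ByteBuffer reused = ByteBuffer.wrap("three".getBytes(StandardCharsets.UTF_8));
    System.out.println("peek 1: " + peek(reused));   // prints "peek 1: three"
    System.out.println("peek 2: " + peek(reused));   // prints "peek 2: three"
  }
}
```

If the hypothesis holds, the pattern in peek would also serve as a caller-side workaround in the reader loop until a fixed release is available, at the cost of one extra buffer object per value.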

Issue Links

links to

GitHub Pull Request #551

GitHub Pull Request #552

People

Assignee: Nándor Kollár

Reporter: Scott Carey

Votes: 1

Watchers: 5

Dates

Created: 29/Aug/18 23:32

Updated: 19/Nov/18 22:16

Resolved: 19/Nov/18 22:16