Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.11.0, 1.8.3
Fix Version/s: None
Component/s: None
Description
Please see sample code below:
Schema schema = new Schema.Parser().parse("""
    {
      "type": "record",
      "name": "person",
      "fields": [
        {
          "name": "address",
          "type": ["null", {"type": "array", "items": "string"}],
          "default": null
        }
      ]
    }
    """);

ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new org.apache.hadoop.fs.Path("/tmp/person.parquet"))
    .withSchema(schema)
    .build();

try {
    // To trigger the exception, write an array with a null element.
    writer.write(new GenericRecordBuilder(schema)
        .set("address", Arrays.asList("first", null, "last"))
        .build());
} catch (Exception e) {
    e.printStackTrace(); // "java.lang.NullPointerException: Array contains a null element at 1"
}

try {
    // At this point all future calls to writer.write fail.
    writer.write(new GenericRecordBuilder(schema)
        .set("address", Arrays.asList("foo", "bar"))
        .build());
} catch (Exception e) {
    e.printStackTrace(); // "org.apache.parquet.io.InvalidRecordException: 1(r) > 0 ( schema r)"
}

writer.close();
It seems to me this is caused by writer state not being reset between writes. Is this the intended behavior of the writer? And if so, does one have to create a new writer whenever a write fails?
I'm able to reproduce this with both Parquet 1.8.3 and 1.11.0, and have attached a sample Parquet file for each version.
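As a possible interim workaround (only a sketch on my side, not a confirmed fix), one could validate each record with Avro's GenericData.validate before handing it to the writer, so records that would trigger the NullPointerException never reach it. The SafeWriter/writeValidated helper below is purely illustrative and not part of the Parquet API:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.hadoop.ParquetWriter;

// Illustrative helper only; "SafeWriter" and "writeValidated" are not part of Parquet.
class SafeWriter {
    static void writeValidated(ParquetWriter<GenericRecord> writer,
                               Schema schema,
                               GenericRecord record) throws IOException {
        // GenericData.validate returns false for a record whose array contains a null
        // element when the array items are non-nullable strings, i.e. exactly the case
        // that triggers the NullPointerException above. Rejecting such a record here
        // keeps the writer from ever reaching the broken state.
        if (!GenericData.get().validate(schema, record)) {
            throw new IllegalArgumentException("Record does not match schema: " + record);
        }
        writer.write(record);
    }
}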