Details
Bug
Status: Open
Minor
Resolution: Unresolved
1.11.0
None
None
Linux, Apache Beam 2.28.0, Java 11
Description
I ran into what looks like a bug in the parquet-avro reading code when trying to read a file written with an older version of a schema using a new, evolved version of that schema.
I'm using Apache Beam's ParquetIO library, which supports passing in a schema to use for "projection", and I was investigating whether that would work for me here. It didn't: the read failed, complaining that my new reader schema had a field that wasn't in the writer schema.
I traced this through to a couple of places in the parquet-avro code that don't look right to me:
First, in `prepareForRead` here: https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L116
The `parquetSchema` variable comes from `parquetSchema = readContext.getRequestedSchema();`, while the `avroSchema` variable comes from the parquet file itself with `avroSchema = new Schema.Parser().parse(keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY));`.
I can verify that `parquetSchema` is the schema I'm requesting the data be projected to and that `avroSchema` is the schema from the file, but the naming looks backward: shouldn't `parquetSchema` be the one from the parquet file?
Following the stack down, I was hitting this line: https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L91
Here it was failing because the `avroSchema` didn't have a field that was in the `parquetSchema`, with the variables assigned in the same way as above. That is exactly the case I was hoping to use projection for, though: reading the record with the new reader schema, with the new field filled in from its default value in the new schema. In fact, the comment on line 101, "store defaults for any new Avro fields from avroSchema that are not in the writer schema (parquetSchema)", suggests that the intent was for this to work, but the actual code has the writer schema in `avroSchema` and the reader schema in `parquetSchema`.
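To make the failure mode concrete, here is a minimal sketch of the two behaviors, not the parquet-avro code itself. All names in it (`FieldLookupSketch`, `strictPosition`, `lenientPosition`) are hypothetical, and schemas are reduced to plain lists of field names. The strict lookup models what I'm seeing (a requested field missing from the file's schema is an error); the lenient lookup models what I expected (a missing field signals that the reader default should be used).

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical model for illustration; none of these names come from parquet-avro.
class FieldLookupSketch {

    // Strict lookup: a requested field that the file schema lacks is an error.
    static int strictPosition(List<String> fileFields, String requestedField) {
        int pos = fileFields.indexOf(requestedField);
        if (pos < 0) {
            throw new IllegalArgumentException(
                "requested field not found in file schema: " + requestedField);
        }
        return pos;
    }

    // Lenient lookup: an empty result tells the caller to fall back to the reader default.
    static OptionalInt lenientPosition(List<String> fileFields, String requestedField) {
        int pos = fileFields.indexOf(requestedField);
        return pos < 0 ? OptionalInt.empty() : OptionalInt.of(pos);
    }

    public static void main(String[] args) {
        // Field names of the older writer schema (assumed example data).
        List<String> fileFields = Arrays.asList("id", "name");

        // "nickname" exists only in the newer reader schema.
        System.out.println(lenientPosition(fileFields, "nickname").isPresent());
        try {
            strictPosition(fileFields, "nickname");
        } catch (IllegalArgumentException e) {
            System.out.println("strict lookup failed: " + e.getMessage());
        }
    }
}
```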
(Additionally, I'd want this to support schema evolution both for adding an optional field and for removing an old field, so just flipping the names around would still break if the reader schema dropped a field that is in the writer schema.)
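For reference, this is roughly the resolution behavior I was hoping for, sketched in plain Java under my reading of the Avro specification's schema resolution rules: a writer field absent from the reader schema is ignored, a reader field absent from the writer schema takes its default, and a reader field with no default and no written value is an error. The class and method names (`SchemaResolutionSketch`, `ReaderField`, `resolve`) are hypothetical, and records are modeled as simple maps rather than real Avro or Parquet types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not the parquet-avro or Avro API.
class SchemaResolutionSketch {

    // A reader-schema field. hasDefault is tracked separately because the
    // default of an optional field may legitimately be null.
    static final class ReaderField {
        final String name;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        static ReaderField required(String name) {
            return new ReaderField(name, false, null);
        }

        static ReaderField withDefault(String name, Object defaultValue) {
            return new ReaderField(name, true, defaultValue);
        }
    }

    // Builds the record as seen through the reader schema.
    static Map<String, Object> resolve(Map<String, Object> written,
                                       List<ReaderField> readerFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (ReaderField f : readerFields) {
            if (written.containsKey(f.name)) {
                out.put(f.name, written.get(f.name));   // present in both schemas
            } else if (f.hasDefault) {
                out.put(f.name, f.defaultValue);        // added field: use reader default
            } else {
                throw new IllegalStateException(
                    "no written value and no default for field: " + f.name);
            }
        }
        // Written fields that the reader schema dropped are never copied.
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("id", 1L);
        written.put("legacy", "old value");             // removed in the reader schema

        List<ReaderField> readerFields = Arrays.asList(
            ReaderField.required("id"),
            ReaderField.withDefault("nickname", null)); // added optional field

        System.out.println(resolve(written, readerFields));
    }
}
```

Driving the loop from the reader schema rather than the writer schema is what lets both directions work: dropped fields are simply never visited, and added fields are handled by the default branch.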
I'd like to understand whether I'm interpreting this correctly, or whether there's another path that's intended to be used.
Thank you!