Description
When reading an Avro dataset (using the dataset's own schema or overriding it with 'avroSchema'), or when writing an Avro dataset with a schema provided via 'avroSchema', Catalyst-to-Avro fields are currently matched by field name.
This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that matching by name is a much more sensible default, I propose that we make this behavior configurable through an option on the Avro datasource. Even at the time SPARK-27762 was handled, there was interest in making this behavior configurable, but it appears to have gone unaddressed.
There is precedent for making this behavior configurable: SPARK-32864 added this support for ORC. Beyond that precedent, Hive performs matching positionally (ref), so this is behavior that Hadoop/Hive ecosystem users are familiar with:
Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance.
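To make the two strategies concrete, here is a minimal sketch in plain Python. It is not Spark code: the helper names `match_by_name` and `match_by_position`, the `(name, type)` tuple representation, and the sample schemas are all hypothetical and exist only to illustrate the difference described above.

```python
from typing import List, Tuple

# Hypothetical stand-in for a schema field: (field name, type name).
Field = Tuple[str, str]


def match_by_name(catalyst: List[Field], avro: List[Field]) -> List[Tuple[int, int]]:
    """Pair each Catalyst field index with the index of the Avro field of the same name."""
    avro_index = {name: i for i, (name, _) in enumerate(avro)}
    pairs = []
    for i, (name, _) in enumerate(catalyst):
        if name not in avro_index:
            raise ValueError(f"no Avro field named {name!r}")
        pairs.append((i, avro_index[name]))
    return pairs


def match_by_position(catalyst: List[Field], avro: List[Field]) -> List[Tuple[int, int]]:
    """Pair the i-th Catalyst field with the i-th Avro field, ignoring names."""
    if len(catalyst) != len(avro):
        raise ValueError("schemas have different numbers of fields")
    return [(i, i) for i in range(len(catalyst))]


# Illustrative schemas (assumptions, not taken from the issue).
catalyst_schema = [("id", "long"), ("name", "string")]
reordered_avro = [("name", "string"), ("id", "long")]
renamed_avro = [("user_id", "long"), ("full_name", "string")]
```

With `reordered_avro`, matching by name pairs the fields correctly while positional matching would pair `id` with `name`. With `renamed_avro`, matching by name fails outright, whereas positional matching pairs the fields as a Hive user would expect.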
Issue Links
- is related to
  - SPARK-27762 Support user provided avro schema for writing fields with different ordering (Resolved)
  - SPARK-35918 Consolidate logic between AvroSerializer/AvroDeserializer for schema mismatch handling and error messages (Resolved)
  - SPARK-32864 Support ORC forced positional evolution (Resolved)
- relates to
  - SPARK-34378 Loosen AvroSerializer validation to allow extra nullable user-provided fields (Resolved)