Description
Short Description
When reading complex union types from Avro files, information is lost: the name of each record in the union is omitted and member$i is returned instead.
Long Description
Error
Given the Avro schema in schema.avsc, I would expect the schema printed by read_avro.py to match expected.txt. Instead, I get the schema in reality.txt, where RecordOne became member0 and RecordTwo became member1.
This causes information loss and makes the DataFrame unusable.
From my understanding, this behavior was implemented here.
read_avro.py
df = spark.read.format("avro").load("path/to/my/file.avro")
df.printSchema()
schema.avsc
{
  "type": "record",
  "name": "SomeData",
  "namespace": "my.name.space",
  "fields": [
    {
      "name": "ts",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "field_id",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "values",
      "type": [
        {
          "type": "record",
          "name": "RecordOne",
          "fields": [
            { "name": "field_a", "type": "long" },
            {
              "name": "field_b",
              "type": {
                "type": "enum",
                "name": "FieldB",
                "symbols": ["..."]
              }
            },
            {
              "name": "field_C",
              "type": { "type": "array", "items": "long" }
            }
          ]
        },
        {
          "type": "record",
          "name": "RecordTwo",
          "fields": [
            { "name": "field_a", "type": "long" }
          ]
        }
      ]
    }
  ]
}
expected.txt
root
 |-- ts: timestamp (nullable = true)
 |-- field_id: string (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- RecordOne: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
 |    |    |-- field_b: string (nullable = true)
 |    |    |-- field_c: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- RecordTwo: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
reality.txt
root
 |-- ts: timestamp (nullable = true)
 |-- field_id: string (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- member0: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
 |    |    |-- field_b: string (nullable = true)
 |    |    |-- field_c: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- member1: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
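Until the record names are preserved, one possible workaround is to recover the member$i to record-name mapping from the Avro schema itself. The following is only a sketch using the Python standard library, not Spark's implementation. The helper names union_member_names and branch_name are hypothetical, the schema is a trimmed copy of schema.avsc, and the numbering rule (members numbered in declaration order, with "null" branches skipped) is an assumption based on the output in reality.txt.

```python
import json

# Trimmed copy of schema.avsc (record fields shortened for brevity).
EXAMPLE_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "SomeData",
  "namespace": "my.name.space",
  "fields": [
    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "field_id", "type": ["null", "string"], "default": null},
    {"name": "values", "type": [
      {"type": "record", "name": "RecordOne",
       "fields": [{"name": "field_a", "type": "long"}]},
      {"type": "record", "name": "RecordTwo",
       "fields": [{"name": "field_a", "type": "long"}]}
    ]}
  ]
}
""")


def branch_name(branch):
    """Return a readable name for one union branch (hypothetical helper)."""
    if isinstance(branch, dict):
        # Named types (record, enum, fixed) carry "name"; others only "type".
        return branch.get("name", branch["type"])
    return branch  # primitive type or reference to a named type


def union_member_names(avro_schema):
    """Map each top-level complex-union field to {"member<i>": branch name}."""
    mapping = {}
    for field in avro_schema.get("fields", []):
        field_type = field["type"]
        if not isinstance(field_type, list):
            continue  # not a union
        # Assumption: "null" branches are dropped before members are numbered.
        branches = [b for b in field_type if b != "null"]
        if len(branches) < 2:
            continue  # e.g. ["null", "string"] is just a nullable string
        mapping[field["name"]] = {
            f"member{i}": branch_name(b) for i, b in enumerate(branches)
        }
    return mapping


if __name__ == "__main__":
    print(union_member_names(EXAMPLE_SCHEMA))
    # {'values': {'member0': 'RecordOne', 'member1': 'RecordTwo'}}
```

With such a mapping, the struct fields could then be renamed on the Spark side, for example by selecting values.member0 and aliasing it to RecordOne. This only restores the names after the fact and does not change what the Avro reader returns.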