A query against an Avro table can be quite slow when all of the following are true:
- There are many columns in the Avro file
- The query contains a wide projection
- There are many splits in the input
- Some of the splits are read serially (e.g., there are fewer executors than tasks)
A write to an Avro table can be quite slow when all of the following are true:
- There are many columns in the new rows
- The operation is creating many files
For example, a single-threaded query against a 6,000-column Avro data set with 50K rows spread across 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT.
The culprit appears to be this line of code:
For each split, AvroDeserializer will call this function once for each column in the projection, resulting in a potential O(n^2) lookup per split, where n is the number of columns.
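The cost pattern can be illustrated with a small sketch (this is a hypothetical model, not Spark's actual code): if matching one projected column against the file schema requires a linear scan over the field list, then resolving a projection of n columns costs on the order of n^2 comparisons per split.

```python
def find_field(fields, name):
    """Linear scan over the field list: O(n) per lookup.

    Models a find-by-name over Avro schema fields with
    case-insensitive matching (names here are plain strings).
    """
    for i, f in enumerate(fields):
        if f.lower() == name.lower():
            return i
    return None

def resolve_projection_linear(fields, projection):
    """Run once per split: one linear lookup per projected
    column, so a wide projection over n fields is O(n^2)."""
    return [find_field(fields, name) for name in projection]

# A wide projection over a wide schema triggers the quadratic case:
fields = [f"col{i}" for i in range(6000)]
positions = resolve_projection_linear(fields, fields)
```

With 6,000 columns this is roughly 36 million string comparisons per split, repeated for every split read serially, which is consistent with the slowdown described above.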
For each file, AvroSerializer will call this function once for each column, resulting in an O(n^2) lookup per file.
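A natural remedy (sketched here as an assumption, not necessarily the fix ultimately adopted) is to build a name-to-position index once per split or file, so that each subsequent lookup is O(1) and resolving all n columns drops from O(n^2) to O(n):

```python
def build_field_index(fields):
    """Built once per split/file: lower-cased field name -> position."""
    return {f.lower(): i for i, f in enumerate(fields)}

def resolve_projection_indexed(fields, projection):
    index = build_field_index(fields)                    # O(n), done once
    return [index.get(name.lower()) for name in projection]  # O(1) each
```

This preserves case-insensitive matching while making the per-column cost constant, so wide schemas no longer pay a quadratic penalty on every split read or file written.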