Description
A query against an Avro table can be quite slow when all of the following are true:
- There are many columns in the Avro file
- The query contains a wide projection
- There are many splits in the input
- Some of the splits are read serially (e.g., there are fewer executors than tasks)
A write to an Avro table can be quite slow when all of the following are true:
- There are many columns in the new rows
- The operation is creating many files
For example, a single-threaded query against a 6000-column Avro data set with 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT.
The culprit appears to be this line of code:
https://github.com/apache/spark/blob/3fb044e043a2feab01d79b30c25b93d4fd166b12/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala#L226
For each split, AvroDeserializer calls this function once for each column in the projection. If each call scans the schema's fields, a projection of n columns can cost on the order of n^2 field comparisons per split.
For each file, AvroSerializer calls this function once for each column, which likewise costs on the order of n^2 field comparisons per file.
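The quadratic pattern described above can be illustrated with a minimal, self-contained sketch. It does not use Spark or Avro: Field, FieldLookup, findByScan, buildIndex, and findByIndex are hypothetical names invented for illustration, and the sketch assumes that field names are unique and matched exactly. It contrasts a per-call scan over all fields with an index built once per schema, which is one possible way to avoid the repeated scans, not necessarily the right fix for Spark.

```scala
// Hypothetical stand-in for a schema field; this is NOT org.apache.avro.Schema.Field.
final case class Field(name: String, position: Int)

object FieldLookup {
  // Per-call linear scan: one lookup walks up to n fields,
  // so n lookups cost on the order of n^2 comparisons.
  def findByScan(fields: Seq[Field], name: String): Option[Field] =
    fields.find(_.name == name)

  // Build a name -> field index once per schema (assumes unique field names).
  def buildIndex(fields: Seq[Field]): Map[String, Field] =
    fields.map(f => f.name -> f).toMap

  // Each lookup against the prebuilt index is a single hash probe.
  def findByIndex(index: Map[String, Field], name: String): Option[Field] =
    index.get(name)
}

object Demo {
  def main(args: Array[String]): Unit = {
    // 6000 columns, matching the size of the data set in the report.
    val fields = (0 until 6000).map(i => Field(s"col$i", i))
    val names = fields.map(_.name)

    // Quadratic: every name triggers a fresh scan of the field list.
    val scanned = names.flatMap(n => FieldLookup.findByScan(fields, n))

    // Linear overall: build the index once, then probe it per name.
    val index = FieldLookup.buildIndex(fields)
    val indexed = names.flatMap(n => FieldLookup.findByIndex(index, n))

    // Both strategies resolve the same fields in the same order.
    assert(scanned == indexed)
  }
}
```

Both strategies return the same fields. The difference is that the scan repeats its work on every call, while the index pays the cost of walking the schema once per split or file.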