[SPARK-35817] Queries against wide Avro tables can be slow - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.1, 3.1.2, 3.2.0
Fix Version/s: 3.2.0, 3.1.3
Component/s: SQL
Labels: None

Description

A query against an Avro table can be quite slow when all of the following are true:

There are many columns in the Avro file
The query contains a wide projection
There are many splits in the input
Some of the splits are read serially (e.g., there are fewer executors than tasks)

A write to an Avro table can be quite slow when all of the following are true:

There are many columns in the new rows
The operation is creating many files

For example, a single-threaded query against a 6000-column Avro data set with 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT.

The culprit appears to be this line of code:
https://github.com/apache/spark/blob/3fb044e043a2feab01d79b30c25b93d4fd166b12/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala#L226

For each split, AvroDeserializer will call this function once for each column in the projection. Each call is itself a lookup across the schema's columns, so a wide projection results in a potential n^2 lookup per split, where n is the number of columns.

For each file, AvroSerializer will call this function once for each column, resulting in an n^2 lookup per file.
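
To make the cost concrete, here is a minimal Python sketch of the pattern described above. It is not Spark's code: the helper names (`find_by_scan`, `build_index`, `resolve_all_by_scan`, `resolve_all_by_index`) are hypothetical, and it assumes each lookup scans every field name with a case-insensitive comparison. It counts comparisons for the scan-per-column approach and contrasts that with one possible remedy, building a name index once per schema.

```python
from collections import defaultdict


def find_by_scan(fields, name, counter):
    """Return all fields matching `name` (case-insensitive) by scanning every field."""
    matches = []
    for f in fields:
        counter[0] += 1  # one name comparison per field visited
        if f.lower() == name.lower():
            matches.append(f)
    return matches


def build_index(fields):
    """Group field names by lowercased name once, so later lookups are O(1)."""
    index = defaultdict(list)
    for f in fields:
        index[f.lower()].append(f)
    return index


def resolve_all_by_scan(fields, projection):
    """Resolve every projected column with a full scan: n lookups x n fields."""
    counter = [0]
    resolved = [find_by_scan(fields, name, counter) for name in projection]
    return resolved, counter[0]


def resolve_all_by_index(fields, projection):
    """Resolve every projected column through a prebuilt index: about 2n steps."""
    index = build_index(fields)  # n steps, done once per schema
    resolved = [index.get(name.lower(), []) for name in projection]
    return resolved, len(fields) + len(projection)


# Hypothetical 300-column schema with a projection of all columns.
fields = [f"col{i}" for i in range(300)]
scan_result, scan_steps = resolve_all_by_scan(fields, fields)
index_result, index_steps = resolve_all_by_index(fields, fields)
print(scan_steps, index_steps)  # prints "90000 600"
```

Under the same assumption, a 6000-column schema would need 6000 × 6000 = 36,000,000 comparisons per split or file, or 720,000,000 across 20 files, versus roughly 12,000 steps per file with an index.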

Issue Links

links to

[Github] Pull Request #32969 (bersprockets)

[Github] Pull Request #33072 (bersprockets)

People

Assignee: Bruce Robbins

Reporter: Bruce Robbins

Votes: 0

Watchers: 4

Dates

Created: 18/Jun/21 20:41

Updated: 25/Jun/21 04:15

Resolved: 25/Jun/21 04:15