Spark / SPARK-35817

Queries against wide Avro tables can be slow

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.1.2, 3.2.0
    • Fix Version/s: 3.2.0, 3.1.3
    • Component/s: SQL
    • Labels: None

    Description

      A query against an Avro table can be quite slow when all of the following are true:

      • There are many columns in the Avro file
      • The query contains a wide projection
      • There are many splits in the input
      • Some of the splits are read serially (e.g., there are fewer executors than tasks)

      A write to an Avro table can be quite slow when all of the following are true:

      • There are many columns in the new rows
      • The operation is creating many files

      For example, a single-threaded query against a 6000-column Avro data set with 50K rows in 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT.

      The culprit appears to be this line of code:
      https://github.com/apache/spark/blob/3fb044e043a2feab01d79b30c25b93d4fd166b12/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala#L226

      For each split, AvroDeserializer calls this function once for each column in the projection, resulting in potentially O(n^2) lookup work per split, where n is the number of columns.

      For each file, AvroSerializer calls this function once for each column, resulting in O(n^2) lookup work per file.
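
      The quadratic pattern described above can be illustrated with a minimal Scala sketch. It is not Spark's code and not necessarily how the issue was fixed: Field, FieldLookupSketch, and every method name below are hypothetical stand-ins. It assumes each lookup scans the field list linearly, and contrasts that with building a name index once per schema so that each lookup becomes a hash probe.

```scala
import java.util.Locale

// Hypothetical stand-in for an Avro record field; not the Avro library's Schema.Field.
final case class Field(name: String, pos: Int)

object FieldLookupSketch {
  // Case-insensitive key used by both lookup strategies.
  private def key(name: String): String = name.toLowerCase(Locale.ROOT)

  // Linear scan: each call compares the requested name against fields in turn.
  def findByScan(fields: Seq[Field], name: String): Option[Field] = {
    val wanted = key(name)
    fields.find(f => key(f.name) == wanted)
  }

  // Build the name -> field index once per schema.
  // Names that differ only by case are not handled in this sketch.
  def buildIndex(fields: Seq[Field]): Map[String, Field] =
    fields.map(f => key(f.name) -> f).toMap

  // Hash lookup against the prebuilt index.
  def findByIndex(index: Map[String, Field], name: String): Option[Field] =
    index.get(key(name))

  // n lookups, each scanning up to n fields: O(n^2) per split or file.
  def resolveAllByScan(fields: Seq[Field], projection: Seq[String]): Seq[Option[Field]] =
    projection.map(findByScan(fields, _))

  // One O(n) index build, then n constant-time lookups: O(n) per split or file.
  def resolveAllByIndex(fields: Seq[Field], projection: Seq[String]): Seq[Option[Field]] = {
    val index = buildIndex(fields)
    projection.map(findByIndex(index, _))
  }
}
```

      Both resolveAll variants return the same result for a schema whose names are unique ignoring case; only the amount of work differs, and the difference grows with the number of columns and with the number of splits or files processed serially.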

          People

            Assignee: Bruce Robbins (bersprockets)
            Reporter: Bruce Robbins (bersprockets)