Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-40038

spark.sql.files.maxPartitionBytes does not observe on-disk compression

    XMLWordPrintableJSON

Details

    • Question
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.2.0
    • None
    • Input/Output, Optimizer, PySpark, SQL
    • None

    Description

      Why does `spark.sql.files.maxPartitionBytes` estimate the number of partitions based on file size on disk instead of the uncompressed file size?

      For example I have a dataset that is 213GB on disk. When I read this in to my application I get 2050 partitions based on the default value of 128MB for maxPartitionBytes. My application is a simple broadcast index join that adds 1 column to the dataframe and writes it out. There is no shuffle.

      Initially the size of input /output records seem ok, but I still get a large amount of memory "spill" on the executors. I believe this is due to the data being highly compressed and each partition becoming too big when it is deserialized to work on in memory.

      (If I try to do a repartition immediately after reading I still see the first stage spilling memory to disk, so that is not the right solution or what I'm interested in.) 

      Instead, I attempt to lower maxPartitionBytes by the (average) compression ratio of my files (about 7x, so let's round up to 8). So I set maxPartitionBytes=16MB.  At this point  I see that spark is reading in from the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial file read and completes with no spillage. 


       
      Is there something I'm missing here? Is this just intended behavior? How can I tune my partition size correctly for my application when I do not know how much the data will be compressed ahead of time?

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              marcusrm RJ Marcus
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: