Spark / SPARK-40038

spark.sql.files.maxPartitionBytes does not observe on-disk compression


Details

    • Type: Question
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: Input/Output, Optimizer, PySpark, SQL
    • Labels: None

    Description

      Why does `spark.sql.files.maxPartitionBytes` estimate the number of partitions based on file size on disk instead of the uncompressed file size?

      For example, I have a dataset that is 213 GB on disk. When I read it into my application I get 2050 partitions, based on the default `maxPartitionBytes` of 128 MB. My application is a simple broadcast index join that adds one column to the dataframe and writes it out; there is no shuffle.
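
      For reference, here is roughly how the read looks (a minimal PySpark sketch; the app name and path are placeholders, and I am assuming Parquet input):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Default target split size: 128 MB of *on-disk* (i.e. compressed) bytes.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # 128MB

# 213 GB on disk / 128 MB per split ~= 2050 partitions. The exact count
# also depends on spark.sql.files.openCostInBytes and on file boundaries,
# since a split never spans files.
df = spark.read.parquet("s3://my-bucket/my-dataset/")  # hypothetical path
print(df.rdd.getNumPartitions())  # ~2050
```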

      Initially the input/output record sizes look OK, but I still get a large amount of memory "spill" on the executors. I believe this is because the data is highly compressed, so each partition becomes too big once it is deserialized to work on in memory.

      (If I repartition immediately after reading, the first stage still spills memory to disk, so that is not the right solution or what I'm interested in.)
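
      Roughly what that attempt looked like (same hypothetical path; the target count is only illustrative):

```python
# Repartitioning does not help because the exchange happens *after* the
# scan stage: each scan task still deserializes a full ~128 MB compressed
# split (~1 GB uncompressed at ~8x) before anything is shuffled, so the
# first stage spills regardless of the downstream partition count.
df = spark.read.parquet("s3://my-bucket/my-dataset/").repartition(16000)
```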

      Instead, I try lowering `maxPartitionBytes` by the (average) compression ratio of my files (about 7x, rounded up to 8), so I set `maxPartitionBytes` to 16 MB. At this point I see that Spark reads from the files in 12-28 MB chunks. Now it creates 14316 partitions on the initial file read and completes with no spill.
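
      Concretely (a sketch; the numbers are the ones above, the path is still a placeholder):

```python
# 128 MB target / ~8x compression = 16 MB of compressed bytes per split,
# so each task holds roughly the intended ~128 MB once deserialized.
spark.conf.set("spark.sql.files.maxPartitionBytes", "16MB")

df = spark.read.parquet("s3://my-bucket/my-dataset/")
print(df.rdd.getNumPartitions())  # 14316 in my run; no spill
```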


       
      Is there something I'm missing here? Is this the intended behavior? How can I tune my partition size correctly for my application when I do not know ahead of time how much the data will be compressed?
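
      The best workaround I can think of is to estimate the ratio myself before configuring the session, e.g. from one file's footer metadata (a sketch, not Spark behavior; assumes Parquet input, uses pyarrow, and the file name is hypothetical):

```python
import pyarrow.parquet as pq

# Sum compressed vs. uncompressed column-chunk sizes from one representative
# file's footer (cheap: only metadata is read, no data pages).
md = pq.ParquetFile("part-00000.snappy.parquet").metadata
compressed = uncompressed = 0
for rg in range(md.num_row_groups):
    row_group = md.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size

ratio = uncompressed / compressed  # ~7x for my data
# Scale the split size so each task deserializes ~128 MB, not ~1 GB.
spark.conf.set(
    "spark.sql.files.maxPartitionBytes",
    str(int(128 * 1024 * 1024 / ratio)),
)
```

      But that only works if the sampled file is representative, which is exactly the part I can't guarantee ahead of time.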

      Attachments

      Issue Links

      Activity


      People

          Assignee: Unassigned
          Reporter: RJ Marcus (marcusrm)

      Dates

          Created:
          Updated:
