  1. Spark
  2. SPARK-41386

There are some small files when using rebalance(column)

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Problem (REBALANCE(column)):

       SparkSession config:

      config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
      config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
      config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5")

      With these settings, we expect every output file to be at least 20m * 0.5 = 10m.

      In fact, some of the output files are smaller than that:

      -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00000-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
      -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00001-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
      -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00002-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
      -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00003-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
      -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 .../part-00004-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
      -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 .../part-00005-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet

      The 9.1 M and 3.0 M files are both smaller than 10 M, so we have to handle these small files some other way.
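
      The expectation above can be checked with a short calculation. This is only a sketch of the arithmetic in this report, not of how Spark decides partition sizes internally. It assumes the on-disk sizes in the listing can be compared directly with the advisory size, as the report does. The helper names (parse_mb, expected_minimum_mb, undersized) are hypothetical and exist only for this illustration.

```python
import re

# Values taken from the SparkSession config in this report.
ADVISORY_SIZE = "20m"         # spark.sql.adaptive.advisoryPartitionSizeInBytes
SMALL_PARTITION_FACTOR = 0.5  # spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor

# File sizes as printed in the listing ("M"), part-00000 .. part-00005.
OBSERVED_MB = [12.1, 12.1, 12.1, 12.1, 9.1, 3.0]


def parse_mb(size: str) -> float:
    """Hypothetical helper: parse a size string such as '20m' into megabytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*m", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size string: {size!r}")
    return float(match.group(1))


def expected_minimum_mb(advisory: str, factor: float) -> float:
    """Minimum file size the report expects: advisory size times the factor."""
    return parse_mb(advisory) * factor


def undersized(sizes_mb, minimum_mb):
    """Return (part index, size) pairs for files below the expected minimum."""
    return [(i, s) for i, s in enumerate(sizes_mb) if s < minimum_mb]


minimum = expected_minimum_mb(ADVISORY_SIZE, SMALL_PARTITION_FACTOR)
small = undersized(OBSERVED_MB, minimum)

print(f"expected minimum: {minimum} M")
for index, size in small:
    print(f"part-{index:05d} is {size} M, below the expected minimum")
```

      Run against the listing above, this flags part-00004 and part-00005, the same two files called out in the description.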

          People

            Assignee: Unassigned
            Reporter: Zhe Dong (dongz)
            Votes: 0
            Watchers: 3
