Details
Improvement
Status: In Progress
Minor
Resolution: Unresolved
3.3.1
None
None
Description
Problem (REBALANCE(column)):
SparkSession config:
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5")
So we expect every output file to be at least 20m * 0.5 = 10m in size.
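The lower bound implied by these two settings can be worked out explicitly. The following is a minimal sketch in plain Python, not Spark code: `parse_size` is a hypothetical helper written for this illustration, and it assumes the `m` suffix means mebibytes (1024 * 1024 bytes), as Spark's byte-size strings do.

```python
# Hypothetical helper for illustration only; it is not part of Spark.
def parse_size(text: str) -> int:
    """Convert a size string such as '20m' into a number of bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # no suffix: treat the value as plain bytes


# Values taken from the SparkSession config in this report.
advisory_bytes = parse_size("20m")   # advisoryPartitionSizeInBytes
small_partition_factor = 0.5         # rebalancePartitionsSmallPartitionFactor

# Expected minimum size of a partition after rebalancing.
min_expected_bytes = int(advisory_bytes * small_partition_factor)
print(min_expected_bytes)  # 10485760 bytes, i.e. 10 MiB
```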
In fact, we got some smaller files, as shown below:
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00000-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00001-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00002-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00003-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-00004-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-00005-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
The 9.1 M and 3.0 M files are smaller than 10 M, so we have to handle these small files in another way.
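To make the comparison concrete, here is a small sketch in plain Python that flags the files from the listing above that fall under the expected minimum. The sizes are copied from the listing, the shortened `part-0000N` keys are labels chosen for this illustration rather than full paths, and the variable names are assumptions.

```python
# Sizes in M as reported by the listing in this report; keys are shortened
# labels for the part files, not their full paths.
observed_sizes_m = {
    "part-00000": 12.1,
    "part-00001": 12.1,
    "part-00002": 12.1,
    "part-00003": 12.1,
    "part-00004": 9.1,
    "part-00005": 3.0,
}

# advisoryPartitionSizeInBytes (20m) * rebalancePartitionsSmallPartitionFactor (0.5)
expected_min_m = 20 * 0.5

# Collect every file that is smaller than the expected minimum.
below_minimum = [
    name for name, size_m in observed_sizes_m.items() if size_m < expected_min_m
]
print(below_minimum)  # ['part-00004', 'part-00005']
```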