Details
- Type: Question
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.3.0
- Fix Version/s: None
- Component/s: None
- Labels: spark3.3.0
Description
Question:
When using OptimizeSkewInRebalancePartitions to insert dynamic partitions (three-level partitioning) into a Hive table whose partitions are skewed, I found that with spark.sql.shuffle.partitions set to a relatively large value (10000), the written output does not respect the configured advisoryPartitionSizeInBytes: the skewed partition's data is processed by a single task and written to a single file. However, when I reduce spark.sql.shuffle.partitions to 2000, the skewed partition is optimized by OptimizeSkewInRebalancePartitions as expected: its data is split into batches and written out as multiple files.
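For context, OptimizeSkewInRebalancePartitions is expected to split any shuffle partition larger than the advisory target size into roughly advisory-sized chunks. A small arithmetic sketch of the expected behavior (the 10 GiB skewed-partition size is an assumed illustration, not a figure from this report):

```python
import math

# Expected splitting arithmetic for OptimizeSkewInRebalancePartitions.
# advisory size matches spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes = 512M below;
# the 10 GiB skewed dynamic-partition size is an assumption for illustration.
advisory_bytes = 512 * 1024**2
skewed_partition_bytes = 10 * 1024**3

# One oversized reduce partition should become ~20 tasks of ~512 MiB each,
# instead of one task writing a single 10 GiB file.
expected_splits = math.ceil(skewed_partition_bytes / advisory_bytes)
print(expected_splits)  # 20
```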
Spark AQE config:
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.advisoryPartitionSizeInBytes 128M
spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes 512M
spark.sql.finalStage.adaptive.coalescePartitions.minPartitionSize 128M
spark.sql.finalStage.adaptive.coalescePartitions.parallelismFirst false
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 1024M
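For reference, a minimal repro sketch of the kind of write that exercises OptimizeSkewInRebalancePartitions via a REBALANCE hint (the database, table, and partition-column names db.src, db.tgt, dt, hr, bucket are hypothetical, not taken from this report):

```sql
-- Hypothetical repro sketch; table and partition-column names are assumptions.
SET spark.sql.shuffle.partitions = 10000;   -- skewed partition is NOT split at this value
-- SET spark.sql.shuffle.partitions = 2000; -- skewed partition IS split at this value

INSERT OVERWRITE TABLE db.tgt PARTITION (dt, hr, bucket)
SELECT /*+ REBALANCE(dt, hr, bucket) */ *
FROM db.src;
```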
With spark.sql.shuffle.partitions = 10000: (attached screenshot)
With spark.sql.shuffle.partitions = 2000: (attached screenshot)
SQL time: (attached screenshot)
Plan: (attached screenshot)