Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.0
- Fix Version/s: None
- Component/s: None
Description
Currently, Spark adaptive execution uses the MapOutputStatistics information to adjust the plan dynamically, but this map output size does not respect the compression factor. As a result, there are cases where the original SparkPlan is a `SortMergeJoin`, but the plan after adaptive adjustment is changed to a `BroadcastHashJoin`, and this `BroadcastHashJoin` may cause OOMs due to the inaccurate size estimation.
Also, if the shuffle implementation is local shuffle (as in Intel's Spark adaptive execution implementation), it can in some cases cause a `Too large frame` exception.
So I propose to respect the compression factor in adaptive execution, or to use the `dataSize` metric of `ShuffleExchangeExec` in adaptive execution.
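The idea above can be sketched as follows. This is a minimal, self-contained Scala sketch (not actual Spark internals): `compressionFactor`, `broadcastThreshold`, and the helper names are illustrative assumptions, standing in for the real shuffle statistics and `spark.sql.autoBroadcastJoinThreshold` machinery. It shows why comparing compressed map output bytes directly against a broadcast threshold can wrongly pick `BroadcastHashJoin`:

```scala
// Hypothetical sketch of the proposed fix: scale the compressed map output
// sizes by an assumed compression factor before deciding on a broadcast join.
object JoinStrategySketch {
  // Illustrative values, not real Spark configuration keys.
  val compressionFactor: Double = 3.0               // assumed avg. decompression ratio
  val broadcastThreshold: Long = 10L * 1024 * 1024  // e.g. a 10 MB broadcast limit

  // MapOutputStatistics-style input: compressed bytes per map output partition.
  // Estimate the uncompressed size before comparing against the threshold.
  def estimatedUncompressedSize(compressedBytesByPartition: Seq[Long]): Long =
    (compressedBytesByPartition.sum * compressionFactor).toLong

  def shouldBroadcast(compressedBytesByPartition: Seq[Long]): Boolean =
    estimatedUncompressedSize(compressedBytesByPartition) <= broadcastThreshold

  def main(args: Array[String]): Unit = {
    // 4 MB compressed looks broadcastable, but ~12 MB estimated uncompressed
    // exceeds the 10 MB threshold, so the sketch keeps the SortMergeJoin.
    val sizes = Seq(1L, 1L, 2L).map(_ * 1024 * 1024)
    println(shouldBroadcast(sizes))
  }
}
```

Using the `dataSize` metric of `ShuffleExchangeExec` instead would sidestep the estimation entirely, since that metric already reflects uncompressed row data.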
Issue Links
- is related to SPARK-24914: totalSize is not a good estimate for broadcast joins (In Progress)