Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Done
-
3.3.2
Description
Inside of MapOutputTracker, there is a line of code which does a linear find through a mapStatuses collection: https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 (plus a similar search a few lines down at https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174)
This scan is necessary because we only know the mapId of the updated status and not its mapPartitionId.
We perform this scan once per migrated block, so if a large proportion of all blocks in the map are migrated then we get O(n^2) total runtime across all of the calls.
I think we might be able to fix this by extending ShuffleStatus to have an OpenHashMap mapping from mapId to mapPartitionId.
Attachments
Issue Links
- causes
-
SPARK-44658 ShuffleStatus.getMapStatus should return None instead of Some(null)
- Resolved
- links to