[SPARK-43043] Improve the performance of MapOutputTracker.updateMapOutput - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: 3.3.2
Fix Version/s: 3.5.0
Component/s: Spark Core
Labels:
- pull-request-available

Description

Inside MapOutputTracker, one line of code performs a linear search through the mapStatuses collection: https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 (a similar search appears a few lines below, at https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174).

This scan is necessary because we only know the mapId of the updated status and not its mapPartitionId.

We perform this scan once per migrated block, and each scan is linear in the size of mapStatuses. If a large proportion of the blocks are migrated, the total runtime across all of the calls is therefore O(n^2).

I think we might be able to fix this by extending ShuffleStatus to have an OpenHashMap mapping from mapId to mapPartitionId.
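
The idea can be illustrated with a minimal sketch. It is not the actual Spark code or the merged patch: `SimpleStatus` and `SimpleShuffleStatus` are hypothetical stand-ins for MapStatus and ShuffleStatus, and `scala.collection.mutable.HashMap` is used in place of Spark's OpenHashMap so that the example stays self-contained. The sketch keeps a secondary index from mapId to mapPartitionId alongside the array, so an update becomes a hash lookup instead of a scan.

```scala
import scala.collection.mutable

// Hypothetical stand-in for MapStatus: only the fields this sketch needs.
final case class SimpleStatus(mapId: Long, var location: String)

// Hypothetical, simplified stand-in for ShuffleStatus.
final class SimpleShuffleStatus(numPartitions: Int) {
  // Indexed by mapPartitionId; null means no output is registered yet.
  private val mapStatuses = new Array[SimpleStatus](numPartitions)

  // Secondary index: mapId -> mapPartitionId (the position in mapStatuses).
  private val mapIdToMapIndex = mutable.HashMap.empty[Long, Int]

  def addMapOutput(mapIndex: Int, status: SimpleStatus): Unit = {
    // Drop the index entry of any status that is being overwritten.
    val previous = mapStatuses(mapIndex)
    if (previous != null) mapIdToMapIndex.remove(previous.mapId)
    mapStatuses(mapIndex) = status
    mapIdToMapIndex(status.mapId) = mapIndex
  }

  // Linear scan, O(n) per call: the pattern described in the issue.
  def updateByScan(mapId: Long, newLocation: String): Boolean =
    mapStatuses.find(s => s != null && s.mapId == mapId) match {
      case Some(status) =>
        status.location = newLocation
        true
      case None => false
    }

  // Hash lookup, expected O(1) per call: the proposed alternative.
  def updateByIndex(mapId: Long, newLocation: String): Boolean =
    mapIdToMapIndex.get(mapId) match {
      case Some(mapIndex) =>
        mapStatuses(mapIndex).location = newLocation
        true
      case None => false
    }

  // Option(...) turns an empty (null) slot into None rather than Some(null).
  def locationOf(mapIndex: Int): Option[String] =
    Option(mapStatuses(mapIndex)).map(_.location)
}
```

In this sketch the index has to be maintained wherever a status is added or overwritten; a real implementation would also need to update it when outputs are removed, and to follow whatever synchronization ShuffleStatus already uses.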

Issue Links

causes

SPARK-44658 ShuffleStatus.getMapStatus should return None instead of Some(null)

Resolved

links to

[Github] Pull Request #40690 (jiangxb1987)

GitHub Pull Request #46706

People

Assignee: Xingbo Jiang

Reporter: Xingbo Jiang

Votes: 0

Watchers: 3

Dates

Created: 05/Apr/23 23:09

Updated: 23/May/24 02:20

Resolved: 16/May/23 18:37