Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-43043

Improve the performance of MapOutputTracker.updateMapOutput

    XMLWordPrintableJSON

Details

    Description

      Inside of MapOutputTracker, there is a line of code which does a linear find through a mapStatuses collection: https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 (plus a similar search a few lines down at https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174)

      This scan is necessary because we only know the mapId of the updated status and not its mapPartitionId.

      We perform this scan once per migrated block, so if a large proportion of all blocks in the map are migrated then we get O(n^2) total runtime across all of the calls.

      I think we might be able to fix this by extending ShuffleStatus to have an OpenHashMap mapping from mapId to mapPartitionId.

      Attachments

        Issue Links

          Activity

            People

              jiangxb1987 Xingbo Jiang
              jiangxb1987 Xingbo Jiang
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: