Spark / SPARK-11583

Make MapStatus use less memory


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      As noted in the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, using a BitSet can save roughly 20% of the memory that a RoaringBitmap uses.

      For a Spark job that contains a large number of tasks, however, a 20% saving is a drop in the ocean.

      Essentially, a BitSet is backed by a long[]: for example, a BitSet over 200k blocks occupies long[3125] (200000 bits / 64 bits per long), regardless of how many bits are set.
      So we can instead use a HashSet[Int] to store reduce IDs: when non-empty blocks are dense, store the reduce IDs of the empty blocks; when they are sparse, store the reduce IDs of the non-empty ones.
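As a sanity check on the long[] arithmetic above, a minimal Scala sketch (the object and method names are illustrative, not from Spark):

```scala
// Sanity check: a bitmap over N blocks is backed by ceil(N / 64) longs,
// independent of how many bits are actually set.
object BitSetFootprint {
  // Number of 64-bit words backing a bitmap of `numBits` bits.
  def wordsFor(numBits: Int): Int = (numBits + 63) / 64

  // Approximate heap bytes for the backing long[] (array header ignored).
  def bytesFor(numBits: Int): Long = wordsFor(numBits).toLong * 8L

  def main(args: Array[String]): Unit = {
    println(wordsFor(200000)) // 3125, i.e. BitSet[200k] = long[3125]
    println(bytesFor(200000)) // 25000 bytes, paid even when almost every block is empty
  }
}
```

This is the fixed cost a HashSet[Int] of the minority block IDs can undercut when the distribution is skewed enough.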

      For sparse cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks (storing the non-empty block IDs).
      For dense cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks (storing the empty block IDs).
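The selection rule can be sketched as follows. The class names come from this issue, but the per-entry cost assumed for HashSet[Int] (~48 bytes) and the size accounting are illustrative assumptions, not Spark's actual estimator:

```scala
// Hedged sketch of choosing a MapStatus representation by estimated size.
// Class names are from this issue; byte costs are rough assumptions.
object MapStatusFormatChooser {
  sealed trait Format
  case object MapStatusTrackingEmptyBlocks extends Format   // dense: store empty-block ids
  case object MapStatusTrackingNoEmptyBlocks extends Format // sparse: store non-empty-block ids
  case object PlainBitSet extends Format                    // neither HashSet beats the bitmap

  // Bitmap cost: one bit per block, rounded up to whole 64-bit words.
  def bitSetBytes(totalBlocks: Int): Long = ((totalBlocks + 63) / 64).toLong * 8L

  // Assumed HashSet[Int] cost: ~48 bytes per entry (boxed Integer + table node).
  def hashSetBytes(ids: Int): Long = ids.toLong * 48L

  def choose(totalBlocks: Int, numNonEmpty: Int): Format = {
    val numEmpty = totalBlocks - numNonEmpty
    val bitmap = bitSetBytes(totalBlocks)
    if (numNonEmpty >= numEmpty && hashSetBytes(numEmpty) < bitmap)
      MapStatusTrackingEmptyBlocks
    else if (numNonEmpty < numEmpty && hashSetBytes(numNonEmpty) < bitmap)
      MapStatusTrackingNoEmptyBlocks
    else
      PlainBitSet
  }
}
```

For example, with 200k blocks of which only 100 are non-empty, the HashSet of non-empty IDs (~4.8 KB under this assumption) beats the 25 KB bitmap, so the sparse variant is chosen; with 100 empty blocks out of 200k, the dense variant wins.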

      Sparse case (299 of every 300 blocks are empty):
      sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)

      Dense case (no block is empty):
      sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)


          People

            Assignee: Kent Yao
            Reporter: Kent Yao 2
            Votes: 0
            Watchers: 8
