Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: None
Description
As I noted in the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, using BitSet can save ≈20% memory compared to RoaringBitmap. For a Spark job containing a large number of tasks, though, 20% is a drop in the ocean. Essentially, BitSet is backed by a long[]: for example, a BitSet of 200k bits is a long[3125] (200,000 / 64 = 3,125 words).
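A minimal sketch of that size arithmetic (the helper below is illustrative only, not Spark code; it assumes an Array[Long]-backed bit set such as Spark's org.apache.spark.util.collection.BitSet):

def bitSetSizeBytes(numBits: Int): Long = {
  val numWords = (numBits + 63) / 64  // each Long word covers 64 bits
  numWords.toLong * 8L                // 8 bytes per Long word
}

bitSetSizeBytes(200000)  // 3125 words * 8 bytes = 25000 bytes, i.e. long[3125]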
So we can use a HashSet[Int] to store reduceIds instead: when the non-empty blocks are dense, store the reduceIds of the empty blocks; when they are sparse, store the non-empty ones.
For the dense case: if a HashSet[Int] holding the numEmptyBlocks empty reduceIds is smaller than a BitSet over totalBlockNum blocks, use MapStatusTrackingEmptyBlocks.
For the sparse case: if a HashSet[Int] holding the numNonEmptyBlocks non-empty reduceIds is smaller than a BitSet over totalBlockNum blocks, use MapStatusTrackingNoEmptyBlocks (a sketch of this selection follows).
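A minimal sketch of that selection logic, assuming the two MapStatus variants proposed above; the ~40 bytes per HashSet[Int] entry is an assumed estimate for a boxed Int on the JVM, and chooseEncoding is a hypothetical helper, not Spark code:

import scala.collection.mutable.HashSet

val bytesPerHashSetEntry = 40L  // assumed per-entry cost of a boxed Int; illustrative only

def hashSetBytes(numEntries: Int): Long = numEntries * bytesPerHashSetEntry
def bitSetBytes(numBits: Int): Long = ((numBits + 63) / 64).toLong * 8L

// Pick the cheaper encoding for a map task with totalBlockNum reduce blocks,
// given the set of reduceIds whose blocks are non-empty.
def chooseEncoding(totalBlockNum: Int, nonEmptyIds: HashSet[Int]): String = {
  val numEmptyBlocks = totalBlockNum - nonEmptyIds.size
  if (hashSetBytes(numEmptyBlocks) < bitSetBytes(totalBlockNum))
    "MapStatusTrackingEmptyBlocks"      // dense: few empty blocks, record those
  else if (hashSetBytes(nonEmptyIds.size) < bitSetBytes(totalBlockNum))
    "MapStatusTrackingNoEmptyBlocks"    // sparse: few non-empty blocks, record those
  else
    "BitSet"                            // neither set is small enough, keep the BitSet
}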
Sparse case (299/300 blocks are empty):
sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)
Dense case (no block is empty):
sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)
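To see why the first job is sparse: 30,000 records over 3,000 map partitions leaves about 10 records per map task, so each task can fill at most 10 of the 3,000 reduce blocks (≈299/300 empty). A quick spark-shell check of that, assuming the HashPartitioner that groupBy uses here:

// How many of the 3000 reduce blocks would one map task fill?
val rdd = sc.makeRDD(1 to 30000, 3000)
val partitioner = new org.apache.spark.HashPartitioner(3000)
rdd.mapPartitions { iter =>
  Iterator(iter.map(x => partitioner.getPartition(x)).toSet.size)
}.first()
// => 10 non-empty blocks out of 3000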
Issue Links
- duplicates SPARK-11271 MapStatus too large for driver (Closed)
- links to