Spark / SPARK-11583

Make MapStatus use less memory


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      As noted in the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, using a BitSet can save roughly 20% of the memory that a RoaringBitmap uses.

      For a Spark job that contains a large number of tasks, however, a 20% saving is a drop in the ocean.

      Essentially, a BitSet is backed by a long[]: for example, a BitSet over 200k blocks occupies long[3125] (200000 bits / 64 bits per long), regardless of how many bits are set.
      So we can instead use a HashSet[Int] to store reduce IDs: when non-empty blocks are dense, store the reduce IDs of the empty blocks; when they are sparse, store the reduce IDs of the non-empty ones.
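As a sanity check on the long[] arithmetic above, a minimal Scala sketch (the object and method names are illustrative, not from Spark):

```scala
// Sanity check: a bitmap over N blocks is backed by ceil(N / 64) longs,
// independent of how many bits are actually set.
object BitSetFootprint {
  // Number of 64-bit words backing a bitmap of `numBits` bits.
  def wordsFor(numBits: Int): Int = (numBits + 63) / 64

  // Approximate heap bytes for the backing long[] (array header ignored).
  def bytesFor(numBits: Int): Long = wordsFor(numBits).toLong * 8L

  def main(args: Array[String]): Unit = {
    println(wordsFor(200000)) // 3125, i.e. BitSet[200k] = long[3125]
    println(bytesFor(200000)) // 25000 bytes, paid even when almost every block is empty
  }
}
```

This is the fixed cost a HashSet[Int] of the minority block IDs can undercut when the distribution is skewed enough.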

      For sparse cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks (storing the non-empty block IDs).
      For dense cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks (storing the empty block IDs).
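The selection rule can be sketched as follows. The class names come from this issue, but the per-entry cost assumed for HashSet[Int] (~48 bytes) and the size accounting are illustrative assumptions, not Spark's actual estimator:

```scala
// Hedged sketch of choosing a MapStatus representation by estimated size.
// Class names are from this issue; byte costs are rough assumptions.
object MapStatusFormatChooser {
  sealed trait Format
  case object MapStatusTrackingEmptyBlocks extends Format   // dense: store empty-block ids
  case object MapStatusTrackingNoEmptyBlocks extends Format // sparse: store non-empty-block ids
  case object PlainBitSet extends Format                    // neither HashSet beats the bitmap

  // Bitmap cost: one bit per block, rounded up to whole 64-bit words.
  def bitSetBytes(totalBlocks: Int): Long = ((totalBlocks + 63) / 64).toLong * 8L

  // Assumed HashSet[Int] cost: ~48 bytes per entry (boxed Integer + table node).
  def hashSetBytes(ids: Int): Long = ids.toLong * 48L

  def choose(totalBlocks: Int, numNonEmpty: Int): Format = {
    val numEmpty = totalBlocks - numNonEmpty
    val bitmap = bitSetBytes(totalBlocks)
    if (numNonEmpty >= numEmpty && hashSetBytes(numEmpty) < bitmap)
      MapStatusTrackingEmptyBlocks
    else if (numNonEmpty < numEmpty && hashSetBytes(numNonEmpty) < bitmap)
      MapStatusTrackingNoEmptyBlocks
    else
      PlainBitSet
  }
}
```

For example, with 200k blocks of which only 100 are non-empty, the HashSet of non-empty IDs (~4.8 KB under this assumption) beats the 25 KB bitmap, so the sparse variant is chosen; with 100 empty blocks out of 200k, the dense variant wins.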

      Sparse case (299 of every 300 blocks are empty):
      sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)

      Dense case (no block is empty):
      sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)


          People

            Assignee: Kent Yao
            Reporter: Kent Yao 2
            Votes: 0
            Watchers: 8
