  1. Spark
  2. SPARK-18231

Optimise SizeEstimator implementation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.6.2, 2.0.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The SizeEstimator is used in Spark to determine whether or not we need to spill. Spilling typically has an adverse impact on performance, so it is something we want to minimise.

      We can improve the implementation of SizeEstimator in a variety of ways to gain a performance increase and, ultimately, a reduction in footprint by spilling less.

      There are two phases involved here:

      1) Refactor to use more efficient data structures, avoid some (expensive) reflection calls, remove the use of ScalaRunTime.array_apply, use ThreadLocalRandom, store an array of field offsets instead of a list of pointer fields, and improve the performance of the sample method.

      2) Add JDK-specific specialisations that use exact object sizes, reducing overestimation for both OpenJDK/Oracle JDK users and IBM Java users. With a more accurate estimator we can spill less, which lowers footprint and raises performance: we have observed a 15% reduction in RDD sizes, leading to potentially double-digit performance gains on HiBench and micro benchmarks.
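
      Two of the phase 1 ideas, drawing random indices from ThreadLocalRandom and reading array elements directly rather than through a generic reflective accessor, can be illustrated with a small sampling sketch. This is not the code from the pull request: the class name, the thresholds and the caller-supplied size function are all hypothetical, and the sketch samples with replacement for simplicity.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.ToLongFunction;

public class SampledArraySize {
    // Hypothetical thresholds chosen for this sketch only.
    static final int SAMPLING_THRESHOLD = 400;
    static final int SAMPLE_SIZE = 100;

    /**
     * Estimates the combined size of all elements of a typed array.
     * Small arrays are measured exactly; large arrays are sampled and
     * the sampled total is extrapolated to the full length.
     */
    static <T> long estimate(T[] array, ToLongFunction<T> sizeOf) {
        int n = array.length;
        if (n <= SAMPLING_THRESHOLD) {
            long total = 0;
            for (T element : array) {
                total += sizeOf.applyAsLong(element);
            }
            return total;
        }
        // ThreadLocalRandom avoids contention on a shared Random instance.
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        long sampled = 0;
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            // Direct indexed access on the typed array, no reflection involved.
            sampled += sizeOf.applyAsLong(array[rng.nextInt(n)]);
        }
        // Scale the sampled total up to the full array length.
        return Math.round((double) sampled / SAMPLE_SIZE * n);
    }

    public static void main(String[] args) {
        String[] values = new String[10_000];
        Arrays.fill(values, "abcdefgh");
        // Hypothetical size function: 2 bytes per character, ignoring headers.
        long estimated = estimate(values, s -> 2L * s.length());
        System.out.println("estimated bytes: " + estimated);
    }
}
```

      Because every element in this usage example has the same size, the sampled estimate matches the exact total; with mixed element sizes the result is only an approximation, which is the trade-off any sampling estimator makes.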

          Activity

          apachespark Apache Spark added a comment -

          User 'a-roberts' has created a pull request for this issue:
          https://github.com/apache/spark/pull/16196


            People

            • Assignee:
              Unassigned
              Reporter:
              aroberts Adam Roberts
            • Votes:
              0
              Watchers:
              2
