Spark / SPARK-385

Changes to make SizeEstimator more accurate


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Fixed

    Description

      Motivation:
      This patch is motivated by an observation that the amount of heap space used by
      the BoundedMemoryCache is often much larger than what we account for.
      For example, running with 10 files from the RITA dataset [1] and 4 GB for
      the cache, we see the currBytes variable report 3917.67 MB. However,
      analyzing the memory heap with Eclipse MAT [2] shows that the
      BoundedMemoryCache in fact occupies 4360.55 MB.

      Changes made:
      This patch tries to address the discrepancy by making some changes to the
      SizeEstimator. The object size and reference size are now chosen based on
      the architecture in use and on whether or not CompressedOops are in use by
      the JVM. This results in the object size changing from 8 bytes to 12 or 16
      bytes, and references being either 4 or 8 bytes long. We also account for
      the fact that arrays have an object header plus an int for the length.
      Lastly, this patch also accounts for the fact that fields and objects are
      aligned to 8-byte boundaries by the JVM. Changes are based on information
      from [3, 4, 5]. A sketch of these rules is given below.
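
      To make the rules above concrete, here is a minimal sketch in Scala. The
      object and member names are illustrative, not the actual SizeEstimator
      fields, and the CompressedOops flag is hard-coded here (detecting it is
      sketched under Caveats below):

      ```scala
      // Illustrative sizing rules; not the actual SizeEstimator code.
      object SizeRules {
        // Assumption: os.arch is a good enough proxy for pointer width.
        val is64bit: Boolean = System.getProperty("os.arch").contains("64")

        // Whether -XX:+UseCompressedOops is active; hard-coded for this sketch.
        val compressedOops: Boolean = true

        // Object header: 8 bytes on 32-bit JVMs, 12 bytes on 64-bit with
        // compressed oops, 16 bytes on 64-bit without them.
        val objectSize: Int =
          if (!is64bit) 8 else if (compressedOops) 12 else 16

        // References: 4 bytes on 32-bit or with compressed oops, 8 otherwise.
        val pointerSize: Int = if (is64bit && !compressedOops) 8 else 4

        // Arrays carry the object header plus a 4-byte int for the length.
        val arrayHeaderSize: Int = objectSize + 4

        // The JVM rounds each object's size up to an 8-byte boundary.
        def alignSize(size: Long): Long = (size + 7) & ~7L
      }
      ```

      For example, on a 64-bit JVM with compressed oops, a bare Object takes
      alignSize(12) = 16 bytes and an empty int array takes alignSize(12 + 4) =
      16 bytes.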

      Tests:
      Changes are verified by comparing the results from spark.SizeEstimator.estimate
      to those from MAT. An example can be found in
      https://github.com/shivaram/spark/tree/size-estimate-test/res where the first
      100 lines from the 1990 dataset were used. The file spark-err-log.txt shows
      that the size estimate was 25000 bytes, which matches the size of the
      BoundedMemoryCache hashmap entry found in spark-bounded-memory.txt.

      Also, a simple script that can be used to run such a test with any text file
      can be found in the `size-estimate-test` tree as `run_size_estimate_test`.
      With this patch and the original dataset from the motivation, we get an
      estimate of 3878.81 MB, while MAT reports memory usage of 3879.61 MB. The
      difference is explained below.
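
      For reference, a call like the following is what the comparison above boils
      down to. The file name is a placeholder, and the signature assumed is
      spark.SizeEstimator.estimate(obj: AnyRef): Long, as used in the test linked
      above:

      ```scala
      import scala.io.Source

      // Placeholder file name; any text file from the dataset works.
      val lines: Array[String] =
        Source.fromFile("1990.csv").getLines().take(100).toArray

      // Estimate the in-memory footprint of the cached lines, in bytes,
      // and compare the printed number against what MAT reports.
      val estimated: Long = spark.SizeEstimator.estimate(lines)
      println("Estimated size: " + estimated + " bytes")
      ```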

      Caveats:
      Arrays are still sampled during estimation and this could lead to small
      variations, as seen in the example above. Also, the patch uses the HotSpot
      diagnostic MXBean to figure out whether compressed oops are being used by
      the JVM, and this may fail on non-HotSpot JVMs. Finally, the patch has been
      tested only on 64-bit HotSpot JVMs (1.6.0_24 and 1.7.0_147-icedtea). I
      don't have access to a 32-bit machine to test it, but we can do it on EC2.
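
      For the MXBean check, a minimal sketch (assuming the HotSpot-specific
      com.sun.management.HotSpotDiagnosticMXBean interface is available; callers
      need a fallback on other JVMs):

      ```scala
      import java.lang.management.ManagementFactory
      import com.sun.management.HotSpotDiagnosticMXBean

      // Ask the HotSpot diagnostic bean whether -XX:+UseCompressedOops is on.
      // This throws on JVMs that do not expose the bean, so callers should
      // catch and fall back to a sensible default.
      def isCompressedOops: Boolean = {
        val server = ManagementFactory.getPlatformMBeanServer
        val bean = ManagementFactory.newPlatformMXBeanProxy(
          server,
          "com.sun.management:type=HotSpotDiagnostic",
          classOf[HotSpotDiagnosticMXBean])
        bean.getVMOption("UseCompressedOops").getValue == "true"
      }
      ```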

      [1] http://stat-computing.org/dataexpo/2009/the-data.html
      [2] http://eclipse.org/mat
      [3] https://wikis.oracle.com/display/HotSpotInternals/CompressedOops
      [4] http://kohlerm.blogspot.com/2008/12/how-much-memory-is-used-by-my-java.html
      [5] http://lingpipe-blog.com/2010/06/22/the-unbearable-heaviness-jav-strings/


          People

            Assignee: Unassigned
            Reporter: Shivaram Venkataraman
