Spark / SPARK-385

Changes to make SizeEstimator more accurate


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Fixed

    Description

      Motivation:
      This patch is motivated by an observation that the amount of heap space used by
      the BoundedMemoryCache is often much larger than what we account for.
      For example, running with 10 files from the RITA dataset [1] and 4 GB for
      the cache, we see the currBytes variable report 3917.67 MB. However,
      analyzing the memory heap with Eclipse MAT [2] shows that the
      BoundedMemoryCache in fact occupies 4360.55 MB.

      Changes made:
      This patch tries to address the discrepancy by making some changes to the
      SizeEstimator. The object size and reference size are now chosen based on
      the architecture in use and on whether or not CompressedOops are in use by
      the JVM. This results in the object size changing from 8 bytes to 12 or 16
      bytes, and references being either 4 or 8 bytes long. We also account for
      the fact that arrays have an object header plus an int for the length.
      Lastly, this patch also accounts for the fact that fields and objects are
      aligned to 8-byte boundaries by the JVM. Changes are based on information
      from [3, 4, 5]. A sketch of these rules is given below.
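
      To make the rules above concrete, here is a minimal sketch in Scala. The
      object and member names are illustrative, not the actual SizeEstimator
      fields, and the CompressedOops flag is hard-coded here (detecting it is
      sketched under Caveats below):

      ```scala
      // Illustrative sizing rules; not the actual SizeEstimator code.
      object SizeRules {
        // Assumption: os.arch is a good enough proxy for pointer width.
        val is64bit: Boolean = System.getProperty("os.arch").contains("64")

        // Whether -XX:+UseCompressedOops is active; hard-coded for this sketch.
        val compressedOops: Boolean = true

        // Object header: 8 bytes on 32-bit JVMs, 12 bytes on 64-bit with
        // compressed oops, 16 bytes on 64-bit without them.
        val objectSize: Int =
          if (!is64bit) 8 else if (compressedOops) 12 else 16

        // References: 4 bytes on 32-bit or with compressed oops, 8 otherwise.
        val pointerSize: Int = if (is64bit && !compressedOops) 8 else 4

        // Arrays carry the object header plus a 4-byte int for the length.
        val arrayHeaderSize: Int = objectSize + 4

        // The JVM rounds each object's size up to an 8-byte boundary.
        def alignSize(size: Long): Long = (size + 7) & ~7L
      }
      ```

      For example, on a 64-bit JVM with compressed oops, a bare Object takes
      alignSize(12) = 16 bytes and an empty int array takes alignSize(12 + 4) =
      16 bytes.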

      Tests:
      Changes are verified by comparing the results from spark.SizeEstimator.estimate
      to those from MAT. An example can be found in
      https://github.com/shivaram/spark/tree/size-estimate-test/res where the first
      100 lines from the 1990 dataset were used. The file spark-err-log.txt shows
      that the size estimate was 25000 bytes, which matches the size of the
      BoundedMemoryCache hashmap entry found in spark-bounded-memory.txt.

      Also, a simple script that can be used to run such a test with any text file
      can be found in the `size-estimate-test` tree as `run_size_estimate_test`.
      With this patch and the original dataset from the motivation, we get an
      estimate of 3878.81 MB, while MAT reports memory usage of 3879.61 MB. The
      difference is explained below.
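
      For reference, a call like the following is what the comparison above boils
      down to. The file name is a placeholder, and the signature assumed is
      spark.SizeEstimator.estimate(obj: AnyRef): Long, as used in the test linked
      above:

      ```scala
      import scala.io.Source

      // Placeholder file name; any text file from the dataset works.
      val lines: Array[String] =
        Source.fromFile("1990.csv").getLines().take(100).toArray

      // Estimate the in-memory footprint of the cached lines, in bytes,
      // and compare the printed number against what MAT reports.
      val estimated: Long = spark.SizeEstimator.estimate(lines)
      println("Estimated size: " + estimated + " bytes")
      ```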

      Caveats:
      Arrays are still sampled during estimation and this could lead to small
      variations, as seen in the example above. Also, the patch uses the HotSpot
      diagnostic MXBean to figure out whether compressed oops are being used by
      the JVM, and this may fail on non-HotSpot JVMs. Finally, the patch has been
      tested only on 64-bit HotSpot JVMs (1.6.0_24 and 1.7.0_147-icedtea). I
      don't have access to a 32-bit machine to test it, but we can do it on EC2.
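
      For the MXBean check, a minimal sketch (assuming the HotSpot-specific
      com.sun.management.HotSpotDiagnosticMXBean interface is available; callers
      need a fallback on other JVMs):

      ```scala
      import java.lang.management.ManagementFactory
      import com.sun.management.HotSpotDiagnosticMXBean

      // Ask the HotSpot diagnostic bean whether -XX:+UseCompressedOops is on.
      // This throws on JVMs that do not expose the bean, so callers should
      // catch and fall back to a sensible default.
      def isCompressedOops: Boolean = {
        val server = ManagementFactory.getPlatformMBeanServer
        val bean = ManagementFactory.newPlatformMXBeanProxy(
          server,
          "com.sun.management:type=HotSpotDiagnostic",
          classOf[HotSpotDiagnosticMXBean])
        bean.getVMOption("UseCompressedOops").getValue == "true"
      }
      ```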

      [1] http://stat-computing.org/dataexpo/2009/the-data.html
      [2] http://eclipse.org/mat
      [3] https://wikis.oracle.com/display/HotSpotInternals/CompressedOops
      [4] http://kohlerm.blogspot.com/2008/12/how-much-memory-is-used-by-my-java.html
      [5] http://lingpipe-blog.com/2010/06/22/the-unbearable-heaviness-jav-strings/


          People

            Assignee: Unassigned
            Reporter: Shivaram Venkataraman
