[SPARK-23206] Additional Memory Tuning Metrics - ASF JIRA

Details

Type: Umbrella
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.2.1
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned – cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably.

Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions.
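As a minimal sketch of the two formulas quoted above, the following computes the used-memory fraction and the unused memory for a single application. The function name and the sample numbers are hypothetical and chosen only for illustration; they are not the cluster figures reported in this ticket.

```python
def executor_memory_usage(num_executors, executor_memory_gb, max_jvm_used_gb):
    """Return (used_fraction, unused_gb) for one application.

    used_fraction = max JVM used memory across executors / spark.executor.memory
    unused_gb     = number of executors * (spark.executor.memory - max JVM used memory)
    """
    used_fraction = max_jvm_used_gb / executor_memory_gb
    unused_gb = num_executors * (executor_memory_gb - max_jvm_used_gb)
    return used_fraction, unused_gb


# Hypothetical application: 100 executors with spark.executor.memory=8g,
# where the largest JVM used memory seen on any executor was 2 GB.
used_fraction, unused_gb = executor_memory_usage(100, 8.0, 2.0)
print(f"used: {used_fraction:.0%}, unused: {unused_gb:.0f} GB")
# prints "used: 25%, unused: 600 GB"
```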

To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver:

JVM used memory: the JVM heap size for the executor/driver.
Execution memory: memory used for computation in shuffles, joins, sorts and aggregations.
Storage memory: memory used for caching and propagating internal data across the cluster.
Unified memory: sum of execution and storage memory.

The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory intensive, and users can look into the relevant code to determine if it can be optimized.
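To illustrate how peak unified memory could inform spark.memory.fraction and spark.memory.storageFraction, here is a rough sketch. It assumes the sizing rule described in the Spark tuning guide as we understand it: the unified region is (heap - 300 MiB) * spark.memory.fraction, with spark.memory.storageFraction of that region protected for storage, and defaults of 0.6 and 0.5. The function names and sample values are hypothetical.

```python
RESERVED_MIB = 300  # assumed: memory set aside before spark.memory.fraction is applied


def unified_region_mib(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified_mib, storage_mib) configured for a given heap size."""
    unified = (heap_mib - RESERVED_MIB) * memory_fraction
    storage = unified * storage_fraction
    return unified, storage


def unified_headroom(heap_mib, peak_unified_mib, memory_fraction=0.6):
    """Fraction of the configured unified region that was never used at peak."""
    unified, _ = unified_region_mib(heap_mib, memory_fraction)
    return 1.0 - peak_unified_mib / unified


# Hypothetical executor: 4396 MiB heap, observed peak unified memory of 614.4 MiB.
unified, storage = unified_region_mib(4396)
headroom = unified_headroom(4396, 614.4)
print(f"unified: {unified:.1f} MiB, storage: {storage:.1f} MiB, headroom: {headroom:.0%}")
```

A consistently large headroom across runs would suggest that spark.memory.fraction (or spark.executor.memory itself) could be lowered, while a peak close to the configured region would suggest the opposite.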

The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging.
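The flow described above can be sketched roughly as follows. This is not the Spark implementation: PeakMemoryTracker, its method names, and the record layout are hypothetical stand-ins for the modified heartbeat handling and SparkListener logic. The sketch keeps only the running peak per stage per executor and emits one record per executor when a stage completes, which is the "only interesting values" idea.

```python
from collections import defaultdict

METRICS = ("jvm_used", "execution", "storage", "unified")


class PeakMemoryTracker:
    """Hypothetical listener-side state: peak metrics per (stage, executor)."""

    def __init__(self):
        # (stage_id, executor_id) -> {metric name: peak value in bytes}
        self._peaks = defaultdict(lambda: dict.fromkeys(METRICS, 0))

    def on_heartbeat(self, executor_id, active_stage_ids, jvm_used, execution, storage):
        """Fold one heartbeat sample into the peaks of every active stage."""
        sample = {
            "jvm_used": jvm_used,
            "execution": execution,
            "storage": storage,
            # Unified memory is the sum of execution and storage at this instant.
            "unified": execution + storage,
        }
        for stage_id in active_stage_ids:
            peaks = self._peaks[(stage_id, executor_id)]
            for name in METRICS:
                peaks[name] = max(peaks[name], sample[name])

    def on_stage_completed(self, stage_id):
        """Return one record per executor for the finished stage, then forget it."""
        keys = sorted(k for k in self._peaks if k[0] == stage_id)
        return [{"stage": stage_id, "executor": k[1], **self._peaks.pop(k)} for k in keys]


tracker = PeakMemoryTracker()
tracker.on_heartbeat("exec-1", [0], jvm_used=500, execution=100, storage=50)
tracker.on_heartbeat("exec-1", [0], jvm_used=400, execution=20, storage=200)
records = tracker.on_stage_completed(0)
print(records)
```

Note that peak unified memory is the maximum of the per-sample sums (220 here), not the sum of the separate execution and storage peaks (300), since the two peaks need not occur at the same time.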

We have attached our design documentation to this ticket and would like to receive feedback from the community on this proposal.

Attachments

StageTab.png	31/Jan/18 21:55	190 kB	Edward Lu
SPARK-23206 Design Doc.pdf	16/Apr/18 02:39	88 kB	Edward Lu
MemoryTuningMetricsDesignDoc.pdf	24/Jan/18 22:52	68 kB	Edward Lu
ExecutorsTab2.png	09/Feb/18 07:56	45 kB	Lantao Jin
ExecutorsTab.png	27/Jan/18 01:01	162 kB	Edward Lu

Issue Links

is a parent of

SPARK-23429	Add executor memory metrics to heartbeat and expose in executors REST API	Resolved

is related to

SPARK-21157	Report Total Memory Used by Spark Executors	Closed
SPARK-9103	Tracking spark's memory usage	Resolved

relates to

SPARK-26329	ExecutorMetrics should poll faster than heartbeats	Resolved

Sub-Tasks

1.	Add executor memory metrics to heartbeat and expose in executors REST API	Resolved	Edward Lu
2.	Expose the new executor memory metrics at the stage level	Resolved	Terry Kim
3.	Expose executor memory metrics in the web UI for executors	Resolved	Zhongwei Zhu
4.	Add executors' process tree total memory information to heartbeat signals	Resolved	Reza Safi
5.	Add GC information to ExecutorMetrics	Resolved	Lantao Jin
6.	Expose executor memory metrics at the stage level, in the Stages tab	Resolved	angerszhu
7.	Define query parameters to support various filtering conditions in REST API for overall stages	Resolved	angerszhu
8.	Add Executor level metrics to monitoring docs	Resolved	Lantao Jin
9.	TaskEnd event with zero Executor Metrics when task duration less than poll interval	In Progress	Unassigned
10.	Expose executor memory metrics at the task detail, in the Stages tab	Resolved	Unassigned
11.	Add new executor metrics summary REST APIs and parameters	In Progress	Unassigned
12.	Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage	Resolved	angerszhu
13.	Send ExecutorMetricsUpdate EventLog appropriately	Resolved	Apache Spark
14.	Fix code close issue in monitoring.md	Resolved	angerszhu

People

Assignee: Unassigned

Reporter: Edward Lu

Votes: 6

Watchers: 46

Dates

Created: 24/Jan/18 22:51

Updated: 23/Jun/20 12:55