We should allow faster polling of the executor memory metrics (SPARK-23429 / SPARK-23206) without requiring a faster heartbeat rate. We've seen the memory usage of executors spike over 1 GB in less than a second, but heartbeats are only sent every 10 seconds (by default). Spark needs fast polling to capture these peaks without putting too much strain on the system.
In the current implementation, the metrics are polled along with the heartbeat, which leads to a slow default polling rate. If users increase the heartbeat rate, they risk overloading the driver on a large cluster with too many messages and too much metric-aggregation work. Instead, the executor could poll the metrics more frequently and still send only the max since the last heartbeat for each metric. This keeps the load on the driver the same and introduces only a small overhead on the executor to grab the metrics and keep the max.
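The poll-fast, report-peaks-per-heartbeat idea above can be sketched as follows. This is an illustrative standalone example, not Spark's actual implementation; the class and method names (PeakMetricPoller, record, peakAndReset, readMetric) are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: poll a memory metric at a fast interval on the
// executor, but only report the peak observed since the last heartbeat.
public class PeakMetricPoller {
    private final AtomicLong peakSinceHeartbeat = new AtomicLong(0);
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Fold a new sample into the running max for this heartbeat window.
    public void record(long value) {
        peakSinceHeartbeat.accumulateAndGet(value, Math::max);
    }

    // Poll every pollIntervalMs (e.g. 100 ms), much faster than the 10 s
    // heartbeat; each poll is just a metric read and a max update.
    public void start(long pollIntervalMs) {
        scheduler.scheduleAtFixedRate(() -> record(readMetric()),
            0, pollIntervalMs, TimeUnit.MILLISECONDS);
    }

    // Called on each heartbeat: report the window's peak and reset it,
    // so driver load stays the same as with heartbeat-rate polling.
    public long peakAndReset() {
        return peakSinceHeartbeat.getAndSet(0);
    }

    // Stand-in for reading a real executor memory metric.
    long readMetric() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```

A heartbeat thread would call peakAndReset() and attach the result to the heartbeat message, so a spike between heartbeats is still captured even though it is reported late.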
The downside of this approach is that the driver does not learn of a new peak until the next heartbeat; if the executor dies or is killed before then, the peak is lost. A potential future enhancement would be to send an update whenever a metric increases by some percentage, but we'll leave that out for now.
Another possibility would be to change the metrics themselves to track peaks for us, so we don't have to fine-tune the polling rate. For example, some JVM metrics provide a usage threshold and notifications: https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold
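To show what that JMX facility looks like (and its limitation), here is a small demo using the real MemoryPoolMXBean API. Note that usage thresholds are only supported on some pools, which the loop makes visible; the threshold value chosen here is arbitrary for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Demonstrates the JMX usage-threshold support mentioned above: the JVM
// itself flags threshold crossings, so no polling by us is needed -- but
// only for the pools that support it.
public class UsageThresholdDemo {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.isUsageThresholdSupported()) {
                // Pick a threshold below the pool's max (if one is defined)
                // so setUsageThreshold does not reject it.
                long max = pool.getUsage().getMax();
                long threshold = (max > 0) ? max / 2 : 100L * 1024 * 1024;
                pool.setUsageThreshold(threshold);
                // The JVM now counts crossings; a JMX notification listener
                // could also be registered instead of checking the count.
                System.out.println(pool.getName() + ": threshold set, crossings="
                    + pool.getUsageThresholdCount());
            } else {
                System.out.println(pool.getName() + ": usage threshold not supported");
            }
        }
    }
}
```

Running this typically shows that pools like the old generation support thresholds while others (e.g. eden) do not, which is exactly why this cannot replace a generic polling mechanism.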
But that is not available for all metrics. This proposal gives us a generic way to get a more accurate peak memory usage for all metrics.