[SPARK-13369] Number of consecutive fetch failures for a stage before the job is aborted should be configurable - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 2.2.0
Component/s: Spark Core
Labels: None

Description

The previously hardcoded maximum of 4 retries per stage is not suitable for all cluster configurations. Because Spark retries a stage at the first sign of a fetch failure, it can easily take many stage retries to discover all the failures. Two scenarios in particular call for a different value: (1) when there are more than 4 executors per node, in which case it may take 4 retries to discover the problem with each executor on the node, and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once but total cluster downtime is not affordable. Making this value configurable lets cluster managers tune it to suit their cluster configuration.
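To make the first scenario concrete, here is a minimal worst-case sketch in Python. It is not Spark's scheduler code. It assumes each failed stage attempt exposes exactly one bad executor and that the job is aborted once the number of consecutive failed attempts reaches the limit. The names `simulate_stage` and `max_consecutive_attempts` are hypothetical and do not come from this issue.

```python
from typing import Tuple


def simulate_stage(bad_executors: int, max_consecutive_attempts: int = 4) -> Tuple[bool, int]:
    """Worst-case model of stage retries under fetch failures.

    Assumption (not taken from Spark's source): every failed attempt
    reveals exactly one bad executor, and the job is aborted once the
    count of consecutive failed attempts reaches the configured limit.

    Returns (succeeded, attempts_used).
    """
    remaining_bad = bad_executors
    consecutive_failures = 0
    attempts = 0
    while True:
        attempts += 1
        if remaining_bad == 0:
            # No undiscovered bad executor is left, so this attempt succeeds.
            return True, attempts
        # This attempt hits a fetch failure and uncovers one bad executor.
        remaining_bad -= 1
        consecutive_failures += 1
        if consecutive_failures >= max_consecutive_attempts:
            # Limit reached: the job is aborted.
            return False, attempts


if __name__ == "__main__":
    # Four bad executors on one node exhaust a limit of 4.
    print(simulate_stage(4))
    # The same failure pattern survives with a higher limit.
    print(simulate_stage(4, max_consecutive_attempts=8))
```

Under this model, a node with 4 failing executors aborts the job at the limit of 4, while the same node is tolerated once the limit is raised, which is the tuning the issue asks for.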

Issue Links

links to

[Github] Pull Request #11254 (sitalkedia)

[Github] Pull Request #17307 (sitalkedia)

People

Assignee: Sital Kedia

Reporter: Sital Kedia

Votes: 1

Watchers: 4

Dates

Created: 18/Feb/16 04:15

Updated: 17/Mar/17 14:44

Resolved: 17/Mar/17 14:35