SPARK-12516

Properly handle NM failure situation for Spark on Yarn


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      Failure of a NodeManager will make all the executors belonging to that NM exit silently.

      Currently in the implementation of YarnSchedulerBackend, the driver receives an onDisconnect event when an executor is lost, and then asks the AM for the loss reason. The AM holds this query connection until the RM reports back the status of the lost container, and then replies to the driver. In the case of an NM failure, the RM cannot detect the failure until the NM heartbeat times out (10 minutes by default), so the driver's query for the loss reason times out first (120 seconds). After that timeout the executor state on the driver side is cleaned up, but on the AM side the state is still maintained until the NM heartbeat timeout. This can introduce some unexpected behaviors:

      • When dynamic allocation is disabled, the executor count on the driver side is less than the count on the AM side during the window between the two timeouts (from 120 seconds to 10 minutes), and cannot be ramped up to the expected number until the RM detects the NM failure and marks the related containers as completed.

      For example, suppose the target executor number is 10, spread across 5 NMs (2 executors per NM). When 1 NM fails, its 2 executors are lost. After the driver-side query times out, the executor count on the driver side is 8, but on the AM side it is still 10, so the AM will not request additional containers until its own count also drops to 8 (after 10 minutes).

      • When dynamic allocation is enabled, the target executor number is maintained on both the driver and AM sides and synced between them. The target number on the driver side becomes correct after the driver query timeout (120 seconds), but the number on the AM side remains incorrect until the NM failure is detected (10 minutes). In this case the actual executor number is less than the calculated one.

      For example, suppose the current target executor number on the driver side is N and on the AM side is M, so M - N is the number of lost executors.

      When the executor number needs to ramp up to A, the actual number will be A - (M - N).

      When the executor number needs to be brought down to B, the actual number will be max(0, B - (M - N)). When the actual number of executors reaches 0, the whole application hangs, and will only recover if the driver requests more resources, or after the 10-minute timeout. The sketch below illustrates this arithmetic.
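      The following is a minimal sketch (plain Scala, not Spark code) of the mismatch arithmetic above; all names (driverTarget, amTarget, etc.) and the sample numbers are hypothetical and only illustrate the formulas:

      object ExecutorCountMismatch {
        // M - N: executors lost with the failed NM, already removed from the
        // driver's bookkeeping but still counted on the AM side.
        def lostExecutors(amTarget: Int, driverTarget: Int): Int =
          amTarget - driverTarget

        // Ramping up to A: the AM still counts the lost executors, so only
        // A - (M - N) executors actually end up running.
        def actualAfterRampUp(a: Int, amTarget: Int, driverTarget: Int): Int =
          a - lostExecutors(amTarget, driverTarget)

        // Bringing down to B: the actual number can bottom out at 0, which
        // hangs the application until the NM heartbeat timeout fires.
        def actualAfterRampDown(b: Int, amTarget: Int, driverTarget: Int): Int =
          math.max(0, b - lostExecutors(amTarget, driverTarget))

        def main(args: Array[String]): Unit = {
          // Hypothetical ramp-up: driver already dropped 2 lost executors
          // (N = 8), AM still thinks M = 10, application asks for A = 10.
          println(actualAfterRampUp(a = 10, amTarget = 10, driverTarget = 8))   // 8 instead of 10

          // Hypothetical ramp-down: N = 1, M = 3, bring the count down to B = 1.
          println(actualAfterRampDown(b = 1, amTarget = 3, driverTarget = 1))   // 0: application hangs
        }
      }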

      This can be reproduced by running the SparkPi example in yarn-client mode with the following configurations:

      spark.dynamicAllocation.enabled true
      spark.shuffle.service.enabled true
      spark.dynamicAllocation.minExecutors 1
      spark.dynamicAllocation.initialExecutors 2
      spark.dynamicAllocation.maxExecutors 3

      In the middle of the job, kill one NM that is running only executors (not the AM).
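      For reference, one possible way to launch this reproduction (assuming a standard Spark 1.6.0 distribution with the external shuffle service already running on every NM; the examples jar path is a placeholder for your build):

      ./bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master yarn \
        --deploy-mode client \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.dynamicAllocation.minExecutors=1 \
        --conf spark.dynamicAllocation.initialExecutors=2 \
        --conf spark.dynamicAllocation.maxExecutors=3 \
        lib/spark-examples-1.6.0-hadoop2.6.0.jar \
        10000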

      Possible solutions:

      • Sync the actual executor number from the driver to the AM after the RPC timeout (120 seconds), and also clean up the related state on the AM side (see the rough sketch below).
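      A very rough sketch of this idea, using hypothetical message and handler names (this is not the existing YarnSchedulerBackend / YarnAllocator API, only an illustration of the proposed sync):

      // Hypothetical RPC message sent by the driver to the AM once the
      // loss-reason query has timed out and the driver has already removed
      // the executors from its own bookkeeping.
      case class SyncLostExecutors(lostExecutorIds: Seq[String])

      // Hypothetical AM-side handler that drops the stale executor state so
      // the AM's counts agree with the driver again and new containers can be
      // requested immediately, instead of waiting for the NM heartbeat timeout.
      class AmExecutorState {
        private val runningExecutorIds = scala.collection.mutable.Set.empty[String]

        def handleSync(msg: SyncLostExecutors): Unit = {
          msg.lostExecutorIds.foreach(runningExecutorIds -= _)
        }
      }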


          People

            Assignee: Unassigned
            Reporter: Saisai Shao (jerryshao)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: