[SPARK-14209] Application failure during preemption. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.1
Fix Version/s: 1.6.3, 2.0.1, 2.1.0
Component/s: Block Manager, Spark Core
Labels: None
Environment: Spark on YARN

Description

We have a fair-sharing cluster set up, including the external shuffle service. When a new job arrives, existing jobs are successfully preempted down to fit.

A spate of these messages arrives:
ExecutorLostFailure (executor 48 exited unrelated to the running tasks) Reason: Container container_1458935819920_0019_01_000143 on host: ip-10-12-46-235.us-west-2.compute.internal was preempted.

This seems fine. The problem is that soon afterward our whole application fails because it cannot fetch blocks from the preempted containers:

org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 locations. Most recent failure cause:
Caused by: java.io.IOException: Failed to connect to ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
Caused by: java.net.ConnectException: Connection refused: ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681

Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf

Spark does not attempt to recreate these blocks; the tasks simply fail over and over until the maximum number of task attempts (spark.task.maxFailures) is reached.

There appears to be a fault in how preempted containers are handled: shouldn't these blocks be recreated on demand?
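The difference between the observed behavior and the requested one can be illustrated with a small toy model. This is only a sketch and not Spark's actual code: the names `fetch_remote`, `get_or_compute`, `run_task`, and `BlockFetchError`, the block ID `rdd_7_3`, and the dictionary-based block store are all hypothetical. The default of 4 attempts mirrors the default of spark.task.maxFailures.

```python
class BlockFetchError(Exception):
    """Raised when a block cannot be read from any known location."""


def fetch_remote(block_id, locations, live_executors, store):
    """Read the block from the first location whose executor is still alive."""
    for executor in locations:
        if executor in live_executors:
            return store[(executor, block_id)]
    raise BlockFetchError(
        f"Failed to fetch block {block_id} from {len(locations)} locations"
    )


def get_or_compute(block_id, locations, live_executors, store, recompute):
    """Requested behavior: fall back to recomputation if the remote read fails."""
    try:
        return fetch_remote(block_id, locations, live_executors, store)
    except BlockFetchError:
        return recompute(block_id)


def run_task(read_block, max_failures=4):
    """Retry the task; re-raise once max_failures attempts have failed."""
    for attempt in range(1, max_failures + 1):
        try:
            return read_block(), attempt
        except BlockFetchError:
            if attempt == max_failures:
                raise


# Hypothetical scenario: executor 48 held the only copy and was preempted.
store = {("executor-48", "rdd_7_3"): [1, 2, 3]}
locations = ["executor-48"]
live_executors = set()


def recompute(block_id):
    """Stand-in for recomputing the partition from its lineage."""
    return [1, 2, 3]
```

With this setup, `run_task(lambda: fetch_remote("rdd_7_3", locations, live_executors, store))` fails on every attempt and finally raises, which matches the reported failure. Wrapping the read in `get_or_compute` instead succeeds on the first attempt, because a stale location is treated as a cache miss rather than a task failure.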

Issue Links

is related to: SPARK-17485 Failed remote cached block reads can lead to whole job failure (Resolved)

People

Assignee: Josh Rosen

Reporter: Miles Crawford

Votes: 1

Watchers: 8

Dates

Created: 28/Mar/16 17:48

Updated: 17/May/20 18:21

Resolved: 22/Sep/16 18:17