[SPARK-17485] Failed remote cached block reads can lead to whole job failure - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.6.2, 2.0.0
Fix Version/s: 1.6.3, 2.0.1, 2.1.0
Component/s: Block Manager, Spark Core
Labels:
None

Target Version/s:

1.6.3, 2.0.1

Description

In Spark's RDD.getOrCompute we first try to read a local copy of a cached block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there are remote copies but reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a BlockFetchException error that fails the entire job.

In the case of torrent broadcast we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for cached blocks because those can/should be recomputed.

Therefore, I think that this exception should be thrown higher up the call stack by the BlockManager client code and not the block manager itself.

Attachments

Issue Links

relates to

SPARK-14209 Application failure during preemption.

Resolved

SPARK-17484 Race condition when cancelling a job during a cache write can lead to block fetch failures

Resolved

links to

[Github] Pull Request #15037 (JoshRosen)

[Github] Pull Request #15186 (JoshRosen)

Activity

People

Assignee:: Josh Rosen

Reporter:: Josh Rosen

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 09/Sep/16 21:31

Updated:: 17/May/20 18:20

Resolved:: 22/Sep/16 18:06