[NEMO-137] Retry parent task(s) upon task INPUT_READ_FAILURE - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Labels:
- runtime

Description

https://issues.apache.org/jira/browse/NEMO-54

assumes INPUT_READ_FAILURE happens only due to executor failures, where parent task(s) are guaranteed to be retried

Although probably less common, reading input data can fail for other reasons such as disk corruption, or storage node failure in case of using disaggregated storage for maintaining intermediate data. In such cases, parent task(s) need to be retried as part of handling INPUT_READ_FAILURE. Otherwise, only the failed task will be retried over and over until the maximum task attempt limit is reached, and the job fails.

Implementing this feature is harder than it sounds. First, TaskExecutor should learn from DataFetchers exactly which block it failed to read, and send the information to the master to retry the responsible parent task. Unfortunately, it's a little tricky to get the block information due to the way our InputReader is designed. Second, we should be careful with block state transitions, especially when handling 'push' edges. For example, we may want to ignore transitions for IN_PROGRESS->NOT_AVAILABLE, which can happen when the parent task of a 'push' edge is retried with EXECUTOR_REMOVED, before the INPUT_READ_FAILURE event arrives.

Attachments

Issue Links

is blocked by

NEMO-144 Improve Data Plane Code

Open

is related to

NEMO-418 BlockFetchFailureProperty

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: John Yang

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 30/Jun/18 07:31

Updated:: 26/Aug/19 07:14