Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
https://issues.apache.org/jira/browse/NEMO-54
assumes INPUT_READ_FAILURE happens only due to executor failures, where parent task(s) are guaranteed to be retried
Although probably less common, reading input data can fail for other reasons such as disk corruption, or storage node failure in case of using disaggregated storage for maintaining intermediate data. In such cases, parent task(s) need to be retried as part of handling INPUT_READ_FAILURE. Otherwise, only the failed task will be retried over and over until the maximum task attempt limit is reached, and the job fails.
Implementing this feature is harder than it sounds. First, TaskExecutor should learn from DataFetchers exactly which block it failed to read, and send the information to the master to retry the responsible parent task. Unfortunately, it's a little tricky to get the block information due to the way our InputReader is designed. Second, we should be careful with block state transitions, especially when handling 'push' edges. For example, we may want to ignore transitions for IN_PROGRESS->NOT_AVAILABLE, which can happen when the parent task of a 'push' edge is retried with EXECUTOR_REMOVED, before the INPUT_READ_FAILURE event arrives.