Uploaded image for project: 'Apache Nemo'
  1. Apache Nemo
  2. NEMO-137

Retry parent task(s) upon task INPUT_READ_FAILURE

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None

    Description

      https://issues.apache.org/jira/browse/NEMO-54

      assumes INPUT_READ_FAILURE happens only due to executor failures, where parent task(s) are guaranteed to be retried 

      Although probably less common, reading input data can fail for other reasons such as disk corruption, or storage node failure in case of using disaggregated storage for maintaining intermediate data. In such cases, parent task(s) need to be retried as part of handling INPUT_READ_FAILURE. Otherwise, only the failed task will be retried over and over until the maximum task attempt limit is reached, and the job fails.

      Implementing this feature is harder than it sounds. First, TaskExecutor should learn from DataFetchers exactly which block it failed to read, and send the information to the master to retry the responsible parent task. Unfortunately, it's a little tricky to get the block information due to the way our InputReader is designed. Second, we should be careful with block state transitions, especially when handling 'push' edges. For example, we may want to ignore transitions for IN_PROGRESS->NOT_AVAILABLE, which can happen when the parent task of a 'push' edge is retried with EXECUTOR_REMOVED, before the INPUT_READ_FAILURE event arrives.

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              johnyangk John Yang
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: