  1. Apache Nemo
  2. NEMO-50

Carefully retry tasks in the scheduler

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.1

    Description

      An executor failure causes the loss of its local data blocks (e.g., those held in LocalFileStore or MemoryStore) and interrupts the tasks that were running at the time of the failure. The tasks that produced the lost blocks and the tasks that were interrupted then become eligible for re-execution.

      Given this situation, the scheduler should figure out (1) which tasks really need to be re-executed, and (2) in what order they should be re-executed. For example, if all downstream tasks of a lost block have completed and their outputs are safe, then we don't need to retry the producer of that lost block. We should also retry tasks in the order of their dependencies to prevent the deadlock situation where executor slots are filled with downstream tasks waiting for upstream tasks that are waiting for an available slot.
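
      The two decisions described above can be illustrated with a minimal sketch. It is not Nemo's scheduler code: the class RetryPlanner, the method plan, and the string task IDs are hypothetical names chosen for illustration. It assumes the task graph is a DAG, that each task is either incomplete (interrupted or not yet run) or complete, and that a complete task's output is either safe or lost. Starting from the incomplete tasks, it walks upstream and pulls in a producer only when a task that must run needs its lost output, and it emits every task after its upstream tasks.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not part of Apache Nemo.
class RetryPlanner {

    /**
     * @param parents    task ID -> IDs of the upstream tasks whose blocks it reads (assumed acyclic)
     * @param incomplete tasks that were interrupted or have not run yet
     * @param lostOutput completed tasks whose output blocks were lost
     * @return tasks to (re-)execute, every task listed after its required upstream tasks
     */
    static List<String> plan(Map<String, List<String>> parents,
                             Set<String> incomplete,
                             Set<String> lostOutput) {
        // LinkedHashSet keeps insertion order, which is the execution order.
        LinkedHashSet<String> ordered = new LinkedHashSet<>();
        // TreeSet only makes the iteration order deterministic.
        for (String task : new TreeSet<>(incomplete)) {
            visit(task, parents, incomplete, lostOutput, ordered);
        }
        return new ArrayList<>(ordered);
    }

    private static void visit(String task,
                              Map<String, List<String>> parents,
                              Set<String> incomplete,
                              Set<String> lostOutput,
                              LinkedHashSet<String> ordered) {
        if (ordered.contains(task)) {
            return; // already planned
        }
        for (String parent : parents.getOrDefault(task, List.of())) {
            // A parent must run first only if its output is unavailable:
            // it has not finished, or it finished but its blocks were lost.
            if (incomplete.contains(parent) || lostOutput.contains(parent)) {
                visit(parent, parents, incomplete, lostOutput, ordered);
            }
        }
        // Post-order insertion: all required upstream tasks are already in the list.
        ordered.add(task);
    }

    public static void main(String[] args) {
        // A -> B -> C and A -> D. The blocks of A and B were lost,
        // C completed with safe output, D was interrupted.
        Map<String, List<String>> parents = Map.of(
                "B", List.of("A"),
                "C", List.of("B"),
                "D", List.of("A"));
        System.out.println(plan(parents, Set.of("D"), Set.of("A", "B"))); // prints [A, D]
    }
}
```

      In this example B is not retried even though its blocks were lost, because its only consumer C has completed and its output is safe. A is retried, and it is placed before D so that D never occupies an executor slot while waiting for an upstream task that cannot be scheduled.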

            People

              Assignee: johnyangk John Yang
              Reporter: johnyangk John Yang