  1. REEF
  2. REEF-1223

IMRU Fault Tolerance - restart failed evaluators

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16
    • Component/s: IMRU, REEF.NET
    • Labels:

      Description

      Currently, in the .NET Group Communication and IMRU scenario, if one of the Evaluators fails for any reason, the driver kills all the Evaluators.

      There are multiple levels of fault tolerance. The scenario we would like to support in this JIRA is:

      • When an Evaluator fails, the failed Evaluator is killed. The other healthy Evaluators stay alive, but all tasks running on them are stopped.
      • A new Evaluator is requested and started with the original task.
      • The same tasks are resubmitted to the remaining Evaluators.
      • The topology of those tasks is kept in the same group communication as before.
      • The data that has already been downloaded to the healthy Evaluators stays in place.
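
      The recovery flow described above can be sketched as a small simulation. This is only an illustration in Python, not REEF.NET code: the names Evaluator, Driver, launch, on_evaluator_failed, and the downloads counter are all hypothetical, and the sketch assumes the topology is keyed by task id so that it survives an Evaluator being replaced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Evaluator:
    """Hypothetical stand-in for an Evaluator; not the REEF.NET type."""
    evaluator_id: str
    task_id: str
    task_running: bool = False
    downloads: int = 0  # how many times this Evaluator fetched its input data


@dataclass
class Driver:
    """Hypothetical driver that restarts only the failed Evaluator."""
    # Assumption: topology maps task id -> parent task id (None for the root).
    topology: Dict[str, Optional[str]]
    evaluators: Dict[str, Evaluator] = field(default_factory=dict)
    _counter: int = 0

    def _start_new_evaluator(self, task_id: str) -> Evaluator:
        # A fresh Evaluator downloads its data once, then runs its task.
        self._counter += 1
        ev = Evaluator(f"evaluator-{self._counter}", task_id,
                       task_running=True, downloads=1)
        self.evaluators[ev.evaluator_id] = ev
        return ev

    def launch(self, task_ids: List[str]) -> None:
        for task_id in task_ids:
            self._start_new_evaluator(task_id)

    def on_evaluator_failed(self, failed_id: str) -> Evaluator:
        # 1. Drop the failed Evaluator; stop tasks on the healthy ones
        #    while keeping the Evaluators (and their data) alive.
        failed = self.evaluators.pop(failed_id)
        survivors = list(self.evaluators.values())
        for ev in survivors:
            ev.task_running = False

        # 2. Request a new Evaluator and start it with the original task.
        replacement = self._start_new_evaluator(failed.task_id)

        # 3. Resubmit the same tasks to the remaining Evaluators. The
        #    topology is untouched and nothing is downloaded again.
        for ev in survivors:
            ev.task_running = True
        return replacement
```

      In this sketch the healthy Evaluators keep downloads == 1 after recovery, which models the requirement that already downloaded data stays in place, and the topology dictionary is never modified because it refers to task ids rather than Evaluator ids.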

              People

              • Assignee: Julia Wang (juliaw)
              • Reporter: Julia Wang (juliaw)