  1. Hadoop YARN
  2. YARN-4725 [Umbrella] Auto-restart of containers
  3. YARN-3998

Add support in the NodeManager to re-launch containers

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None

      Description

      I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM launches a container, it can specify this value. The NM will then re-launch the container up to 'retry-times' times when it fails to run (e.g. its exit code is not 0).

      This saves a lot of time: it avoids repeating container localization, the RM does not need to re-schedule the container, and the local files in the container's working directory are left in place for re-use. (If the container has already downloaded some large files, it does not need to download them again on the next run.)

      We find it is useful in systems like Storm.
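
      The proposed behaviour can be illustrated with a minimal, self-contained sketch. It is not the NodeManager implementation and does not use any Hadoop API: the class RelaunchSketch, the method launchWithRetries, and the Result type are hypothetical names chosen for illustration, and the container is modelled as a function that returns an exit code. It assumes that 'retry-times' counts re-launches after the first attempt and that any non-zero exit code counts as a failure.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch of NM-side re-launch; not actual Hadoop code. */
final class RelaunchSketch {

    /** Final exit code and the total number of launches performed. */
    static final class Result {
        final int exitCode;
        final int launches;

        Result(int exitCode, int launches) {
            this.exitCode = exitCode;
            this.launches = launches;
        }
    }

    /**
     * Launches the container once, then re-launches it in place up to
     * retryTimes more times while it keeps exiting with a non-zero code.
     * No re-scheduling or re-localization is modelled: the same "container"
     * is simply run again, as the description proposes.
     */
    static Result launchWithRetries(IntSupplier container, int retryTimes) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        int launches = 0;
        int exitCode;
        do {
            exitCode = container.getAsInt(); // run the container, get its exit code
            launches++;
        } while (exitCode != 0 && launches <= retryTimes);
        return new Result(exitCode, launches);
    }
}
```

      With retryTimes set to 2, a container that always fails is launched three times in total (one initial launch plus two re-launches), while a container that succeeds on its first run is launched only once.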

        Attachments

        1. YARN-3998.09.patch
          101 kB
          Jun Gong
        2. YARN-3998.08.patch
          102 kB
          Jun Gong
        3. YARN-3998.07.patch
          92 kB
          Jun Gong
        4. YARN-3998.06.patch
          72 kB
          Jun Gong
        5. YARN-3998.05.patch
          68 kB
          Jun Gong
        6. YARN-3998.04.patch
          68 kB
          Jun Gong
        7. YARN-3998.03.patch
          133 kB
          Jun Gong
        8. YARN-3998.02.patch
          60 kB
          Jun Gong
        9. YARN-3998.01.patch
          45 kB
          Jun Gong

              People

              • Assignee: Jun Gong (hex108)
              • Reporter: Jun Gong (hex108)
              • Votes: 0
              • Watchers: 19

                Dates

                • Created:
                  Updated:
                  Resolved: