Hadoop YARN / YARN-74

nodemanager should clean up running containers on shutdown

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.23.3
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

      Description

      Currently the nodemanager doesn't clean up running containers when it is restarted. This can cause containers to be lost and stick around forever. We've seen this happen multiple times when the RM is restarted: when the RM is brought back up, it doesn't know what was running on the cluster, so it tells the NMs to reboot, and when an NM reboots it loses track of what it had running. If any of those containers are behaving badly, nothing is left that knows about them and can kill them.

      We should try to kill any running containers when the nodemanager is shutting down. We should also check for leftover containers when the nodemanager is being brought back up, but that will be a separate jira.

      This might change a bit when RM restart is implemented, if tasks can actually survive the RM/NM being rebooted, but that can be addressed at that point.
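The shutdown cleanup proposed above could look roughly like the following. This is a minimal sketch, not the nodemanager's actual code: `ContainerTracker`, `FakeOS`, the container ids, and the pids are all hypothetical names made up for illustration, and a real implementation would wait a short delay between the polite signal and the forced kill.

```python
# Hypothetical sketch of "kill running containers on shutdown".
# None of these names come from the YARN code base.
SIGTERM, SIGKILL = 15, 9  # conventional POSIX signal numbers


class ContainerTracker:
    """Remembers which container runs under which pid."""

    def __init__(self, send_signal, is_alive):
        self.send_signal = send_signal  # callable(pid, sig)
        self.is_alive = is_alive        # callable(pid) -> bool
        self.running = {}               # container id -> pid

    def launch(self, container_id, pid):
        self.running[container_id] = pid

    def shutdown(self):
        """Kill every tracked container; return ids that needed SIGKILL."""
        forced = []
        for container_id, pid in list(self.running.items()):
            self.send_signal(pid, SIGTERM)      # ask politely first
            if self.is_alive(pid):              # real code would wait here
                self.send_signal(pid, SIGKILL)  # then force
                forced.append(container_id)
            del self.running[container_id]
        return forced


class FakeOS:
    """Stand-in for the operating system so the sketch runs anywhere."""

    def __init__(self, pids, stubborn=()):
        self.alive = set(pids)
        self.stubborn = set(stubborn)  # pids that ignore SIGTERM

    def send_signal(self, pid, sig):
        if sig == SIGKILL or pid not in self.stubborn:
            self.alive.discard(pid)

    def is_alive(self, pid):
        return pid in self.alive


fake = FakeOS(pids=[101, 202], stubborn=[202])
nm = ContainerTracker(fake.send_signal, fake.is_alive)
nm.launch("container_a", 101)
nm.launch("container_b", 202)
forced = nm.shutdown()
```

Injecting `send_signal` and `is_alive` keeps the sketch independent of any real process API; in practice they would be backed by whatever mechanism actually launches and signals containers.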

          Activity

          Thomas Graves created issue -
          Bikas Saha added a comment -

          Once this jira is done, in which scenarios do you see NM terminating tasks upon start up also?

          For RM restart, I think it might be OK for NMs to terminate running tasks upon shutdown, as long as the NM gives the RM some time to come back up. If the RM comes back up within that time, it can take over control of the tasks as if nothing had happened. If it does not, then I think it's best for the NM to terminate the resource usage it is responsible for and leave the node in the state it was in at startup. Thoughts?
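The grace-period idea in this comment can be sketched as a small polling loop. This is only an illustration under assumptions: `wait_for_rm_or_cleanup`, its parameters, and `FakeClock` are hypothetical names, and the 60-second and 5-second values are arbitrary, not anything proposed in this issue.

```python
# Hypothetical sketch: give the RM a grace period before killing containers.
def wait_for_rm_or_cleanup(rm_is_back, kill_all_containers,
                           grace_seconds, poll_seconds, now, sleep):
    """Return True if the RM came back in time, False if containers were killed."""
    deadline = now() + grace_seconds
    while now() < deadline:
        if rm_is_back():
            return True        # RM resumes control; containers keep running
        sleep(poll_seconds)
    kill_all_containers()      # grace period expired: free the node's resources
    return False


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


# Case 1: the RM returns after 20 simulated seconds, inside a 60 s grace period.
clock = FakeClock()
killed = []
recovered = wait_for_rm_or_cleanup(
    rm_is_back=lambda: clock.now() >= 20,
    kill_all_containers=lambda: killed.append(True),
    grace_seconds=60, poll_seconds=5,
    now=clock.now, sleep=clock.sleep,
)

# Case 2: the RM never returns, so the containers are killed at the deadline.
clock2 = FakeClock()
killed2 = []
recovered2 = wait_for_rm_or_cleanup(
    rm_is_back=lambda: False,
    kill_all_containers=lambda: killed2.append(True),
    grace_seconds=10, poll_seconds=5,
    now=clock2.now, sleep=clock2.sleep,
)
```

Passing the clock and sleep functions in as arguments is a design choice for testability only; it says nothing about how the NM would actually detect that the RM is back.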

          Thomas Graves added a comment -

          It would terminate containers on startup in cases where NM didn't shut down gracefully or somehow missed something on shutdown - hardware issues, NM crashes, etc.

          Vinod Kumar Vavilapalli made changes -
          Field Original Value New Value
          Project Hadoop Map/Reduce [ 12310941 ] Hadoop YARN [ 12313722 ]
          Key MAPREDUCE-4213 YARN-74
          Affects Version/s 0.23.3 [ 12322841 ]
          Affects Version/s 0.23.3 [ 12320060 ]
          Target Version/s 0.23.3 [ 12320060 ]
          Component/s nodemanager [ 12319323 ]
          Component/s mrv2 [ 12314301 ]
          Component/s nodemanager [ 12315341 ]
          Vinod Kumar Vavilapalli added a comment -

          Duplicate of YARN-72.

          Vinod Kumar Vavilapalli made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Vinod Kumar Vavilapalli made changes -
          Link This issue is duplicated by YARN-72 [ YARN-72 ]
          Transition: Open to Resolved
          Time In Source Status: 122d 6h 24m
          Execution Times: 1
          Last Executer: Vinod Kumar Vavilapalli
          Last Execution Date: 31/Aug/12 20:59

            People

            • Assignee: Unassigned
            • Reporter: Thomas Graves
            • Votes: 0
            • Watchers: 9
