  1. Hadoop Map/Reduce
  2. MAPREDUCE-4833

Task can get stuck in FAIL_CONTAINER_CLEANUP

Details

    • Reviewed

    Description

      If an NM goes down while the AM is still trying to launch a container on it, the ContainerLauncherImpl can get stuck waiting on an RPC timeout. Meanwhile, the RM may notice that the NM has gone away and inform the AM, which triggers a TA_FAILMSG. If the TA_FAILMSG arrives at the TaskAttemptImpl before the TA_CONTAINER_LAUNCH_FAILED message, the task attempt will try to kill the container, but the ContainerLauncherImpl will not send back a TA_CONTAINER_CLEANED event, leaving the attempt stuck in FAIL_CONTAINER_CLEANUP.
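
The ordering dependence described above can be illustrated with a toy state machine. This is only a sketch, not the real TaskAttemptImpl: the class name `StuckAttemptSketch`, the `step` and `run` helpers, and the reduced set of states and transitions are all assumptions made for illustration (in particular, every post-cleanup state is collapsed into a single FAILED state). It models only the report's observation that, once the attempt is in FAIL_CONTAINER_CLEANUP, nothing but a TA_CONTAINER_CLEANED event moves it forward.

```java
public class StuckAttemptSketch {
    // Hypothetical, heavily reduced state and event sets for illustration.
    enum State { ASSIGNED, FAIL_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    /** Toy transition function; the transitions are simplified assumptions. */
    static State step(State state, Event event) {
        switch (state) {
            case ASSIGNED:
                // A failure message asks for the container to be cleaned up first.
                if (event == Event.TA_FAILMSG) return State.FAIL_CONTAINER_CLEANUP;
                // A failed launch leaves no container to clean up.
                if (event == Event.TA_CONTAINER_LAUNCH_FAILED) return State.FAILED;
                return state;
            case FAIL_CONTAINER_CLEANUP:
                // Only the launcher's "cleaned" reply leaves this state;
                // a late TA_CONTAINER_LAUNCH_FAILED is ignored here.
                if (event == Event.TA_CONTAINER_CLEANED) return State.FAILED;
                return state;
            default:
                return state; // FAILED is terminal in this sketch
        }
    }

    /** Feeds the events to a fresh attempt, in order, and returns the final state. */
    static State run(Event... events) {
        State state = State.ASSIGNED;
        for (Event event : events) {
            state = step(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Launch failure is seen first: the attempt reaches a terminal state.
        System.out.println(run(Event.TA_CONTAINER_LAUNCH_FAILED, Event.TA_FAILMSG));
        // TA_FAILMSG is seen first and no TA_CONTAINER_CLEANED ever arrives:
        // the attempt stays in FAIL_CONTAINER_CLEANUP.
        System.out.println(run(Event.TA_FAILMSG, Event.TA_CONTAINER_LAUNCH_FAILED));
    }
}
```

In this sketch the second ordering never leaves FAIL_CONTAINER_CLEANUP, which mirrors the reported hang. How the attached patches resolve it is not modeled here.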

      Attachments

        1. MAPREDUCE4833-2.patch
          9 kB
          Robert Parker
        2. MAPREDUCE4833-1.patch
          9 kB
          Robert Parker
        3. MAPREDUCE4833.patch
          9 kB
          Robert Parker

        Activity

          People

            robsparker Robert Parker
            revans2 Robert Joseph Evans
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: