Hadoop YARN / YARN-68

NodeManager will refuse to shut down indefinitely due to container log aggregation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.3
    • Fix Version/s: 2.0.2-alpha, 0.23.3
    • Component/s: nodemanager
    • Labels:
      None
    • Environment:

      QE

      Description

      The nodemanager can get into a state in which containermanager.logaggregation.AppLogAggregatorImpl apparently waits
      indefinitely for log aggregation to complete for an application, even if that application has terminated abnormally and is no longer present.

      The observed behavior is that an attempt to stop the nodemanager daemon returns but has no effect, and the NM log continually displays messages similar to this:

      [Thread-1]2012-08-21 17:44:07,581 INFO
      org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
      Waiting for aggregation to complete for application_1345221477405_2733

      The only recovery we found to work was to 'kill -9' the nm process.

      What exactly causes the NM to enter this state is unclear, but we see this behavior reliably when the NM has run a task that failed. For example, when debugging oozie distcp actions, a failed distcp map task leaves the NM that was running the container in this state, and a shutdown of that NM never completes. 'Never' in this case means we waited 2 hours before killing the nodemanager process.
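      The hang described above is an unbounded wait at shutdown. The sketch below is not the YARN-68 patch and uses no YARN classes. It is a minimal, self-contained illustration, under the assumption that aggregators run on a thread pool, of how a shutdown wait can be bounded by a timeout and followed by an interrupt. The class name BoundedAggregationShutdown, the method stop, and the 200 ms timeout are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names come from the YARN code base.
public class BoundedAggregationShutdown {

    /**
     * Asks the pool to finish, waits up to timeoutMs, then interrupts stragglers.
     * Returns true only if every task finished within the first timeout.
     */
    public static boolean stop(ExecutorService pool, long timeoutMs)
            throws InterruptedException {
        pool.shutdown(); // accept no new tasks, let running ones finish
        if (pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
            return true; // all workers finished on their own
        }
        pool.shutdownNow(); // interrupt workers that are still waiting
        pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for an aggregator waiting on an application that never completes.
        pool.submit(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                // Exit on interrupt instead of resuming the wait.
                Thread.currentThread().interrupt();
            }
        });
        boolean clean = stop(pool, 200);
        System.out.println("clean=" + clean + " terminated=" + pool.isTerminated());
    }
}
```

      The design point is that the worker treats an interrupt as a signal to exit rather than to keep waiting, so the daemon's stop request can complete without an external 'kill -9'.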

      1. YARN-68.patch
        9 kB
        Daryn Sharp
      2. YARN-68-1.patch
        10 kB
        Daryn Sharp


          People

          • Assignee: Daryn Sharp
          • Reporter: patrick white
          • Votes: 0
          • Watchers: 7
