Hadoop YARN / YARN-9877

Intermittent TIME_OUT of LogAggregationReport

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.3, 3.3.0, 3.2.1, 3.1.3
    • Fix Version/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      I noticed intermittent TIME_OUT statuses in some downstream log-aggregation-based tests.

      Steps to reproduce:

      • Run an MR job:
        hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
      • Suppose the AM requests more containers than it needs and realizes this as soon as they are allocated, so it releases them immediately. The state changes of such a container are: ALLOCATED -> ACQUIRED -> RELEASED.
        Suppose also that these extra containers are allocated on a different node from the one hosting the other 21 containers (AM + 10 mappers + 10 reducers).
      • All the containers finish successfully, and so does the app. The log aggregation status for the whole app, however, appears to be stuck in the RUNNING state.
      • After a while the final log aggregation status for the app changes to TIME_OUT.

      Root cause:

      • As the unused containers go through their state transitions in the RM's internal representation, the transition function of RMAppImpl$AppRunningOnNodeTransition is called. This calls the addReportIfNecessary method of RMAppLogAggregation, which unconditionally adds a "NOT_START" LogAggregationStatus for this NodeId to the app, even though the app has no running container on that node.
      • The NodeManager never updates the node's LogAggregationStatus to "SUCCEEDED", because no container of the app ever ran on it (note that the AM released the containers immediately after acquiring them). The LogAggregationStatus remains NOT_START until the timeout is reached. After that point the RM aggregates the LogAggregationReports of all the nodes, and although every node that ran containers reports SUCCEEDED, one particular node still has NOT_START, so the final log aggregation status becomes TIME_OUT.
        (I crawled the RM UI for the log aggregation statuses, and it was always NOT_START for this particular node).
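
      The mechanism above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not the ResourceManager code: the class LogAggregationSketch, its methods and the node names are hypothetical, and the rollup rule is reduced to "SUCCEEDED only if every tracked node succeeded, otherwise RUNNING until the timeout and TIME_OUT afterwards".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the behaviour described in this issue.
// None of these names come from the Hadoop code base.
class LogAggregationSketch {

  enum Status { NOT_START, RUNNING, SUCCEEDED, TIME_OUT }

  // Per-node log aggregation status tracked for a single application.
  private final Map<String, Status> perNode = new LinkedHashMap<>();

  // RM side: a node is registered with NOT_START as soon as any container
  // of the app is seen on it, even one that is released before it runs.
  void onAppRunningOnNode(String node) {
    perNode.putIfAbsent(node, Status.NOT_START);
  }

  // NM side: only nodes that actually ran containers send a report.
  void onNodeManagerReport(String node, Status status) {
    perNode.put(node, status);
  }

  // Simplified rollup over all tracked nodes.
  Status finalStatus(boolean timeoutElapsed) {
    boolean allSucceeded = true;
    for (Status s : perNode.values()) {
      if (s != Status.SUCCEEDED) {
        allSucceeded = false;
      }
    }
    if (allSucceeded) {
      return Status.SUCCEEDED;
    }
    return timeoutElapsed ? Status.TIME_OUT : Status.RUNNING;
  }

  public static void main(String[] args) {
    LogAggregationSketch app = new LogAggregationSketch();
    app.onAppRunningOnNode("node-1"); // hosts the 21 real containers
    app.onAppRunningOnNode("node-2"); // only saw released containers
    app.onNodeManagerReport("node-1", Status.SUCCEEDED);
    // node-2 never reports, so its entry stays NOT_START.
    System.out.println(app.finalStatus(false)); // RUNNING
    System.out.println(app.finalStatus(true));  // TIME_OUT
  }
}
```

      In this model, skipping the registration of node-2 (or removing its entry when its containers are released without having run) would let the rollup reach SUCCEEDED. Whether the attached patch takes that approach is not stated here.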

      This situation is highly unlikely, but it has an estimated failure rate of ~0.8%, based on 1500 runs over a year on an unstressed cluster.

        Attachments

        1. YARN-9877.001.patch
          4 kB
          Adam Antal

              People

              • Assignee: Adam Antal (adam.antal)
              • Reporter: Adam Antal (adam.antal)
              • Votes: 0
              • Watchers: 5
