Hadoop YARN / YARN-4227

Ignore expired containers from removed nodes in FairScheduler


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.5.0, 2.7.1
    • Fix Version/s: 3.1.0, 2.10.0
    • Component/s: fairscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

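      Under some circumstances, a node is removed before the expired-container event for one of its containers is processed. The scheduler then throws a NullPointerException while handling the event, which causes the ResourceManager (RM) to exit: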

      2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1436927988321_1307950_01_000012 Timed out after 600 secs
      2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1436927988321_1307950_01_000012 Container Transitioned from ACQUIRED to EXPIRED
      2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1436927988321_1307950_01_000012 in state: EXPIRED event:EXPIRE
      2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1436927988321_1307950	CONTAINERID=container_1436927988321_1307950_01_000012
      2015-10-04 21:14:01,063 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
      java.lang.NullPointerException
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
      	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
      	at java.lang.Thread.run(Thread.java:745)
      2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
      

      The stack trace is from 2.3.0, but the same issue has been observed by different customers in 2.5.0 and 2.6.0.
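The failure mode described above can be illustrated with a minimal sketch. It is not the attached patch and does not use the real YARN classes: `ExpiredContainerSketch`, `Node`, and all method names are hypothetical stand-ins, and the node lookup is assumed to be a plain map keyed by node ID. The sketch only shows the general idea from the issue title: when the node for a late expiry event is already gone, ignore the event instead of dereferencing a null node.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified, hypothetical model of a scheduler's node bookkeeping.
 * None of these types are the real YARN classes.
 */
class ExpiredContainerSketch {

    /** Hypothetical stand-in for a scheduler node that counts its running containers. */
    static final class Node {
        private int runningContainers;

        Node(int runningContainers) {
            this.runningContainers = runningContainers;
        }

        void releaseContainer() {
            runningContainers--;
        }

        int getRunningContainers() {
            return runningContainers;
        }
    }

    // Assumption: nodes are tracked in a map keyed by node ID.
    private final Map<String, Node> nodes = new HashMap<>();

    void addNode(String nodeId, Node node) {
        nodes.put(nodeId, node);
    }

    void removeNode(String nodeId) {
        nodes.remove(nodeId);
    }

    /** Unguarded handling: throws NullPointerException if the node was already removed. */
    void completedContainerUnguarded(String nodeId) {
        Node node = nodes.get(nodeId);
        node.releaseContainer(); // node is null when the node was removed first
    }

    /**
     * Guarded handling: ignores the event when the node is no longer tracked.
     * Returns true if the container was released on a known node, false if ignored.
     */
    boolean completedContainerGuarded(String nodeId) {
        Node node = nodes.get(nodeId);
        if (node == null) {
            // The node was removed before the expiry event arrived; nothing to update.
            return false;
        }
        node.releaseContainer();
        return true;
    }

    public static void main(String[] args) {
        ExpiredContainerSketch scheduler = new ExpiredContainerSketch();
        scheduler.addNode("node-1", new Node(1));
        scheduler.removeNode("node-1"); // the node is removed first
        // The expiry event for a container on that node arrives afterwards.
        boolean released = scheduler.completedContainerGuarded("node-1");
        System.out.println("released = " + released);
    }
}
```

In this sketch the guarded method reports whether it did any work, so a caller could log the ignored event rather than letting the exception reach the event dispatcher.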

      Attachments

        1. YARN-4227.patch
          1 kB
          wilfreds#1
        2. YARN-4227.2.patch
          6 kB
          wilfreds#1
        3. YARN-4227.3.patch
          6 kB
          wilfreds#1
        4. YARN-4227.4.patch
          6 kB
          wilfreds#1
        5. YARN-4227.5.patch
          6 kB
          wilfreds#1
        6. YARN-4227.006.patch
          7 kB
          wilfreds#1


          People

            Assignee: Wilfred Spiegelenburg (wilfreds)
            Reporter: Wilfred Spiegelenburg (wilfreds)
            Votes: 0
            Watchers: 9
