YARN-4584: RM startup failure when AM attempts greater than max-attempts

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.9.0
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Configure 3 queues in a cluster with 8 GB of memory:

      1. queue 40%
      2. queue 50%
      3. default 10%
      • Submit applications to all 3 queues with a container size of 1024 MB (a sleep job with 50 containers on each queue).
      • The AM assigned to the default queue gets preempted immediately. After about 20 preemptions, kill all applications.
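
      To make the capacity split concrete, the sketch below works out each queue's share of the 8 GB cluster and how many whole 1024 MB containers fit in it. The class and method names are hypothetical, the queue labels are placeholders taken from the list above, and 1 GB is assumed to be 1024 MB. It is only arithmetic, not scheduler code.

```java
import java.util.Locale;

class QueueCapacitySketch {
    // Share of cluster memory (in MB) for a queue with the given capacity percentage.
    static double guaranteedMb(int clusterMb, double percent) {
        return clusterMb * percent / 100.0;
    }

    // Number of whole containers of containerMb that fit in the queue's share.
    static int wholeContainers(int clusterMb, double percent, int containerMb) {
        return (int) (guaranteedMb(clusterMb, percent) / containerMb);
    }

    public static void main(String[] args) {
        int clusterMb = 8 * 1024; // 8 GB cluster, assuming 1 GB = 1024 MB
        int containerMb = 1024;   // container size from the report
        String[] names = {"queue 1", "queue 2", "default"}; // placeholder labels
        double[] percents = {40, 50, 10};
        for (int i = 0; i < names.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                "%s: %.1f MB share, %d whole container(s)",
                names[i],
                guaranteedMb(clusterMb, percents[i]),
                wholeContainers(clusterMb, percents[i], containerMb)));
        }
    }
}
```

      The default queue's 10% share comes to 819.2 MB, which is less than a single 1024 MB container, so an AM running there is using capacity beyond the queue's share. That is consistent with the repeated preemption described in this report.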

      Due to the resource limit in the default queue, the AM was preempted about 20 times.
      On restart, the RM then fails to start.
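
      The title says the number of AM attempts can exceed max-attempts. One way that can happen is if preempted attempts are not counted against the limit, so an application can accumulate many attempts without ever being failed. The sketch below illustrates that counting rule under this assumption. The types, names, and the limit of 2 are all hypothetical and are not taken from the report or from the YARN source.

```java
import java.util.Collections;
import java.util.List;

class AttemptCountSketch {
    // Hypothetical exit reasons for an AM attempt; not YARN's actual types.
    enum Exit { FAILED, PREEMPTED }

    // Counts only the attempts assumed to count toward max-attempts.
    // Assumption: preempted attempts are excluded from the count.
    static int countedFailures(List<Exit> attempts) {
        int counted = 0;
        for (Exit exit : attempts) {
            if (exit == Exit.FAILED) {
                counted++;
            }
        }
        return counted;
    }

    // True once the counted failures reach the configured limit.
    static boolean limitReached(List<Exit> attempts, int maxAttempts) {
        return countedFailures(attempts) >= maxAttempts;
    }

    public static void main(String[] args) {
        // About 20 preempted attempts, as in the report.
        List<Exit> attempts = Collections.nCopies(20, Exit.PREEMPTED);
        int maxAttempts = 2; // assumed limit, for illustration only
        System.out.println("total attempts = " + attempts.size());
        System.out.println("counted failures = " + countedFailures(attempts));
        System.out.println("limit reached = " + limitReached(attempts, maxAttempts));
    }
}
```

      Under this rule the total number of attempts (20) is far above the assumed limit while the counted failures stay at zero, so any recovery code that expects at most max-attempts stored attempts would see more than it anticipates.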

      2016-01-12 10:49:04,081 DEBUG org.apache.hadoop.service.AbstractService: noteFailure java.lang.NullPointerException
      2016-01-12 10:49:04,081 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException
      java.lang.NullPointerException
              at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:887)
              at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:826)
              at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:953)
              at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:946)
              at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:786)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:328)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:464)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1232)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1022)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1062)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1058)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1058)
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:323)
              at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:877)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:467)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      2016-01-12 10:49:04,082 DEBUG org.apache.hadoop.service.AbstractService: Service: RMActiveServices entered state STOPPED
      2016-01-12 10:49:04,082 DEBUG org.apache.hadoop.service.CompositeService: RMActiveServices: stopping services, size=16
      
      

      Attachments

        1. 0006-YARN-4584.patch
          7 kB
          Bibin Chundatt
        2. 0005-YARN-4584.patch
          9 kB
          Bibin Chundatt
        3. 0004-YARN-4584.patch
          9 kB
          Bibin Chundatt
        4. 0003-YARN-4584.patch
          7 kB
          Bibin Chundatt
        5. 0002-YARN-4584.patch
          7 kB
          Bibin Chundatt
        6. 0001-YARN-4584.patch
          6 kB
          Bibin Chundatt

            People

              Assignee: Bibin Chundatt
              Reporter: Bibin Chundatt
              Votes: 0
              Watchers: 8
