Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • None
    • 2.9.0, 3.0.0, 3.1.0
    • None
    • None

    Description

      Scenario: two ResourceManagers, rm1 and rm2, with a starting configuration that contains the queues root.default and root.a. rm1 is active. First, put root.a into the STOPPED state, then remove it. Then transition rm1 to standby and rm2 to active. The transition fails with this exception:

      Operation failed: Error on refreshAll during transition to Active
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
      	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
      	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
      Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
      	... 10 more
      Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
      	... 11 more
      Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
      	... 13 more

      It seems rm2 never registered that root.a was STOPPED. When it reinitializes its queues and finds root.a missing from the new configuration, it still considers the queue RUNNING, so it throws the exception.
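
      The suspected failure mode can be reproduced in miniature. The sketch below is not the CapacitySchedulerQueueManager code; RmView, reinitialize and simulateFailover are hypothetical names, and the assumption that rm2 keeps a stale in-memory queue state is taken from the guess above rather than confirmed. Only the wording of the error message is copied from the stack trace.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class QueueRemovalSketch {
    enum QueueState { RUNNING, STOPPED }

    /** Hypothetical stand-in for one ResourceManager's in-memory view of the queues. */
    static class RmView {
        final Map<String, QueueState> queues = new HashMap<>();

        /** Rejects a new config that drops a queue this RM does not see as STOPPED. */
        void reinitialize(Set<String> newConfigQueues) throws IOException {
            for (Map.Entry<String, QueueState> e : queues.entrySet()) {
                boolean deleted = !newConfigQueues.contains(e.getKey());
                if (deleted && e.getValue() != QueueState.STOPPED) {
                    throw new IOException(e.getKey()
                        + " is deleted from the new capacity scheduler configuration,"
                        + " but the queue is not yet in stopped state."
                        + " Current State : " + e.getValue());
                }
            }
            // Validation passed: drop removed queues, add any new ones as RUNNING.
            queues.keySet().retainAll(newConfigQueues);
            for (String q : newConfigQueues) {
                queues.putIfAbsent(q, QueueState.RUNNING);
            }
        }
    }

    /** Returns the error message rm2 raises, or null if the failover succeeds. */
    static String simulateFailover() {
        RmView rm1 = new RmView();
        RmView rm2 = new RmView();
        for (RmView rm : new RmView[] {rm1, rm2}) {
            rm.queues.put("root.default", QueueState.RUNNING);
            rm.queues.put("root.a", QueueState.RUNNING);
        }
        Set<String> configWithoutA = Collections.singleton("root.default");
        try {
            // Active rm1 stops root.a, then removes it: accepted.
            rm1.queues.put("root.a", QueueState.STOPPED);
            rm1.reinitialize(configWithoutA);
            // Assumption: standby rm2 never observed the STOPPED transition.
            rm2.reinitialize(configWithoutA);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulateFailover());
    }
}
```

      In this model rm1 accepts the removal because it saw the queue stop, while rm2 rejects the same configuration because its own copy of root.a is still RUNNING.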

      Attachments

        1. YARN-7252-YARN-5734.002.patch
          10 kB
          Jonathan Hung
        2. YARN-7252-YARN-5734.001.patch
          9 kB
          Jonathan Hung

          People

            Assignee: Jonathan Hung (jhung)
            Reporter: Jonathan Hung (jhung)
            Votes: 0
            Watchers: 3
