Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0, 3.1.0
    • Component/s: None
    • Labels: None

      Description

      Scenario: two ResourceManagers, rm1 and rm2, start with the queues root.default and root.a, and rm1 is active. First, put root.a into the STOPPED state, then remove it. Then transition rm1 to standby and rm2 to active. The transition fails with this exception:

      Operation failed: Error on refreshAll during transition to Active
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
      	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
      	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
      Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
      	... 10 more
      Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
      	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
      	... 11 more
      Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
      	... 13 more

      It seems rm2 never saw root.a in the STOPPED state. When it loads the new configuration and finds root.a deleted while its own record of the queue is still RUNNING, it throws the exception.
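
The suspected failure mode can be sketched as follows. This is a minimal illustration, assuming each RM keeps queue state only in memory and the standby never observes the intermediate STOPPED update. The names `QueueRemovalSketch`, `RmView`, `reinitialize`, and `simulateFailover` are hypothetical and are not Hadoop's API; only the error message text is taken from the trace above.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class QueueRemovalSketch {
    enum QueueState { RUNNING, STOPPED }

    /** Hypothetical stand-in for one RM's in-memory view of queue states. */
    static final class RmView {
        final Map<String, QueueState> queues = new HashMap<>();

        RmView(String... queueNames) {
            for (String q : queueNames) {
                queues.put(q, QueueState.RUNNING);
            }
        }

        /** Rejects the new config if it drops a queue this RM still sees as not STOPPED. */
        void reinitialize(Set<String> newConfigQueues) throws IOException {
            for (Map.Entry<String, QueueState> e : queues.entrySet()) {
                boolean deleted = !newConfigQueues.contains(e.getKey());
                if (deleted && e.getValue() != QueueState.STOPPED) {
                    throw new IOException(e.getKey()
                        + " is deleted from the new capacity scheduler configuration,"
                        + " but the queue is not yet in stopped state. Current State : "
                        + e.getValue());
                }
            }
            // Validation passed: drop the queues that are no longer configured.
            queues.keySet().retainAll(newConfigQueues);
        }
    }

    /** Replays the scenario; returns rm2's error message, or null if rm2 succeeds. */
    static String simulateFailover() throws IOException {
        RmView rm1 = new RmView("root.default", "root.a");
        RmView rm2 = new RmView("root.default", "root.a");
        Set<String> afterRemoval = Collections.singleton("root.default");

        // On the active rm1: stop root.a, then remove it. Both steps succeed.
        rm1.queues.put("root.a", QueueState.STOPPED);
        rm1.reinitialize(afterRemoval);

        // rm2 was standby and only sees the final config, not the STOPPED step.
        try {
            rm2.reinitialize(afterRemoval);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(simulateFailover());
    }
}
```

In this sketch rm1 accepts the removal because it applied the STOPPED transition first, while rm2 rejects the same final configuration because its view of root.a is still RUNNING.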

        Attachments

        1. YARN-7252-YARN-5734.002.patch
          10 kB
          Jonathan Hung
        2. YARN-7252-YARN-5734.001.patch
          9 kB
          Jonathan Hung

            People

            • Assignee: Jonathan Hung (jhung)
            • Reporter: Jonathan Hung (jhung)
            • Votes: 0
            • Watchers: 3
