Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
Reviewed
Description
Currently, when the capacity configuration of the CapacityScheduler is wrong, the RM shuts down.
This does not happen in the case of a NodeLabels capacity mismatch.
In CapacityScheduler#initializeQueues
private void initializeQueues(CapacitySchedulerConfiguration conf)
    throws IOException {
  root = parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT,
      queues, queues, noop);
  labelManager.reinitializeQueueLabels(getQueueToLabels());
  root = parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT,
      queues, queues, noop);
  LOG.info("Initialized root queue " + root);
  initializeQueueMappings();
  setQueueAcls(authorizer, queues);
}
labelManager is initialized from the queues, and the check for a label-level capacity mismatch happens in parseQueue. So during initialization, when parseQueue runs, the labels are still empty and the mismatch is not detected.
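The ordering problem can be illustrated with a small standalone model. This is only a sketch under assumptions: LabelInitOrderSketch, knownLabels and initDetectsMismatch are hypothetical names, and the check is a simplified stand-in for the per-label validation, not the actual CapacityScheduler code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LabelInitOrderSketch {
    // Hypothetical stand-in for the label manager's view of the labels.
    private final Set<String> knownLabels = new HashSet<>();

    // Simplified stand-in for the per-label check done while parsing queues:
    // only labels that are already known get validated.
    void parseQueue(Map<String, Double> childCapacitySumByLabel) {
        for (String label : knownLabels) {
            double sum = childCapacitySumByLabel.getOrDefault(label, 0.0);
            if (Math.abs(1.0 - sum) > 1e-5) {
                throw new IllegalArgumentException(
                    "Illegal capacity of " + sum + " for label=" + label);
            }
        }
    }

    // Returns true if the bad label capacity is caught during initialization.
    static boolean initDetectsMismatch(Map<String, Double> sums,
                                       boolean parseAfterLabels) {
        LabelInitOrderSketch s = new LabelInitOrderSketch();
        try {
            s.parseQueue(sums);                  // labels still empty: nothing is checked
            s.knownLabels.addAll(sums.keySet()); // labels become known only now
            if (parseAfterLabels) {
                s.parseQueue(sums);              // check runs with labels populated
            }
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Double> sums = Collections.singletonMap("node2", 0.5);
        System.out.println(initDetectsMismatch(sums, false)); // false
        System.out.println(initDetectsMismatch(sums, true));  // true
    }
}
```

In this model a parse that runs before the labels are known validates nothing, so the mismatch only surfaces when the check runs again with the label set populated.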
Steps to reproduce
- Configure RM with capacity scheduler
- Add one or two node labels using rmadmin
- Configure the capacity scheduler XML with node labels, but with a wrong capacity configuration for an already added label
- Restart both RMs
- Check whether the node label list is populated during service init of the capacity scheduler
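As a rough illustration of the misconfiguration in the third step, the sketch below holds a few capacity scheduler properties in a map and sums the child capacities for one label. The child queue `a`, the class name and the helper methods are assumptions made for the example. The key pattern `yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity` is the standard one for per-label queue capacity, but this is not a complete configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelCapacityConfigSketch {
    static final String PREFIX = "yarn.scheduler.capacity.";

    // Hypothetical configuration: root has two children, but only one of
    // them is given capacity for label node2, so the children sum to 50.
    static Map<String, String> misconfigured() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put(PREFIX + "root.queues", "default,a");
        p.put(PREFIX + "root.default.capacity", "50");
        p.put(PREFIX + "root.a.capacity", "50");
        p.put(PREFIX + "root.accessible-node-labels.node2.capacity", "100");
        p.put(PREFIX + "root.default.accessible-node-labels.node2.capacity", "50");
        return p;
    }

    // Sums the per-label capacity (in percent) over the direct children of a queue.
    static double childLabelCapacitySum(Map<String, String> p,
                                        String parent, String label) {
        double sum = 0;
        for (String child : p.get(PREFIX + parent + ".queues").split(",")) {
            String key = PREFIX + parent + "." + child.trim()
                + ".accessible-node-labels." + label + ".capacity";
            sum += Double.parseDouble(p.getOrDefault(key, "0"));
        }
        return sum;
    }

    public static void main(String[] args) {
        double sum = childLabelCapacitySum(misconfigured(), "root", "node2");
        System.out.println("node2 children sum = " + sum + " (should be 100)");
    }
}
```

A sum of 50 percent corresponds to the 0.5 reported for label=node2 in the exception further down.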
Expected
RM should not start
Current exception on reinitialize check
2015-07-07 19:18:25,655 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Initialized queue: default: capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
2015-07-07 19:18:25,656 WARN org.apache.hadoop.yarn.server.resourcemanager.AdminService: Exception refresh queues.
java.io.IOException: Failed to re-init queues
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:383)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:376)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:605)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:314)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.lang.IllegalArgumentException: Illegal capacity of 0.5 for children of queue root for label=node2
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setChildQueues(ParentQueue.java:159)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:639)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:503)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:379)
    ... 8 more
2015-07-07 19:18:25,656 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf OPERATION=refreshQueues TARGET=AdminService RESULT=FAILURE DESCRIPTION=Exception refresh queues. PERMISSIONS=
2015-07-07 19:18:25,656 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=
2015-07-07 19:18:25,656 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:321)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.IOException: Failed to re-init queues
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:617)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:314)
    ... 5 more
Attachments
Issue Links
- is related to
-
YARN-2492 (Clone of YARN-796) Allow for (admin) labels on nodes and resource-requests
- Open