Hadoop YARN / YARN-9449

Non-exclusive labels can create reservation loop on cluster without unlabeled node


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.5
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      YARN-5342 (https://issues.apache.org/jira/browse/YARN-5342) added a counter to YARN so that unscheduled resource requests are first attempted on unlabeled nodes. This counter is reset only when a scheduling attempt happens on an unlabeled node.

      On Hadoop clusters that contain only labeled nodes, the counter can never be reset, so the labeled node is never skipped. Because the node is not skipped, the scheduler falls into the loop shown below in the YARN ResourceManager logs.

      This can block scheduling of a Spark executor, for example, and cause the Spark application to get stuck.

       

      2019-02-18 23:54:22,591 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1550533628872_0003_01_000023 Container Transitioned from NEW to RESERVED
      2019-02-18 23:54:22,591 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1550533628872_0003 resource=<memory:11264, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 cluster=<memory:24576, vCores:16>
      2019-02-18 23:54:22,592 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:24576, vCores:16>
      2019-02-18 23:54:23,592 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1550533628872_0003 on node: ip-10-0-0-122.ec2.internal:8041
      2019-02-18 23:54:23,592 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1550533628872_0003 unreserved on node host: ip-10-0-0-122.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:7> used=<memory:11264, vCores:1>, currently has 0 at priority 1; currentReservation <memory:0, vCores:0> on node-label=LABELED
      2019-02-18 23:54:23,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1550533628872_0003_01_000024 Container Transitioned from NEW to RESERVED
      2019-02-18 23:54:23,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1550533628872_0003 resource=<memory:11264, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 cluster=<memory:24576, vCores:16>
      2019-02-18 23:54:23,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:24576, vCores:16>
      2019-02-18 23:54:24,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1550533628872_0003 on node: ip-10-0-0-122.ec2.internal:8041
      2019-02-18 23:54:24,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1550533628872_0003 unreserved on node host: ip-10-0-0-122.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:7> used=<memory:11264, vCores:1>, currently has 0 at priority 1; currentReservation <memory:0, vCores:0> on node-label=LABELED
      2019-02-18 23:54:24,594 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1550533628872_0003_01_000025 Container Transitioned from NEW to RESERVED
      2019-02-18 23:54:24,594 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1550533628872_0003 resource=<memory:11264, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 cluster=<memory:24576, vCores:16>
      2019-02-18 23:54:24,594 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:24576, vCores:16>
      2019-02-18 23:54:25,594 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1550533628872_0003 on node: ip-10-0-0-122.ec2.internal:8041
      2019-02-18 23:54:25,595 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1550533628872_0003 unreserved on node host: ip-10-0-0-122.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:7> used=<memory:11264, vCores:1>, currently has 0 at priority 1; currentReservation <memory:0, vCores:0> on node-label=LABELED
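
      To make the failure mode easier to follow, here is a minimal standalone sketch of the behavior described above. It is not the actual CapacityScheduler/RegularContainerAllocator code; the class name, field name, threshold, and heartbeat methods are hypothetical stand-ins. The only assumption taken from this report is that the YARN-5342 missed-opportunity counter is reset exclusively when an unlabeled node is tried, so on a labeled-only cluster the labeled node eventually stops being skipped and the reserve/unreserve cycle repeats on every node heartbeat.

      public class LabeledOnlyClusterLoopSketch {

        // Hypothetical stand-in for the missed-opportunity counter added in YARN-5342.
        private int missedNonPartitionedOpportunities = 0;

        // Assumed threshold: skip non-exclusive labeled nodes for a default-partition
        // request until this many scheduling opportunities have been missed.
        private static final int SKIP_THRESHOLD = 2;

        /** The only place the counter is reset; never called on a cluster with no unlabeled nodes. */
        void onUnlabeledNodeHeartbeat() {
          missedNonPartitionedOpportunities = 0;
        }

        /** One heartbeat from a non-exclusive labeled node, for a default-partition request. */
        void onLabeledNodeHeartbeat(int heartbeat) {
          boolean skipLabeledNode = missedNonPartitionedOpportunities < SKIP_THRESHOLD;
          missedNonPartitionedOpportunities++;
          if (skipLabeledNode) {
            System.out.println("heartbeat " + heartbeat + ": labeled node skipped, waiting for an unlabeled node");
            return;
          }
          // The counter is past the threshold and is never reset, so the node is no
          // longer skipped: a container is reserved on the labeled node...
          System.out.println("heartbeat " + heartbeat + ": container RESERVED on labeled node");
          // ...but the reservation for the default-partition request cannot be
          // fulfilled there, so it is dropped and the cycle repeats on the next heartbeat.
          System.out.println("heartbeat " + heartbeat + ": reservation UNRESERVED");
        }

        public static void main(String[] args) {
          LabeledOnlyClusterLoopSketch scheduler = new LabeledOnlyClusterLoopSketch();
          // A cluster with only labeled nodes: onUnlabeledNodeHeartbeat() never fires,
          // so once the threshold is crossed the loop repeats indefinitely
          // (five heartbeats shown here).
          for (int hb = 1; hb <= 5; hb++) {
            scheduler.onLabeledNodeHeartbeat(hb);
          }
        }
      }

      Running the sketch prints the same reserve-then-unreserve pattern, once per heartbeat, that repeats in the ResourceManager log above.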


    People

      Assignee: Unassigned
      Reporter: Brandon Scheller (bdscheller)
      Votes: 0
      Watchers: 5
