  1. Hadoop YARN
  2. YARN-9209

When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

Details

    • Reviewed

    Description

      When an application sets a placement constraint without specifying a nodePartition, the default partition is always used as the constraint when allocating containers. This is a problem when the application is submitted to a queue that does not have enough capacity available on the default partition.

      This is a common scenario when node labels are configured for a particular queue. The sample sleeper service below cannot get even a single container allocated when it is submitted to a "labeled_queue", even though enough capacity is available on the label/partition configured for the queue. Only the AM container runs.

      {
          "name": "sleeper-service",
          "version": "1.0.0",
          "queue": "labeled_queue",
          "components": [
              {
                  "name": "sleeper",
                  "number_of_containers": 2,
                  "launch_command": "sleep 90000",
                  "resource": {
                      "cpus": 1,
                      "memory": "4096"
                  },
                  "placement_policy": {
                      "constraints": [
                          {
                              "type": "ANTI_AFFINITY",
                              "scope": "NODE",
                              "target_tags": [
                                  "sleeper"
                              ]
                          }
                      ]
                  }
              }
          ]
      }
      

      It runs fine if I specify node_partitions explicitly in the constraint, as shown below.

      {
          "name": "sleeper-service",
          "version": "1.0.0",
          "queue": "labeled_queue",
          "components": [
              {
                  "name": "sleeper",
                  "number_of_containers": 2,
                  "launch_command": "sleep 90000",
                  "resource": {
                      "cpus": 1,
                      "memory": "4096"
                  },
                  "placement_policy": {
                      "constraints": [
                          {
                              "type": "ANTI_AFFINITY",
                              "scope": "NODE",
                              "target_tags": [
                                  "sleeper"
                              ],
                              "node_partitions": [
                                  "label"
                              ]
                          }
                      ]
                  }
              }
          ]
      }
      

      The cause appears to be that only the default partition "" is considered when no node_partition constraint is specified, as seen in the RM log below.

      2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367)) - Successfully added SchedulingRequest to app=appattempt_1547734161165_0010_000001 targetAllocationTags=[sleeper]. nodePartition= 
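
      The effect of the empty nodePartition in this log line can be illustrated with a small model. This is only a sketch of the behaviour described in this issue, not the actual SingleConstraintAppPlacementAllocator code; the class and method names are hypothetical, and a null argument stands in for "node_partitions not specified".

```java
// Simplified model of the reported behaviour; NOT the actual YARN source.
// All names here are hypothetical and exist only for illustration.
final class DefaultPartitionModel {
    // The default (unlabeled) partition is represented by the empty string.
    static final String DEFAULT_PARTITION = "";

    // Assumed current behaviour: a constraint without node_partitions
    // (modelled as null) is pinned to the default partition.
    static String effectivePartition(String requestedPartition) {
        return requestedPartition == null ? DEFAULT_PARTITION : requestedPartition;
    }

    // A node is a candidate only if its partition equals the effective partition.
    static boolean nodeMatches(String nodePartition, String requestedPartition) {
        return effectivePartition(requestedPartition).equals(nodePartition);
    }

    public static void main(String[] args) {
        // No node_partitions in the constraint: nodes labeled "label" are rejected.
        System.out.println(nodeMatches("label", null));
        // Explicit node_partitions: ["label"]: the same nodes are accepted.
        System.out.println(nodeMatches("label", "label"));
    }
}
```

      Under these assumptions the first call returns false and the second returns true, which matches the two service specs above: only the spec with an explicit node_partitions entry can be placed on the labeled nodes.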
      

      However, I think it makes more sense to consider "*", or the queue's default-node-label-expression if configured, when no node_partition is specified in the placement constraint. Not specifying any node_partition should ideally mean that placement is not restricted to any particular node_partition; currently the default partition is enforced instead.
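
      The proposed fallback order could look roughly like the sketch below. It is a minimal illustration, not the content of the attached patches; the class, the method, and the use of null for "not specified" are assumptions made for this example.

```java
// Hypothetical helper illustrating the proposed fallback order.
// The names are not taken from the YARN code base.
final class PartitionFallbackSketch {
    // Wildcard meaning "any partition".
    static final String ANY_PARTITION = "*";

    // constraintPartition: node_partition from the placement constraint,
    //   or null when the constraint does not specify one.
    // queueDefaultLabelExpression: the queue's default-node-label-expression,
    //   or null/empty when it is not configured.
    static String resolvePartition(String constraintPartition,
                                   String queueDefaultLabelExpression) {
        if (constraintPartition != null) {
            // An explicit value always wins, including "" for the default partition.
            return constraintPartition;
        }
        if (queueDefaultLabelExpression != null && !queueDefaultLabelExpression.isEmpty()) {
            // Fall back to the label expression configured on the queue.
            return queueDefaultLabelExpression;
        }
        // Nothing specified anywhere: do not restrict placement to a partition.
        return ANY_PARTITION;
    }

    public static void main(String[] args) {
        System.out.println(resolvePartition(null, "label"));
        System.out.println(resolvePartition(null, null));
    }
}
```

      With this ordering, the first sleeper-service spec submitted to "labeled_queue" would resolve to the queue's label when default-node-label-expression is configured, and to "*" otherwise, instead of being pinned to the default partition.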

      Attachments

        1. YARN-9209.001.patch
          3 kB
          Tarun Parimi
        2. YARN-9209.002.patch
          5 kB
          Tarun Parimi
        3. YARN-9209.003.patch
          5 kB
          Tarun Parimi

        Activity

          People

            tarunparimi Tarun Parimi
            tarunparimi Tarun Parimi
            Votes: 1
            Watchers: 8
