Hadoop YARN / YARN-4519

Potential deadlock of CapacityScheduler between decrease container and assign containers

    Details

    • Hadoop Flags:
      Reviewed

      Description

      In CapacityScheduler.allocate(), the FiCaSchedulerApp sync lock is acquired first, and then the CapacityScheduler sync lock may be acquired in decreaseContainer().
      In the scheduler thread, the CapacityScheduler sync lock is acquired first in allocateContainersToNode(), and then the FiCaSchedulerApp sync lock may be acquired in FiCaSchedulerApp.assignContainers().
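      This is a classic lock-ordering inversion. The following is a minimal, self-contained sketch of the pattern, not the actual YARN code: appLock and schedulerLock stand in for the FiCaSchedulerApp and CapacityScheduler sync locks (the real classes use synchronized methods), and the pauses only make the interleaving easy to reproduce. If both threads reach their inner synchronized block at the same time, each holds the lock the other is waiting for.

      // Illustrative sketch only: appLock/schedulerLock stand in for the
      // FiCaSchedulerApp and CapacityScheduler sync locks.
      public class LockOrderDeadlockDemo {
        static final Object appLock = new Object();        // FiCaSchedulerApp lock
        static final Object schedulerLock = new Object();  // CapacityScheduler lock

        public static void main(String[] args) {
          // AM allocate() path: app lock first, then scheduler lock (decreaseContainer)
          Thread allocateCall = new Thread(() -> {
            synchronized (appLock) {
              pause();
              synchronized (schedulerLock) {
                System.out.println("allocate(): container decreased");
              }
            }
          });
          // Scheduler thread: scheduler lock first (allocateContainersToNode),
          // then app lock (assignContainers)
          Thread schedulerThread = new Thread(() -> {
            synchronized (schedulerLock) {
              pause();
              synchronized (appLock) {
                System.out.println("scheduler: containers assigned");
              }
            }
          });
          allocateCall.start();
          schedulerThread.start();  // with the pauses, the two threads deadlock
        }

        static void pause() {
          try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
      }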

      Attachments

      1. YARN-4519.1.patch
        14 kB
        MENG DING
      2. YARN-4519.2.patch
        14 kB
        MENG DING
      3. YARN-4519.3.patch
        43 kB
        MENG DING


          Activity

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #9280 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9280/)
          Move YARN-4519 in CHANGES.txt to 2.8 (jianhe: rev 39a71b606a4abdc46ea6dfce2b79480b47ec18b5)

          • hadoop-yarn-project/CHANGES.txt
          Jian He added a comment -

          Committed to 2.8 too.

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9202 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9202/)
          YARN-4519. Potential deadlock of CapacityScheduler between decrease (jianhe: rev 7f46636495e23693d588b0915f464fa7afd9102e)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedContainerChangeRequest.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerResizing.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
          Jian He added a comment -

          Committed to trunk and branch-2. Thanks MENG DING!
          Thanks Wangda and sandflee for the review!

          MENG DING added a comment -

          Thanks Wangda Tan. I will log a separate ticket for completedContainer once the fix for this issue is approved.

          Wangda Tan added a comment -

          MENG DING,

          It seems to me that schedulerHealth doesn't need the synchronized lock either. It will be eventually consistent, and all maps in schedulerHealth are concurrent maps, so we can expect it to be consistent most of the time.

          MENG DING added a comment -

          Hi, Wangda Tan

          IIUC, after this patch, the increase/decrease container logic needs to acquire the LeafQueue lock. Since container allocation/release acquires the LeafQueue lock too, race conditions on containers/resources will be avoided.

          Yes, exactly.

          One question not related to the patch: it looks safe to remove the synchronized lock from CS#completedContainerInternal, correct?

          I think we don't need to synchronize the entire function with the CS lock, only the part that updates the schedulerHealth. If you think this is worth fixing, I will log a separate ticket.
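          For illustration only, a toy sketch of that narrowing is below. ToyCS, releasesPerNode, and the simplified completedContainerInternal signature are made up for this example and are not the real CapacityScheduler code; the point is just that the container-release work runs outside the scheduler lock while only the schedulerHealth-style bookkeeping keeps it.

          // Toy sketch only; ToyCS is not the real CapacityScheduler.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          public class ToyCS {
            // schedulerHealth-style bookkeeping backed by a concurrent map
            private final Map<String, Long> releasesPerNode = new ConcurrentHashMap<>();

            void completedContainerInternal(String nodeId) {
              // ... release the container via the queue/app, under their own locks ...

              // Only this bookkeeping still runs under the CS lock.
              synchronized (this) {
                releasesPerNode.merge(nodeId, 1L, Long::sum);
              }
            }

            public static void main(String[] args) {
              new ToyCS().completedContainerInternal("node-1");
            }
          }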

          Wangda Tan added a comment -

          Thanks MENG DING for working on this patch,

          IIUC, after this patch, the increase/decrease container logic needs to acquire the LeafQueue lock. Since container allocation/release acquires the LeafQueue lock too, race conditions on containers/resources will be avoided.

          One question not related to the patch: it looks safe to remove the synchronized lock from CS#completedContainerInternal, correct?

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 42s trunk passed
          +1 compile 0m 27s trunk passed with JDK v1.8.0_66
          +1 compile 0m 30s trunk passed with JDK v1.7.0_91
          +1 checkstyle 0m 18s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 1m 11s trunk passed
          +1 javadoc 0m 22s trunk passed with JDK v1.8.0_66
          +1 javadoc 0m 27s trunk passed with JDK v1.7.0_91
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 25s the patch passed with JDK v1.8.0_66
          +1 javac 0m 25s the patch passed
          +1 compile 0m 28s the patch passed with JDK v1.7.0_91
          +1 javac 0m 28s the patch passed
          +1 checkstyle 0m 17s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 0 new + 361 unchanged - 8 fixed = 361 total (was 369)
          +1 mvnsite 0m 34s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 20s the patch passed
          +1 javadoc 0m 18s the patch passed with JDK v1.8.0_66
          +1 javadoc 0m 25s the patch passed with JDK v1.7.0_91
          -1 unit 61m 49s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66.
          -1 unit 63m 14s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91.
          +1 asflicense 0m 18s Patch does not generate ASF License warnings.
          142m 51s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
          JDK v1.7.0_91 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12784264/YARN-4519.3.patch
          JIRA Issue YARN-4519
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux dae87c35168f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / d62b4a4
          Default Java 1.7.0_91
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10384/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10384/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/10384/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-YARN-Build/10384/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10384/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Max memory used 76MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/10384/console

          This message was automatically generated.

          Jian He added a comment -

          Looks good to me overall, thanks for the effort!

          MENG DING added a comment -

          Jian He and I had a discussion offline. Instead of grabbing the CS lock for all three actions (i.e., updating increase requests, decreasing, and rollback), we only need to grab the queue lock for those actions.

          Attaching a new patch that implements this. It is a little more complicated than I had thought:

          • AbstractYarnScheduler.checkAndNormalizeContainerChangeRequest is changed to AbstractYarnScheduler.createSchedContainerChangeRequests, and it will NOT perform RMServerUtils.checkAndNormalizeContainerChangeRequest, because that check accesses the container resource and should not be done without a queue lock.
          • RMServerUtils.checkAndNormalizeContainerChangeRequest is changed to RMServerUtils.checkSchedContainerChangeRequest. It is called in LeafQueue.decreaseContainer and CapacityScheduler.updateIncreaseRequests, and is synchronized with the queue lock.
          • CapacityScheduler.updateIncreaseRequests and CapacityScheduler.decreaseContainer are no longer synchronized with the CS lock.
          • The normalization of the target resource is moved to RMServerUtils.validateIncreaseDecreaseRequest.
          • The bulk of the decrease resource logic is moved to LeafQueue.decreaseContainer (guarded by the queue lock), which does the following (see the sketch after this list):
            • Make sure target resource <= original resource
            • If there is an existing increase request for the same container, remove it
            • If target resource == original resource, don't do anything; otherwise, release the delta resource and notify the app, the node, and then the parent queue.
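          For reference, a toy sketch of that decrease flow is below. ToyQueue, the integer memory bookkeeping, and the pendingIncreaseMb map are stand-ins invented for this example, not the real LeafQueue/RMContainer code; the point is simply that every check and update happens under one queue lock.

          // Illustrative stand-in for the decrease flow; not the real LeafQueue.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class ToyQueue {
            private final Map<Long, Integer> allocatedMb = new ConcurrentHashMap<>();
            private final Map<Long, Integer> pendingIncreaseMb = new ConcurrentHashMap<>();

            void allocate(long containerId, int memoryMb) {
              allocatedMb.put(containerId, memoryMb);
            }

            // All checks and updates happen under the queue lock.
            synchronized void decreaseContainer(long containerId, int targetMb) {
              Integer current = allocatedMb.get(containerId);
              if (current == null || targetMb > current) {
                throw new IllegalArgumentException("target must be <= original resource");
              }
              // Cancel any outstanding increase request for the same container.
              pendingIncreaseMb.remove(containerId);
              if (targetMb == current) {
                return; // nothing to do
              }
              int deltaMb = current - targetMb;
              allocatedMb.put(containerId, targetMb);
              System.out.println("released " + deltaMb
                  + " MB; notify app, node, and parent queue");
            }
          }

          public class DecreaseSketch {
            public static void main(String[] args) {
              ToyQueue q = new ToyQueue();
              q.allocate(1L, 4096);
              q.decreaseContainer(1L, 2048); // releases 2048 MB under the queue lock
            }
          }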
          MENG DING added a comment -

          The JavaDoc and unit test errors are not related to this issue.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 37s trunk passed
          +1 compile 0m 27s trunk passed with JDK v1.8.0_66
          +1 compile 0m 32s trunk passed with JDK v1.7.0_91
          +1 checkstyle 0m 15s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 1m 13s trunk passed
          -1 javadoc 0m 22s hadoop-yarn-server-resourcemanager in trunk failed with JDK v1.8.0_66.
          +1 javadoc 0m 27s trunk passed with JDK v1.7.0_91
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 26s the patch passed with JDK v1.8.0_66
          +1 javac 0m 26s the patch passed
          +1 compile 0m 28s the patch passed with JDK v1.7.0_91
          +1 javac 0m 28s the patch passed
          -1 checkstyle 0m 15s Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 145, now 144).
          +1 mvnsite 0m 34s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 17s the patch passed
          -1 javadoc 0m 22s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66.
          +1 javadoc 0m 26s the patch passed with JDK v1.7.0_91
          -1 unit 64m 39s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66.
          -1 unit 65m 45s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91.
          +1 asflicense 0m 19s Patch does not generate ASF License warnings.
          148m 15s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
          JDK v1.7.0_91 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12781333/YARN-4519.2.patch
          JIRA Issue YARN-4519
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 5cd526d2ed4e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 109e528
          Default Java 1.7.0_91
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
          findbugs v3.0.0
          javadoc https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          javadoc https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-YARN-Build/10214/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10214/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Max memory used 76MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/10214/console

          This message was automatically generated.

          MENG DING added a comment -

          Attaching a new patch that addresses the failed test case.

          We only need to grab the CS lock when the decrease/increase requests are not empty.

          MENG DING added a comment -

          Please ignore the previous patch. I think there is room for improvement.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 24s trunk passed
          +1 compile 0m 26s trunk passed with JDK v1.8.0_66
          +1 compile 0m 31s trunk passed with JDK v1.7.0_91
          +1 checkstyle 0m 14s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          +1 findbugs 1m 10s trunk passed
          +1 javadoc 0m 21s trunk passed with JDK v1.8.0_66
          +1 javadoc 0m 27s trunk passed with JDK v1.7.0_91
          +1 mvninstall 0m 30s the patch passed
          +1 compile 0m 23s the patch passed with JDK v1.8.0_66
          +1 javac 0m 23s the patch passed
          +1 compile 0m 27s the patch passed with JDK v1.7.0_91
          +1 javac 0m 27s the patch passed
          -1 checkstyle 0m 15s Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 145, now 144).
          +1 mvnsite 0m 34s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 16s the patch passed
          +1 javadoc 0m 18s the patch passed with JDK v1.8.0_66
          +1 javadoc 0m 23s the patch passed with JDK v1.7.0_91
          -1 unit 59m 39s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66.
          -1 unit 60m 16s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91.
          +1 asflicense 0m 16s Patch does not generate ASF License warnings.
          136m 58s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
          JDK v1.7.0_91 Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12781009/YARN-4519.1.patch
          JIRA Issue YARN-4519
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ab89cefe1223 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 52b7757
          Default Java 1.7.0_91
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/10192/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10192/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10192/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/10192/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-YARN-Build/10192/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10192/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Max memory used 76MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/10192/console

          This message was automatically generated.

          MENG DING added a comment -

          Attaching the latest patch that addresses this issue:

          We need to make sure the following operations happen under the same CS synchronization lock:

          1. Compute the delta resource for an increase request and insert it into the application
          2. Compute the delta resource for a decrease request and call CS.decreaseContainer
          3. Rollback action

          1 and 2 are addressed in this patch. 3 will be addressed in YARN-4138.

          sandflee added a comment -

          Got it, thanks MENG DING!

          MENG DING added a comment -

          Hi, sandflee

          The delta resource is the difference between the currently allocated resource and the target resource (or, in the case of rollback, the difference between the currently allocated resource and the last confirmed resource). For a decrease request, for example, we need to hold the CS lock around both computing the delta resource and calling CS.decreaseContainer; otherwise the currently allocated resource might be changed in the meantime by the scheduling thread, making the delta resource outdated.

          decreaseContainer updates core scheduler statistics, so it must be locked.
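          A minimal illustration of that race is below, with made-up fields (allocatedMb, clusterUsedMb) rather than the real scheduler objects: when the delta and the decrease are applied under one lock, the scheduling thread cannot change the allocation in between, whereas a delta precomputed outside the lock can go stale.

          // Illustrative sketch only; not the real CapacityScheduler API.
          public class DeltaUnderLockSketch {
            private int allocatedMb = 4096;   // currently allocated to the container
            private int clusterUsedMb = 4096; // core scheduler statistics

            // Safe: delta is computed and applied under the same scheduler lock.
            synchronized void decreaseTo(int targetMb) {
              int deltaMb = allocatedMb - targetMb;
              allocatedMb = targetMb;
              clusterUsedMb -= deltaMb;       // statistics stay consistent
            }

            // Risky pattern: a delta computed earlier, outside the lock, may no longer
            // match allocatedMb if the scheduling thread granted an increase meanwhile.
            synchronized void applyStaleDelta(int deltaMb) {
              allocatedMb -= deltaMb;
              clusterUsedMb -= deltaMb;
            }

            public static void main(String[] args) {
              DeltaUnderLockSketch s = new DeltaUnderLockSketch();
              s.decreaseTo(2048); // releases 2048 MB atomically with the statistics update
            }
          }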

          sandflee added a comment -

          Sorry, I don't understand:
          1. Why should we compute the delta resource under the CS lock?
          2. Why should we put decreaseContainer under the CS lock?
          Which race conditions does this avoid? Thanks.

          MENG DING added a comment -

          Agreed

          Wangda Tan added a comment -

          MENG DING,
          I haven't started, so please go ahead if you're interested.

          We need to make sure the following operations happen under the same CS synchronization lock:
          1. Compute the delta resource for an increase request and insert it into the application
          2. Compute the delta resource for a decrease request and call CS.decreaseContainer
          3. Rollback action

          A race could happen if we compute the delta resource under one CS lock but insert the request under another CS lock.

          Agree?

          MENG DING added a comment -

          Yes, the race only happens when computing the delta resource, and yes, it also happens for increase requests.

          If so, can we set the delta resource of the SchedContainerChangeRequest when we enter decreaseContainer?

          I guess whichever way we take, we need to make sure the delta resource is computed within the scope of a CS lock (i.e., the delta resource for a decrease request, an increase request, and a rollback action).

          I can take a shot at working out a patch if nobody is working on that yet.

          Wangda Tan added a comment -

          I also think that it is necessary to move the logic of creating normalizedDecreaseRequests (i.e., SchedContainerChangeRequest) into the decreaseContainer() call (under the scope of the CS lock); otherwise there would be a race condition when creating the delta resources. What do you think?

          It seems to me that the race could only happen when computing the delta resource, correct? If so, can we set the delta resource of the SchedContainerChangeRequest when we enter decreaseContainer?
          And also, updateIncreaseRequests could have the same race condition, couldn't it?

          MENG DING added a comment -

          This approach would be simpler, at the expense of acquiring a CS lock in the allocate call (though no worse than the existing logic).

          I also think that it is necessary to move the logic of creating normalizedDecreaseRequests (i.e., SchedContainerChangeRequest) into the decreaseContainer() call (under the scope of the CS lock); otherwise there would be a race condition when creating the delta resources. What do you think?

          Wangda Tan added a comment - edited

          Thanks to Jian He for finding this issue, and to sandflee/MENG DING for the analysis.

          I think the simplest solution could be to move

               // Decrease containers
                decreaseContainers(normalizedDecreaseRequests, application);
          

          out of the synchronized lock on the application:

              synchronized (application) {
                // ...
              }
              // put it here.


          Also, in AbstractYarnScheduler#decreaseContainers, it's better to move

                boolean hasIncreaseRequest =
                    attempt.removeIncreaseRequest(request.getNodeId(),
                        request.getPriority(), request.getContainerId());
          

          into decreaseContainer.

          After the above changes, decreasing a container needs to acquire the CS lock first, and YARN-4138 can directly use decreaseContainer to roll back a container.

          Thoughts?

          MENG DING added a comment -

          I feel that the correct solution would be to simply put all decrease requests into a pendingDecrease list in the allocate() call (after some initial sanity checks, of course), and then, in the allocateContainersToNode() call, process all pending decrease requests before allocating new/increased resources. This would also make resource rollback easy.
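          Purely to illustrate that idea, here is a self-contained sketch; the class, the Runnable-based queue, and the method bodies are invented for the example and are not the actual CapacityScheduler implementation.

          // Illustrative only: a pending-decrease queue handed off from allocate()
          // to the scheduling thread.
          import java.util.Queue;
          import java.util.concurrent.ConcurrentLinkedQueue;

          public class PendingDecreaseSketch {
            // decrease requests queued from allocate() after basic sanity checks
            private final Queue<Runnable> pendingDecrease = new ConcurrentLinkedQueue<>();

            // AM allocate() path: no scheduler lock needed here.
            void allocate(long containerId, int targetMb) {
              pendingDecrease.add(() -> System.out.println(
                  "decrease container " + containerId + " to " + targetMb + " MB"));
            }

            // Scheduling thread, under its own lock: drain pending decreases first,
            // then hand out new/increased allocations.
            synchronized void allocateContainersToNode() {
              for (Runnable r; (r = pendingDecrease.poll()) != null; ) {
                r.run();
              }
              // ... then allocate new containers / satisfy increase requests ...
            }

            public static void main(String[] args) {
              PendingDecreaseSketch s = new PendingDecreaseSketch();
              s.allocate(1L, 1024);
              s.allocateContainersToNode();
            }
          }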

          Also, the following code may have an issue:

          CapacityScheduler.allocate
          // Pre-process increase requests
              List<SchedContainerChangeRequest> normalizedIncreaseRequests =
                  checkAndNormalizeContainerChangeRequests(increaseRequests, true);
          
              // Pre-process decrease requests
              List<SchedContainerChangeRequest> normalizedDecreaseRequests =
                  checkAndNormalizeContainerChangeRequests(decreaseRequests, false);
          

          There could be race conditions when calculating the delta resource for the SchedContainerChangeRequest, since the above code is not synchronized with the scheduler.

          Thoughts, Wangda Tan?


            People

            • Assignee: MENG DING
            • Reporter: sandflee
            • Votes: 0
            • Watchers: 9
