Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-3091 [Umbrella] Improve and fix locks of RM scheduler
  3. YARN-6025

Fix synchronization issues of AbstractYarnScheduler#nodeUpdate and its implementations

    Details

    • Hadoop Flags:
      Reviewed

      Description

      YARN-3139 does optimization on the locks by introducing ReentrantReadWriteLock to remove synchronized locks, but still seems to have some synchronization locks in AbstractYarnScheduler and its hierarchy because of nonavailibility of read write locks in Fifo scheduler.

      1. YARN-6025.002.patch
        10 kB
        Naganarasimha G R
      2. YARN-6025.01.patch
        8 kB
        Naganarasimha G R

        Activity

        Hide
        Naganarasimha Naganarasimha G R added a comment -

        Hi Tan, Wangda ,
        Hope you could have a look at it, as you had done the earlier modifications. I am not completely sure about my 3rd comment on AbstractYarnScheduler, hence not incorporating it in the patch.

        Show
        Naganarasimha Naganarasimha G R added a comment - Hi Tan, Wangda , Hope you could have a look at it, as you had done the earlier modifications. I am not completely sure about my 3rd comment on AbstractYarnScheduler, hence not incorporating it in the patch.
        Hide
        leftnoteasy Wangda Tan added a comment -

        Thanks Naganarasimha Garla,

        Minor comments:
        1) Should not remove synchronized lock of recoverContainersOnNode / nodeUpdate in AbstractYarnScheduler, as they are used by FifoScheduler
        2) Should keep readlock inside CS#nodeUpdate, since it is possible that collection of applications / queues / nodes could be updated during the call. The readlock is added to protect this happens.

        Show
        leftnoteasy Wangda Tan added a comment - Thanks Naganarasimha Garla , Minor comments: 1) Should not remove synchronized lock of recoverContainersOnNode / nodeUpdate in AbstractYarnScheduler, as they are used by FifoScheduler 2) Should keep readlock inside CS#nodeUpdate, since it is possible that collection of applications / queues / nodes could be updated during the call. The readlock is added to protect this happens.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 13s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 12m 58s trunk passed
        +1 compile 0m 33s trunk passed
        +1 checkstyle 0m 21s trunk passed
        +1 mvnsite 0m 34s trunk passed
        +1 mvneclipse 0m 16s trunk passed
        +1 findbugs 1m 3s trunk passed
        +1 javadoc 0m 23s trunk passed
        +1 mvninstall 0m 30s the patch passed
        +1 compile 0m 31s the patch passed
        +1 javac 0m 31s the patch passed
        +1 checkstyle 0m 19s the patch passed
        +1 mvnsite 0m 32s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 2s the patch passed
        +1 javadoc 0m 18s the patch passed
        -1 unit 38m 48s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        60m 10s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6025
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12844489/YARN-6025.01.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 8b2f53f577dd 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 22befbd
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14448/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14448/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14448/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 13s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 12m 58s trunk passed +1 compile 0m 33s trunk passed +1 checkstyle 0m 21s trunk passed +1 mvnsite 0m 34s trunk passed +1 mvneclipse 0m 16s trunk passed +1 findbugs 1m 3s trunk passed +1 javadoc 0m 23s trunk passed +1 mvninstall 0m 30s the patch passed +1 compile 0m 31s the patch passed +1 javac 0m 31s the patch passed +1 checkstyle 0m 19s the patch passed +1 mvnsite 0m 32s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 2s the patch passed +1 javadoc 0m 18s the patch passed -1 unit 38m 48s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 60m 10s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart Subsystem Report/Notes Docker Image:yetus/hadoop:a9ad5d6 JIRA Issue YARN-6025 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12844489/YARN-6025.01.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 8b2f53f577dd 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 22befbd Default Java 1.8.0_111 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/14448/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14448/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/14448/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        Naganarasimha Naganarasimha G R added a comment -

        Thanks Tan, Wangda,

        Should not remove synchronized lock of recoverContainersOnNode / nodeUpdate in AbstractYarnScheduler, as they are used by FifoScheduler

        IMHO FifoScheduler should extend and get it synchronized if required (nodeUpdate is already done in that way) because other schedulers are not dependent on it and better to avoid holding both kind of locks (reentrant lock and object's lock) for other schedulers.

        Should keep readlock inside CS#nodeUpdate, since it is possible that collection of applications / queues / nodes could be updated during the call. The readlock is added to protect this happens.

        May be i dint get it properly but based on my understanding either we need to hold write lock in CS and invoke super.nodeUpdate or ensure inside the AbstractYarnScheduler sufficient locking is there so that downstream classes need not worry. And further updates to applications, & queues happens with holding locks internally, so was not clear whether its required.

        Show
        Naganarasimha Naganarasimha G R added a comment - Thanks Tan, Wangda , Should not remove synchronized lock of recoverContainersOnNode / nodeUpdate in AbstractYarnScheduler, as they are used by FifoScheduler IMHO FifoScheduler should extend and get it synchronized if required (nodeUpdate is already done in that way) because other schedulers are not dependent on it and better to avoid holding both kind of locks (reentrant lock and object's lock) for other schedulers. Should keep readlock inside CS#nodeUpdate, since it is possible that collection of applications / queues / nodes could be updated during the call. The readlock is added to protect this happens. May be i dint get it properly but based on my understanding either we need to hold write lock in CS and invoke super.nodeUpdate or ensure inside the AbstractYarnScheduler sufficient locking is there so that downstream classes need not worry. And further updates to applications, & queues happens with holding locks internally, so was not clear whether its required.
        Hide
        leftnoteasy Wangda Tan added a comment -

        IMHO FifoScheduler should extend and get it synchronized if required (nodeUpdate is already done in that way).

        Extend in FifoScheduler sounds like a better solution.

        May be i dint get it properly but based on my understanding either we need to hold write lock in CS and invoke super.nodeUpdate..

        YARN-5706 changes to use readLock for part of the nodeUpdate and writeLock for rest of the part, two reasons:

        • Use readLock to prevent add/remove of apps / queues / nodes, and to reduce critical section. Because any changes to above collections need to hold writelock.
        • Use writeLock to for container allocation.
        Show
        leftnoteasy Wangda Tan added a comment - IMHO FifoScheduler should extend and get it synchronized if required (nodeUpdate is already done in that way). Extend in FifoScheduler sounds like a better solution. May be i dint get it properly but based on my understanding either we need to hold write lock in CS and invoke super.nodeUpdate.. YARN-5706 changes to use readLock for part of the nodeUpdate and writeLock for rest of the part, two reasons: Use readLock to prevent add/remove of apps / queues / nodes, and to reduce critical section. Because any changes to above collections need to hold writelock. Use writeLock to for container allocation.
        Hide
        Naganarasimha Naganarasimha G R added a comment -

        My bad. i wrongly read the code as CS holding its private lock (different from the AbstractYarnScheduler) because of which i had given earlier few comments. Now i realize many things clearly. Thanks to Sunil G (for offline chat) and Tan, Wangda for clarifying it.
        But just one more doubt.

        FairScheduler holds write lock while doing super.nodeUpdate(nm) and CS holds read lock, should it not be that we should hold a lock inside AbstractYarnScheduler.nodeUpdate(RMNode) as AbstractYarnScheduler.nodeUpdate is mainly sending out the events for the container finished, updating the health status. In fact all the AYS.nodeUpdate operations could be done even without any lock so multiple nodes could update these states in parallel, but may be read lock is required so that write lock will wait for all readlock operations to finish and then do operations under it.

        For the first point "removing the syncronization on nodeUpdate and extending and keeping it in FifoScheduler" will update the patch.

        Show
        Naganarasimha Naganarasimha G R added a comment - My bad. i wrongly read the code as CS holding its private lock (different from the AbstractYarnScheduler) because of which i had given earlier few comments. Now i realize many things clearly. Thanks to Sunil G (for offline chat) and Tan, Wangda for clarifying it. But just one more doubt. FairScheduler holds write lock while doing super.nodeUpdate(nm) and CS holds read lock, should it not be that we should hold a lock inside AbstractYarnScheduler.nodeUpdate(RMNode) as AbstractYarnScheduler.nodeUpdate is mainly sending out the events for the container finished, updating the health status. In fact all the AYS.nodeUpdate operations could be done even without any lock so multiple nodes could update these states in parallel, but may be read lock is required so that write lock will wait for all readlock operations to finish and then do operations under it. For the first point "removing the syncronization on nodeUpdate and extending and keeping it in FifoScheduler" will update the patch.
        Hide
        leftnoteasy Wangda Tan added a comment -

        But just one more doubt....

        In my mind, methods in AYS is majorly for better code reuse, schedulers can use treat it more like libraries. And locking approach can be varied for different schedulers, this is the safest/most flexible way to me.

        Show
        leftnoteasy Wangda Tan added a comment - But just one more doubt.... In my mind, methods in AYS is majorly for better code reuse, schedulers can use treat it more like libraries. And locking approach can be varied for different schedulers, this is the safest/most flexible way to me.
        Hide
        Naganarasimha Naganarasimha G R added a comment -

        Ok in that case will consider only removing of synchronization for nodeUpdate and keep it specific to FIFO scheduler as part of this jira.

        Show
        Naganarasimha Naganarasimha G R added a comment - Ok in that case will consider only removing of synchronization for nodeUpdate and keep it specific to FIFO scheduler as part of this jira.
        Hide
        Naganarasimha Naganarasimha G R added a comment -

        attaching a patch as per the last discussion !

        Show
        Naganarasimha Naganarasimha G R added a comment - attaching a patch as per the last discussion !
        Hide
        sunilg Sunil G added a comment -

        Generally patch looks fine for me.

        Show
        sunilg Sunil G added a comment - Generally patch looks fine for me.
        Hide
        leftnoteasy Wangda Tan added a comment -

        +1, pending Jenkins. Thanks Naganarasimha G R

        Show
        leftnoteasy Wangda Tan added a comment - +1, pending Jenkins. Thanks Naganarasimha G R
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 14s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 13m 35s trunk passed
        +1 compile 0m 31s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 34s trunk passed
        +1 mvneclipse 0m 16s trunk passed
        +1 findbugs 0m 58s trunk passed
        +1 javadoc 0m 20s trunk passed
        +1 mvninstall 0m 30s the patch passed
        +1 compile 0m 29s the patch passed
        +1 javac 0m 29s the patch passed
        +1 checkstyle 0m 20s the patch passed
        +1 mvnsite 0m 31s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 6s the patch passed
        +1 javadoc 0m 19s the patch passed
        -1 unit 38m 41s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 18s The patch does not generate ASF License warnings.
        60m 35s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6025
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12845244/YARN-6025.002.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 2e8598881bf8 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 591fb15
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14531/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14531/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14531/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 14s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 13m 35s trunk passed +1 compile 0m 31s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 34s trunk passed +1 mvneclipse 0m 16s trunk passed +1 findbugs 0m 58s trunk passed +1 javadoc 0m 20s trunk passed +1 mvninstall 0m 30s the patch passed +1 compile 0m 29s the patch passed +1 javac 0m 29s the patch passed +1 checkstyle 0m 20s the patch passed +1 mvnsite 0m 31s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 19s the patch passed -1 unit 38m 41s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 18s The patch does not generate ASF License warnings. 60m 35s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart Subsystem Report/Notes Docker Image:yetus/hadoop:a9ad5d6 JIRA Issue YARN-6025 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12845244/YARN-6025.002.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 2e8598881bf8 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 591fb15 Default Java 1.8.0_111 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/14531/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14531/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/14531/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        leftnoteasy Wangda Tan added a comment -

        Test failure is not related, it is tracked by: YARN-5548. Committing it now.

        Show
        leftnoteasy Wangda Tan added a comment - Test failure is not related, it is tracked by: YARN-5548 . Committing it now.
        Hide
        leftnoteasy Wangda Tan added a comment -

        Committed to trunk/branch-2, thanks Naganarasimha G R for working on this, and thanks reviews from Sunil G!

        Show
        leftnoteasy Wangda Tan added a comment - Committed to trunk/branch-2, thanks Naganarasimha G R for working on this, and thanks reviews from Sunil G !
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11068 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11068/)
        YARN-6025. Fix synchronization issues of (wangda: rev f69a107aeccc68ca1085a7be8093d36b2f45eaa1)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11068 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11068/ ) YARN-6025 . Fix synchronization issues of (wangda: rev f69a107aeccc68ca1085a7be8093d36b2f45eaa1) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
        Hide
        Naganarasimha Naganarasimha G R added a comment -

        Thanks for the review and commit Wangda Tan and additional review from Sunil G

        Show
        Naganarasimha Naganarasimha G R added a comment - Thanks for the review and commit Wangda Tan and additional review from Sunil G

          People

          • Assignee:
            Naganarasimha Naganarasimha G R
            Reporter:
            Naganarasimha Naganarasimha G R
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development