
Fix deadlock in TestRMHA.testTransitionedToStandbyShouldNotHang

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The test case timed out in the linked build. This needs to be investigated.

      1. ThreadDump.txt
        72 kB
        Rohith Sharma K S
      2. YARN-5920.01.patch
        3 kB
        Varun Saxena
      3. YARN-5920.02.patch
        1 kB
        Varun Saxena

        Activity

        varun_saxena Varun Saxena added a comment -

        This test is failing due to a deadlock.

        When the RM transitions to active, we store the RM delegation token master key in the state store. For this, we put a state store event in the AsyncDispatcher.
        After the event is picked up from the AsyncDispatcher, we call RMStateStore#handleStoreEvent, where we acquire a write lock. Then, from StoreRMDTMasterKeyTransition, we call MemoryRMStateStore#storeRMDTMasterKeyState, which is a synchronized method.

        Now in TestRMHA, we override updateApplicationState in MemoryRMStateStore, and the override is also synchronized. By overriding this method, we bypass RMStateStore, i.e. when the test calls rm.getRMContext().getStateStore().updateApplicationState(null), we do not try to acquire the write lock in RMStateStore. When updateApplicationState calls notifyStoreOperationFailed, we call either RMStateStore#isFencedState, which acquires the read lock, or RMStateStore#updateFencedState, which acquires the write lock.

        Now, due to a race, if MemoryRMStateStore#updateApplicationState is called before MemoryRMStateStore#storeRMDTMasterKeyState but after RMStateStore#storeRMDTMasterKey, there can be a deadlock.
        This is because the thread calling notifyStoreOperationFailed is blocked while trying to acquire the read or write lock in RMStateStore, since the write lock is held by the thread storing the RM DT master key. Meanwhile, the thread calling MemoryRMStateStore#storeRMDTMasterKeyState is blocked because MemoryRMStateStore#updateApplicationState is synchronized and the thread inside it is itself blocked on the read/write lock.
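
The cycle described above can be reproduced outside Hadoop with a small sketch. Everything in it is a hypothetical stand-in rather than Hadoop code: `LockInversionDemo`, the `monitor` object (playing the role of the synchronized MemoryRMStateStore instance) and `rwLock` (playing the role of the read/write lock in RMStateStore). A timed `tryLock` replaces the blocking lock call so that the demo terminates instead of hanging, and latches force the interleaving that the race produces only occasionally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the two locks involved; not Hadoop code.
final class LockInversionDemo {

  /** Returns true if each thread was seen waiting for the lock the other holds. */
  static boolean cycleObserved() {
    ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    Object monitor = new Object();
    CountDownLatch writeLockHeld = new CountDownLatch(1);
    CountDownLatch monitorHeld = new CountDownLatch(1);

    // "Dispatcher" thread: write lock first, then the monitor.
    Thread dispatcher = new Thread(() -> {
      rwLock.writeLock().lock();
      try {
        writeLockHeld.countDown();
        monitorHeld.await();          // force the unlucky interleaving
        synchronized (monitor) {
          // the master key would be stored here
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        rwLock.writeLock().unlock();
      }
    });
    dispatcher.setDaemon(true);
    dispatcher.start();

    try {
      writeLockHeld.await();
      boolean gotReadLock;
      boolean dispatcherBlocked;
      // "Test" thread: monitor first, then the read lock (the opposite order).
      synchronized (monitor) {
        monitorHeld.countDown();
        // A plain lock() would wait forever here; the timeout lets the demo finish.
        gotReadLock = rwLock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
        dispatcherBlocked = dispatcher.getState() == Thread.State.BLOCKED;
        if (gotReadLock) {
          rwLock.readLock().unlock();
        }
      }
      dispatcher.join();              // the monitor is free again, so this returns
      return !gotReadLock && dispatcherBlocked;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("cycle observed: " + cycleObserved());
  }
}
```

With an untimed `lock()` in place of `tryLock`, neither thread could make progress, which matches the hang seen in the test.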

        To solve this, we should override updateApplicationStateInternal in MemoryRMStateStore and invoke RMStateStore#updateApplicationState, so that the normal flow of processing state store events is followed. This will get rid of the deadlock.
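
The idea behind this proposal is that every path takes the read/write lock before the monitor, so the two locks are always acquired in the same order. A minimal sketch of that pattern follows, under loud assumptions: `ToyStore`, `OrderedLockingDemo` and all method names are hypothetical, and the toy store takes the write lock directly in its public methods, whereas the real RMStateStore routes calls through dispatcher events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical toy store; only the lock ordering mirrors the discussion.
abstract class ToyStore {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  final List<String> log = Collections.synchronizedList(new ArrayList<>());

  // Public entry points always take the write lock before calling a hook.
  public final void storeKey() { runUnderWriteLock(this::storeKeyInternal); }
  public final void updateApp() { runUnderWriteLock(this::updateAppInternal); }

  private void runUnderWriteLock(Runnable action) {
    lock.writeLock().lock();
    try {
      action.run();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Needs the write lock; the lock is reentrant, so a hook may call this.
  protected void notifyFailed() {
    lock.writeLock().lock();
    try {
      log.add("fenced");
    } finally {
      lock.writeLock().unlock();
    }
  }

  protected abstract void storeKeyInternal();
  protected abstract void updateAppInternal();
}

final class OrderedLockingDemo {
  /** Returns true if both operations finish, i.e. no deadlock occurred. */
  static boolean bothComplete() {
    ToyStore store = new ToyStore() {
      // Synchronized hooks are safe here: the write lock is already held.
      @Override protected synchronized void storeKeyInternal() { log.add("key"); }
      @Override protected synchronized void updateAppInternal() { notifyFailed(); }
    };
    Thread a = new Thread(store::storeKey);
    Thread b = new Thread(store::updateApp);
    a.start();
    b.start();
    try {
      a.join(2000);
      b.join(2000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !a.isAlive() && !b.isAlive() && store.log.size() == 2;
  }
}
```

Because the overridden hooks run only while the write lock is held, a synchronized hook can no longer wait on a lock that the other thread owns.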

        This deadlock can be easily simulated by putting a sleep in StoreRMDTMasterKeyTransition#transition.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime  Comment
        0     reexec      0m 22s   Docker mode activated.
        +1    @author     0m 0s    The patch does not contain any @author tags.
        +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
        +1    mvninstall  6m 50s   trunk passed
        +1    compile     0m 32s   trunk passed
        +1    checkstyle  0m 21s   trunk passed
        +1    mvnsite     0m 38s   trunk passed
        +1    mvneclipse  0m 16s   trunk passed
        +1    findbugs    0m 59s   trunk passed
        +1    javadoc     0m 26s   trunk passed
        +1    mvninstall  0m 41s   the patch passed
        +1    compile     0m 38s   the patch passed
        +1    javac       0m 38s   the patch passed
        +1    checkstyle  0m 22s   the patch passed
        +1    mvnsite     0m 42s   the patch passed
        +1    mvneclipse  0m 17s   the patch passed
        +1    whitespace  0m 0s    The patch has no whitespace issues.
        +1    findbugs    1m 22s   the patch passed
        +1    javadoc     0m 22s   the patch passed
        -1    unit        43m 17s  hadoop-yarn-server-resourcemanager in the patch failed.
        +1    asflicense  0m 16s   The patch does not generate ASF License warnings.
                          59m 41s



        Reason              Tests
        Failed junit tests  hadoop.yarn.server.resourcemanager.TestRMRestart
                            hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



        Subsystem       Report/Notes
        Docker          Image:yetus/hadoop:a9ad5d6
        JIRA Issue      YARN-5920
        JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12839881/YARN-5920.01.patch
        Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname           Linux 1834e5667102 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool      maven
        Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision    trunk / 683e0c7
        Default Java    1.8.0_111
        findbugs        v3.0.0
        unit            https://builds.apache.org/job/PreCommit-YARN-Build/13999/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results    https://builds.apache.org/job/PreCommit-YARN-Build/13999/testReport/
        modules         C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output  https://builds.apache.org/job/PreCommit-YARN-Build/13999/console
        Powered by      Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        rohithsharma Rohith Sharma K S added a comment -

        Thanks Varun for the analysis and patch. Regarding the patch, this can be solved simply by removing synchronized from the overridden method in the test, i.e.

        MemoryRMStateStore memStore = new MemoryRMStateStore() {
          // Not synchronized, so the store's monitor is not held during the call below.
          @Override
          public void updateApplicationState(
              ApplicationStateData appState) {
            notifyStoreOperationFailed(new StoreFencedException());
          }
        };
        

        Basically, the test wants to call notifyStoreOperationFailed from outside the writeLock. If you look at the ZKRMStateStore implementation, notifyStoreOperationFailed is called from a thread.
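
To illustrate why dropping synchronized is enough, here is a hedged sketch with the same kind of stand-ins as the thread's analysis: `UnsynchronizedOverrideDemo`, `rwLock` and `monitor` are hypothetical names, not Hadoop code. The failing caller holds no monitor while it waits for the read/write lock, so the thread holding the write lock can still enter the synchronized section and finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in; only the locking pattern mirrors the discussion.
final class UnsynchronizedOverrideDemo {

  /** Returns true if both threads finish, i.e. no deadlock occurred. */
  static boolean completes() {
    ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    Object monitor = new Object();
    CountDownLatch writeLockHeld = new CountDownLatch(1);

    // "Dispatcher" thread: write lock first, then the monitor.
    Thread dispatcher = new Thread(() -> {
      rwLock.writeLock().lock();
      try {
        writeLockHeld.countDown();
        Thread.sleep(100);            // widen the race window
        synchronized (monitor) {
          // the master key would be stored here
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        rwLock.writeLock().unlock();
      }
    });

    // "Test" thread: waits for the read lock without holding the monitor.
    Thread failing = new Thread(() -> {
      rwLock.readLock().lock();
      try {
        // a fenced-state check would run here
      } finally {
        rwLock.readLock().unlock();
      }
    });

    dispatcher.setDaemon(true);
    failing.setDaemon(true);
    dispatcher.start();
    try {
      writeLockHeld.await();
      failing.start();
      dispatcher.join(2000);
      failing.join(2000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !dispatcher.isAlive() && !failing.isAlive();
  }
}
```

The failing thread still waits until the write lock is released, but nothing waits on it in return, so the wait is bounded.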

        varun_saxena Varun Saxena added a comment -

        Basically, the test wants to call notifyStoreOperationFailed from outside the writeLock.

        Ok...Will change.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime  Comment
        0     reexec      0m 16s   Docker mode activated.
        +1    @author     0m 0s    The patch does not contain any @author tags.
        +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
        +1    mvninstall  6m 57s   trunk passed
        +1    compile     0m 32s   trunk passed
        +1    checkstyle  0m 20s   trunk passed
        +1    mvnsite     0m 39s   trunk passed
        +1    mvneclipse  0m 17s   trunk passed
        +1    findbugs    1m 0s    trunk passed
        +1    javadoc     0m 21s   trunk passed
        +1    mvninstall  0m 32s   the patch passed
        +1    compile     0m 31s   the patch passed
        +1    javac       0m 31s   the patch passed
        +1    checkstyle  0m 18s   the patch passed
        +1    mvnsite     0m 36s   the patch passed
        +1    mvneclipse  0m 14s   the patch passed
        +1    whitespace  0m 0s    The patch has no whitespace issues.
        +1    findbugs    1m 6s    the patch passed
        +1    javadoc     0m 18s   the patch passed
        -1    unit        38m 26s  hadoop-yarn-server-resourcemanager in the patch failed.
        +1    asflicense  0m 17s   The patch does not generate ASF License warnings.
                          53m 59s



        Reason              Tests
        Failed junit tests  hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem       Report/Notes
        Docker          Image:yetus/hadoop:a9ad5d6
        JIRA Issue      YARN-5920
        JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12840042/YARN-5920.02.patch
        Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname           Linux 23b030ef373f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool      maven
        Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision    trunk / afcf8d3
        Default Java    1.8.0_111
        findbugs        v3.0.0
        unit            https://builds.apache.org/job/PreCommit-YARN-Build/14021/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results    https://builds.apache.org/job/PreCommit-YARN-Build/14021/testReport/
        modules         C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output  https://builds.apache.org/job/PreCommit-YARN-Build/14021/console
        Powered by      Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        rohithsharma Rohith Sharma K S added a comment -

        +1 LGTM, will commit it shortly

        rohithsharma Rohith Sharma K S added a comment -

        Committed to trunk/branch-2. Thanks Varun for the patch!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10886 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10886/)
        YARN-5920. Fix deadlock in (rohithsharmaks: rev e15c20edba1e9a23475ee6a4dfbadbdb8c1f668a)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
        varun_saxena Varun Saxena added a comment -

        Thanks Rohith Sharma K S for the review and commit.


          People

          • Assignee:
            varun_saxena Varun Saxena
            Reporter:
            rohithsharma Rohith Sharma K S
          • Votes:
            0
            Watchers:
            5
