Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-4348

ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.7.2, 2.6.2
    • Fix Version/s: 2.7.2, 2.6.3
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:

      1. syncInternal timeouts,
      2. but sync succeeded later on.

      We should use zkResyncWaitTime as the timeout value.

      1. log.txt
        7 kB
        Tsuyoshi Ozawa
      2. YARN-4348.001.patch
        1 kB
        Jian He
      3. YARN-4348.001.patch
        1 kB
        Tsuyoshi Ozawa
      4. YARN-4348-branch-2.7.002.patch
        2 kB
        Tsuyoshi Ozawa
      5. YARN-4348-branch-2.7.003.patch
        3 kB
        Tsuyoshi Ozawa
      6. YARN-4348-branch-2.7.004.patch
        3 kB
        Tsuyoshi Ozawa

        Issue Links

          Activity

          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Attaching a log file.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Attaching a log file.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          The issue looks similar to YARN-3753. One workaround is to change the timeout for the sync to zkResyncWaitTime as Jian He changed on YARN-3753. Attaching a patch for this.

          If the timeout be increased, the probability of the case will be decreased, but it can still happen. e.g. ZK's server packet for the reply against sync is dropped after the operation itself success.

          Show
          ozawa Tsuyoshi Ozawa added a comment - The issue looks similar to YARN-3753 . One workaround is to change the timeout for the sync to zkResyncWaitTime as Jian He changed on YARN-3753 . Attaching a patch for this. If the timeout be increased, the probability of the case will be decreased, but it can still happen. e.g. ZK's server packet for the reply against sync is dropped after the operation itself success.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          TestZKRMStateStore fails because of timeout. Let me survey.

          Show
          ozawa Tsuyoshi Ozawa added a comment - TestZKRMStateStore fails because of timeout. Let me survey.
          Hide
          jianhe Jian He added a comment -

          upload same patch to kick jenkins

          Show
          jianhe Jian He added a comment - upload same patch to kick jenkins
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Found that testZKRootPathAcls fails because of time out with the patch. I will check it more deeper.

          TestZKRMStateStore-output.txt
          2015-11-15 02:15:17,324 INFO  [main] zookeeper.JUnit4ZKTestRunner (JUnit4ZKTestRunner.java:evaluate(50)) - RUNNING TEST METHOD testZKRootPathAcls
          ... ...
          2015-11-15 02:30:12,774 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
          2015-11-15 02:30:12,774 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
          2015-11-15 02:30:12,775 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping response for sessionid: 0x15108ecd3b20001 after 0ms
          2015-11-15 02:30:14,776 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
          2015-11-15 02:30:14,776 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
          2015-11-15 02:30:14,776 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping response for sessionid: 0x15108ecd3b20001 after 0ms
          ~                            
          
          Show
          ozawa Tsuyoshi Ozawa added a comment - Found that testZKRootPathAcls fails because of time out with the patch. I will check it more deeper. TestZKRMStateStore-output.txt 2015-11-15 02:15:17,324 INFO [main] zookeeper.JUnit4ZKTestRunner (JUnit4ZKTestRunner.java:evaluate(50)) - RUNNING TEST METHOD testZKRootPathAcls ... ... 2015-11-15 02:30:12,774 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 2015-11-15 02:30:12,774 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 2015-11-15 02:30:12,775 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping response for sessionid: 0x15108ecd3b20001 after 0ms 2015-11-15 02:30:14,776 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 2015-11-15 02:30:14,776 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x15108ecd3b20001 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 2015-11-15 02:30:14,776 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping response for sessionid: 0x15108ecd3b20001 after 0ms ~
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          The test failure I mentioned is caused by using zkResyncWaitTime as the timeout value of sync operation - the default value of zkResyncWaitTime is smaller than zkSessionTimeout. We should use the timeout value which is larger than zkSessionTimeout, so just changing to use zkSessionTimeout * 3.

          In addition to this, we should care about the failure of sync operation at startup time to preventing RM from continuing to run in illegal state - ZK's inconsistent view.

          Attaching a patch to fix the test failure and the error handling at startup time(startInternal). Jian He, could you take a look?

          Show
          ozawa Tsuyoshi Ozawa added a comment - The test failure I mentioned is caused by using zkResyncWaitTime as the timeout value of sync operation - the default value of zkResyncWaitTime is smaller than zkSessionTimeout. We should use the timeout value which is larger than zkSessionTimeout, so just changing to use zkSessionTimeout * 3. In addition to this, we should care about the failure of sync operation at startup time to preventing RM from continuing to run in illegal state - ZK's inconsistent view. Attaching a patch to fix the test failure and the error handling at startup time(startInternal). Jian He , could you take a look?
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 8s docker + precommit patch detected.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          -1 mvninstall 7m 24s root in branch-2.7 failed.
          +1 compile 0m 21s branch-2.7 passed with JDK v1.8.0_60
          +1 compile 0m 22s branch-2.7 passed with JDK v1.7.0_79
          +1 checkstyle 0m 18s branch-2.7 passed
          +1 mvnsite 0m 35s branch-2.7 passed
          +1 mvneclipse 0m 16s branch-2.7 passed
          -1 findbugs 1m 20s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in branch-2.7 has 1 extant Findbugs warnings.
          +1 javadoc 0m 27s branch-2.7 passed with JDK v1.8.0_60
          +1 javadoc 0m 25s branch-2.7 passed with JDK v1.7.0_79
          +1 mvninstall 0m 27s the patch passed
          +1 compile 0m 22s the patch passed with JDK v1.8.0_60
          +1 javac 0m 22s the patch passed
          +1 compile 0m 22s the patch passed with JDK v1.7.0_79
          +1 javac 0m 22s the patch passed
          -1 checkstyle 0m 14s Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 229, now 230).
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          -1 whitespace 0m 2s The patch has 700 line(s) that end in whitespace. Use git apply --whitespace=fix.
          -1 whitespace 0m 19s The patch has 95 line(s) with tabs.
          +1 findbugs 1m 21s the patch passed
          +1 javadoc 0m 21s the patch passed with JDK v1.8.0_60
          +1 javadoc 0m 27s the patch passed with JDK v1.7.0_79
          -1 unit 52m 48s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60.
          -1 unit 53m 7s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79.
          -1 asflicense 44m 16s Patch generated 71 ASF License warnings.
          167m 48s



          Reason Tests
          JDK v1.8.0_60 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels
            hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestResourceTrackerService
            hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf
            hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
            hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
          JDK v1.7.0_79 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels
            hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestResourceTrackerService
            hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf
            hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
            hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-16
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12772452/YARN-4348-branch-2.7.002.patch
          JIRA Issue YARN-4348
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 8265c5a7ae9a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-fa12328/precommit/personality/hadoop.sh
          git revision branch-2.7 / 8ea88ab
          mvninstall https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/branch-mvninstall-root.txt
          findbugs v3.0.0
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9694/testReport/
          asflicense https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Max memory used 149MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9694/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 8s docker + precommit patch detected. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 mvninstall 7m 24s root in branch-2.7 failed. +1 compile 0m 21s branch-2.7 passed with JDK v1.8.0_60 +1 compile 0m 22s branch-2.7 passed with JDK v1.7.0_79 +1 checkstyle 0m 18s branch-2.7 passed +1 mvnsite 0m 35s branch-2.7 passed +1 mvneclipse 0m 16s branch-2.7 passed -1 findbugs 1m 20s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in branch-2.7 has 1 extant Findbugs warnings. +1 javadoc 0m 27s branch-2.7 passed with JDK v1.8.0_60 +1 javadoc 0m 25s branch-2.7 passed with JDK v1.7.0_79 +1 mvninstall 0m 27s the patch passed +1 compile 0m 22s the patch passed with JDK v1.8.0_60 +1 javac 0m 22s the patch passed +1 compile 0m 22s the patch passed with JDK v1.7.0_79 +1 javac 0m 22s the patch passed -1 checkstyle 0m 14s Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 229, now 230). +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 14s the patch passed -1 whitespace 0m 2s The patch has 700 line(s) that end in whitespace. Use git apply --whitespace=fix. -1 whitespace 0m 19s The patch has 95 line(s) with tabs. +1 findbugs 1m 21s the patch passed +1 javadoc 0m 21s the patch passed with JDK v1.8.0_60 +1 javadoc 0m 27s the patch passed with JDK v1.7.0_79 -1 unit 52m 48s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60. -1 unit 53m 7s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79. -1 asflicense 44m 16s Patch generated 71 ASF License warnings. 167m 48s Reason Tests JDK v1.8.0_60 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens   hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels   hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA   hadoop.yarn.server.resourcemanager.TestAMAuthorization   hadoop.yarn.server.resourcemanager.TestResourceTrackerService   hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf   hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA   hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore JDK v1.7.0_79 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens   hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels   hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA   hadoop.yarn.server.resourcemanager.TestAMAuthorization   hadoop.yarn.server.resourcemanager.TestResourceTrackerService   hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf   hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA   hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore Subsystem Report/Notes Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-16 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12772452/YARN-4348-branch-2.7.002.patch JIRA Issue YARN-4348 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 8265c5a7ae9a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-fa12328/precommit/personality/hadoop.sh git revision branch-2.7 / 8ea88ab mvninstall https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/branch-mvninstall-root.txt findbugs v3.0.0 findbugs https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/whitespace-eol.txt whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/whitespace-tabs.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9694/testReport/ asflicense https://builds.apache.org/job/PreCommit-YARN-Build/9694/artifact/patchprocess/patch-asflicense-problems.txt modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Max memory used 149MB Powered by Apache Yetus http://yetus.apache.org Console output https://builds.apache.org/job/PreCommit-YARN-Build/9694/console This message was automatically generated.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          YARN-4306 for the assertion error of getProxy.

          I'll check related test failure, TestZKRMStateStore.

          Show
          ozawa Tsuyoshi Ozawa added a comment - YARN-4306 for the assertion error of getProxy. I'll check related test failure, TestZKRMStateStore.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          After taking a look deeper, the reply packet against sync operation looks to be pending in blocking list of ClientCnxn#EventTrread. After the thread interrupted, it's processed correctly.

          2015-11-19 07:21:47,955 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(733)) - Got auth sessionid:0x1511cb079430000
          ...
          2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot
          ...
          2015-11-19 07:21:48,013 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(818)) - Reading reply sessionid:0x1511cb079430000, packet:: clientPath:null serverPath:null finished:false header:: 10,1  replyHeader:: 10,11,0  request:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot,,v{s{31,s{'world,'anyone}}},0  response:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot 
          2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot
          2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot
          ...
          2015-11-19 07:22:03,027 INFO  [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore failed in state STARTED; cause: java.io.IOException: failing to sync operation at starting up RM
          java.io.IOException: failing to sync operation at starting up RM
          ...
          2015-11-19 07:22:03,029 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(141)) - AsyncDispatcher is draining to stop, igonring any new events.
          2015-11-19 07:22:03,030 INFO  [main-EventThread] recovery.ZKRMStateStore (ZKRMStateStore.java:processResult(122)) - ZooKeeper sync operation succeeded. path: /rmstore/ZKRMStateRoot
          2015-11-19 07:22:03,030 INFO  [org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread] recovery.ZKRMStateStore (ZKRMStateStore.java:run(1131)) - org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting!
          2015-11-19 07:22:03,030 INFO  [main-EventThread] recovery.ZKRMStateStore (ZKRMStateStore.java:processResult(124)) - ZooKeeper sync operation succeeded. path: /rmstore/ZKRMStateRoot
          
          Show
          ozawa Tsuyoshi Ozawa added a comment - After taking a look deeper, the reply packet against sync operation looks to be pending in blocking list of ClientCnxn#EventTrread. After the thread interrupted, it's processed correctly. 2015-11-19 07:21:47,955 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(733)) - Got auth sessionid:0x1511cb079430000 ... 2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot ... 2015-11-19 07:21:48,013 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(818)) - Reading reply sessionid:0x1511cb079430000, packet:: clientPath: null serverPath: null finished: false header:: 10,1 replyHeader:: 10,11,0 request:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot,,v{s{31,s{'world,'anyone}}},0 response:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot 2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot 2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot ... 2015-11-19 07:22:03,027 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore failed in state STARTED; cause: java.io.IOException: failing to sync operation at starting up RM java.io.IOException: failing to sync operation at starting up RM ... 2015-11-19 07:22:03,029 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(141)) - AsyncDispatcher is draining to stop, igonring any new events. 2015-11-19 07:22:03,030 INFO [main-EventThread] recovery.ZKRMStateStore (ZKRMStateStore.java:processResult(122)) - ZooKeeper sync operation succeeded. path: /rmstore/ZKRMStateRoot 2015-11-19 07:22:03,030 INFO [org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread] recovery.ZKRMStateStore (ZKRMStateStore.java:run(1131)) - org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting! 2015-11-19 07:22:03,030 INFO [main-EventThread] recovery.ZKRMStateStore (ZKRMStateStore.java:processResult(124)) - ZooKeeper sync operation succeeded. path: /rmstore/ZKRMStateRoot
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Found that this is caused by the lock ordering.

          1. (In main thread of RM) locking ZKRMStateStore(startInternal) -> waiting for lock.await()
          2. ZK's eventThread: Got SyncConnected event from ZK -> Calling ForwardingWatcher#process -> processWatchEvent called, but ZKRMStateStore has been locked since 1
          3. (In main thread of RM) timeout and IOException -> unlocking ZKRMStateStore() -> the callback, processEvent, of sync is fired.

          I will attach a patch to address this problem.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Found that this is caused by the lock ordering. 1. (In main thread of RM) locking ZKRMStateStore(startInternal) -> waiting for lock.await() 2. ZK's eventThread: Got SyncConnected event from ZK -> Calling ForwardingWatcher#process -> processWatchEvent called, but ZKRMStateStore has been locked since 1 3. (In main thread of RM) timeout and IOException -> unlocking ZKRMStateStore() -> the callback, processEvent, of sync is fired. I will attach a patch to address this problem.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Archiving artifacts
          [description-setter] Description set: YARN-4348
          Recording test results
          ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
          Email was triggered for: Failure - Any
          Sending email for trigger: Failure - Any
          An attempt to send an e-mail to empty list of recipients, ignored.
          Finished: FAILURE

          Hmm, Jenkins looks to be unhealthy.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Archiving artifacts [description-setter] Description set: YARN-4348 Recording test results ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Email was triggered for: Failure - Any Sending email for trigger: Failure - Any An attempt to send an e-mail to empty list of recipients, ignored. Finished: FAILURE Hmm, Jenkins looks to be unhealthy.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Kicking Jenkins again.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Kicking Jenkins again.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Ran test since last Jenkins failed to launch.

          -1 overall.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javadoc. The javadoc tool appears to have generated 2079 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Ran test since last Jenkins failed to launch. -1 overall. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated 2079 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings.
          Hide
          jianhe Jian He added a comment -

          Tsuyoshi Ozawa, should below call continue in the end ?

                    if (shouldRetryWithNewConnection(ke.code()) && retry < numRetries) {
                      LOG.info("Retrying operation on ZK with new Connection. " +
                          "Retry no. " + retry);
                      Thread.sleep(zkRetryInterval);
                      createConnection();
                      syncInternal(ke.getPath());
                    }
          
          Show
          jianhe Jian He added a comment - Tsuyoshi Ozawa , should below call continue in the end ? if (shouldRetryWithNewConnection(ke.code()) && retry < numRetries) { LOG.info( "Retrying operation on ZK with new Connection. " + "Retry no. " + retry); Thread .sleep(zkRetryInterval); createConnection(); syncInternal(ke.getPath()); }
          Hide
          ozawa Tsuyoshi Ozawa added a comment - - edited

          Jian He good catch. Adding missing continue statement after calling syncInternal in v4 patch.

          Show
          ozawa Tsuyoshi Ozawa added a comment - - edited Jian He good catch. Adding missing continue statement after calling syncInternal in v4 patch.
          Hide
          ozawa Tsuyoshi Ozawa added a comment - - edited

          Jenkins still fail. Opened YETUS-217 to track the problem. Kicking test-patch.sh on local.

          Show
          ozawa Tsuyoshi Ozawa added a comment - - edited Jenkins still fail. Opened YETUS-217 to track the problem. Kicking test-patch.sh on local.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Ran test since last Jenkins failed to launch. javadoc seems to be false positive since this patch doesn't include any changes of javadocs.

          -1 overall.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javadoc. The javadoc tool appears to have generated 2079 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Jian He could you take a look at latest patch?

          Show
          ozawa Tsuyoshi Ozawa added a comment - Ran test since last Jenkins failed to launch. javadoc seems to be false positive since this patch doesn't include any changes of javadocs. -1 overall. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated 2079 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. Jian He could you take a look at latest patch?
          Hide
          djp Junping Du added a comment -

          Hi Tsuyoshi Ozawa and Jian He, is this a blocker bug? Can we move it to 2.6.4?

          Show
          djp Junping Du added a comment - Hi Tsuyoshi Ozawa and Jian He , is this a blocker bug? Can we move it to 2.6.4?
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Junping Du IMHO, this is a blocker ticket of 2.6.3 and 2.7.3 since the problem is more serious than I've thought. please check this comment. This is a unexpected behaviour when RM fails over and it prevents RM fail over correctly.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Junping Du IMHO, this is a blocker ticket of 2.6.3 and 2.7.3 since the problem is more serious than I've thought. please check this comment . This is a unexpected behaviour when RM fails over and it prevents RM fail over correctly.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          zhihai xu Jian He
          I'm rethinking of this comment about sync callback to wait for sync completion: this can cause the lock problem described here.

          To deal with problem easily, we can just remove a barrier by the sync callback. This works well because ZK client's requests are sent to ZK server in order, unless ZK master server fails while recreating ZK connection. Quorum sync, ZOOKEEPER-2136, is good helper to deal with the corner case.

          What do you think?

          Show
          ozawa Tsuyoshi Ozawa added a comment - zhihai xu Jian He I'm rethinking of this comment about sync callback to wait for sync completion: this can cause the lock problem described here . To deal with problem easily, we can just remove a barrier by the sync callback. This works well because ZK client's requests are sent to ZK server in order, unless ZK master server fails while recreating ZK connection. Quorum sync, ZOOKEEPER-2136 , is good helper to deal with the corner case. What do you think?
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Jian He could you take a look?

          Show
          ozawa Tsuyoshi Ozawa added a comment - Jian He could you take a look?
          Hide
          jianhe Jian He added a comment -

          lgtm, thanks !

          Show
          jianhe Jian He added a comment - lgtm, thanks !
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Checking this in.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Checking this in.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Committed this to branch-2.7. Thanks Jian He for reviewing and reporting!

          I will cherrypick this to branch-2.6 after running tests.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Committed this to branch-2.7. Thanks Jian He for reviewing and reporting! I will cherrypick this to branch-2.6 after running tests.
          Hide
          djp Junping Du added a comment -

          I will cherrypick this to branch-2.6 after running tests.

          Hi Tsuyoshi Ozawa, would you check this in 2.6.3 branch as well given we mark this as a blocker for 2.6.3? Thanks!

          Show
          djp Junping Du added a comment - I will cherrypick this to branch-2.6 after running tests. Hi Tsuyoshi Ozawa , would you check this in 2.6.3 branch as well given we mark this as a blocker for 2.6.3? Thanks!
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Ran tests locally and pass tests on branch-2.6. Committing this to branch-2.6.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Ran tests locally and pass tests on branch-2.6. Committing this to branch-2.6.
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Junping Du I committed this to branch-2.6, which is targeting 2.6.3. Can I push this to branch-2.6.3?

          Show
          ozawa Tsuyoshi Ozawa added a comment - Junping Du I committed this to branch-2.6, which is targeting 2.6.3. Can I push this to branch-2.6.3?
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Now I committed this to branch-2.6.3 too. Thanks!

          Show
          ozawa Tsuyoshi Ozawa added a comment - Now I committed this to branch-2.6.3 too. Thanks!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8938 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8938/)
          Update CHANGES.txt for commit of YARN-4348 to branch-2.7 and branch-2.6. (ozawa: rev d7b3f8dbe818cff5fee4f4c0c70d306776aa318e)

          • hadoop-yarn-project/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #8938 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8938/ ) Update CHANGES.txt for commit of YARN-4348 to branch-2.7 and branch-2.6. (ozawa: rev d7b3f8dbe818cff5fee4f4c0c70d306776aa318e) hadoop-yarn-project/CHANGES.txt
          Hide
          djp Junping Du added a comment -

          Sounds good. Thanks!

          Show
          djp Junping Du added a comment - Sounds good. Thanks!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #675 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/675/)
          Update CHANGES.txt for commit of YARN-4348 to branch-2.7 and branch-2.6. (ozawa: rev d7b3f8dbe818cff5fee4f4c0c70d306776aa318e)

          • hadoop-yarn-project/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #675 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/675/ ) Update CHANGES.txt for commit of YARN-4348 to branch-2.7 and branch-2.6. (ozawa: rev d7b3f8dbe818cff5fee4f4c0c70d306776aa318e) hadoop-yarn-project/CHANGES.txt
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing fix-versions to reflect the same.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing fix-versions to reflect the same.
          Hide
          djp Junping Du added a comment -

          Hi Jian He and Tsuyoshi Ozawa, Does this fix need to go to trunk/branch-2/branch-2.8?

          Show
          djp Junping Du added a comment - Hi Jian He and Tsuyoshi Ozawa , Does this fix need to go to trunk/branch-2/branch-2.8?
          Hide
          jianhe Jian He added a comment -

          No, it doesn't need to. The zkstore implementation has been changed by using curator 2.8 upwards

          Show
          jianhe Jian He added a comment - No, it doesn't need to. The zkstore implementation has been changed by using curator 2.8 upwards
          Hide
          djp Junping Du added a comment -

          Got it. Thanks for confirmation here, Jian!

          Show
          djp Junping Du added a comment - Got it. Thanks for confirmation here, Jian!
          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Yes, Jian is correct.

          Show
          ozawa Tsuyoshi Ozawa added a comment - Yes, Jian is correct.

            People

            • Assignee:
              ozawa Tsuyoshi Ozawa
              Reporter:
              ozawa Tsuyoshi Ozawa
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development