Hadoop YARN / YARN-3753

RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.7.1
    • Component/s: yarn
    • Labels:
      None

      Description

      RM failed to come up with the following error while submitting a MapReduce job.

      RM log
      2015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006
      java.io.IOException: Wait for ZKClient creation timed out
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
      	at java.lang.Thread.run(Thread.java:745)
      2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      java.io.IOException: Wait for ZKClient creation timed out
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
      	at java.lang.Thread.run(Thread.java:745)
      
      1. YARN-3753.1.patch
        4 kB
        Jian He
      2. YARN-3753.2.patch
        4 kB
        Jian He
      3. YARN-3753.patch
        2 kB
        Jian He

        Activity

        jianhe Jian He added a comment -

        This happens because the exception new IOException("Wait for ZKClient creation timed out") is not retried by the upper-level runWithRetries method, which causes the RM to fail. We've seen quite a few issues with the retry logic of the zk-store; YARN-2716 should be the long-term solution for all of them. In the interim, I'm writing a quick workaround patch, as this problem makes the RM unavailable.
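
As a rough illustration of the behaviour described above, the following is a minimal sketch of a retry wrapper that treats an IOException as retriable instead of letting it escape on the first failure. It is not the ZKRMStateStore code: the class `RetrySketch`, the parameters `maxRetries` and `retryIntervalMs`, and the choice to retry every IOException are assumptions made for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch; not the ZKRMStateStore implementation.
final class RetrySketch {

    /** Runs the action, retrying up to maxRetries times when it throws IOException. */
    static <T> T runWithRetries(Callable<T> action, int maxRetries, long retryIntervalMs)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return action.call();
            } catch (IOException e) {
                // Give up only once the retry budget is exhausted.
                if (retries >= maxRetries) {
                    throw e;
                }
                retries++;
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // The action fails twice with the timeout message, then succeeds.
        String result = runWithRetries(() -> {
            if (calls[0]++ < 2) {
                throw new IOException("Wait for ZKClient creation timed out");
            }
            return "stored";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With a wrapper of this shape, a transient timeout costs a few retries rather than a fatal STATE_STORE_OP_FAILED event; whether that is the right trade-off depends on how long the store can safely block.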

        kasha Karthik Kambatla added a comment -

        Jian He - YARN-2716 is ready for review. I can make time for addressing any comments to get this in for trunk and branch-2. Given that, would it make sense to limit this fix to branch-2.7?

        jianhe Jian He added a comment -

        Karthik Kambatla, sure, that makes sense. This can go into branch-2.7 only, and YARN-2716 can go into trunk and branch-2.

        jianhe Jian He added a comment -

        After investigating further with Xuan, the problem is actually that the wait time in the method below is set to zkSessionTimeout (only 10 seconds), which doesn't make much sense: the wait here is for the ZK connection to be re-established.

        while (zkClient == null) {
          ZKRMStateStore.this.wait(zkSessionTimeout);
          if (zkClient != null) {
            break;
          }
        
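
The wait loop quoted above can be sketched in isolation as follows. This is a hypothetical, self-contained illustration rather than the actual ZKRMStateStore method: `ClientWaitSketch`, `setClient`, `awaitClient`, and `maxWaitMs` are invented names, and a plain `Object` stands in for the ZooKeeper client. The point it tries to show is that the wait budget is a parameter of its own instead of being tied to the session timeout.

```java
import java.io.IOException;

// Hypothetical sketch; a plain Object stands in for zkClient.
final class ClientWaitSketch {
    private Object client;

    /** Called by the reconnect path once a new client exists. */
    synchronized void setClient(Object newClient) {
        client = newClient;
        notifyAll(); // wake any thread blocked in awaitClient
    }

    /** Blocks until a client is available or maxWaitMs has elapsed. */
    synchronized Object awaitClient(long maxWaitMs)
            throws IOException, InterruptedException {
        long deadlineNanos = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (client == null) {
            long remainingMs = (deadlineNanos - System.nanoTime()) / 1_000_000L;
            // Check before waiting: wait(0) would block indefinitely.
            if (remainingMs <= 0) {
                throw new IOException("Wait for client creation timed out");
            }
            wait(remainingMs);
        }
        return client;
    }

    public static void main(String[] args) throws Exception {
        ClientWaitSketch store = new ClientWaitSketch();
        Thread reconnect = new Thread(() -> {
            try {
                Thread.sleep(50); // simulate a short reconnect delay
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            store.setClient(new Object());
        });
        reconnect.start();
        store.awaitClient(5_000L);
        reconnect.join();
        System.out.println("client available");
    }
}
```

Re-checking the condition in a loop guards against spurious wakeups, and recomputing the remaining time on each pass keeps the total wait bounded by `maxWaitMs`.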
        jianhe Jian He added a comment - - edited

        Uploaded a patch to set the wait time based on numRetries * retry-interval.

        I reproduced this issue locally as follows:
        1. Start ZK.
        2. Start RM.
        3. Kill ZK.
        4. Submit a job.

        • Without the patch, RM fails with the same IOException("Wait for ZKClient creation timed out").
        • With the patch, after the ZK server is restarted, RM and the job continue to run successfully.
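
The arithmetic behind "numRetries * retry-interval" can be sketched as below. The helper `waitBudgetMs` and the sample values (30 retries, 1000 ms interval) are illustrative assumptions, not Hadoop defaults and not code from the patch; only the 10-second session timeout comes from the discussion above.

```java
// Hypothetical sketch of the wait-budget calculation.
final class WaitBudgetSketch {

    /** Total time to wait for reconnection: retries multiplied by the retry interval. */
    static long waitBudgetMs(int numRetries, long retryIntervalMs) {
        // multiplyExact throws ArithmeticException on overflow instead of wrapping.
        return Math.multiplyExact((long) numRetries, retryIntervalMs);
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 10_000L;          // old wait: the 10 s session timeout
        long budgetMs = waitBudgetMs(30, 1_000L); // illustrative values only
        System.out.println("old wait (ms): " + sessionTimeoutMs);
        System.out.println("new wait (ms): " + budgetMs);
    }
}
```

Deriving the wait from the retry settings means that lengthening the retry budget also lengthens how long the RM tolerates a ZK outage before failing.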
        xgong Xuan Gong added a comment -

        This is a short-term solution for branch-2.7 only. The main idea is to increase the time the RM waits to reconnect to ZK.

        I am OK with this patch. I will commit it later unless Karthik Kambatla has additional comments.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote  Subsystem        Runtime  Comment
        0     pre-patch        16m 8s   Pre-patch trunk compilation is healthy.
        +1    @author          0m 0s    The patch does not contain any @author tags.
        -1    tests included   0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1    javac            7m 37s   There were no new javac warning messages.
        +1    javadoc          9m 36s   There were no new javadoc warning messages.
        +1    release audit    0m 22s   The applied patch does not increase the total number of release audit warnings.
        +1    checkstyle       0m 47s   There were no new checkstyle issues.
        +1    whitespace       0m 0s    The patch has no lines that end in whitespace.
        +1    install          1m 36s   mvn install still works.
        +1    eclipse:eclipse  0m 33s   The patch built with eclipse:eclipse.
        +1    findbugs         1m 26s   The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        -1    yarn tests       50m 33s  Tests failed in hadoop-yarn-server-resourcemanager.
                               88m 42s

        Reason             Tests
        Failed unit tests  hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections

        Subsystem                                    Report/Notes
        Patch URL                                    http://issues.apache.org/jira/secure/attachment/12736694/YARN-3753.patch
        Optional Tests                               javadoc javac unit findbugs checkstyle
        git revision                                 trunk / cdc13ef
        hadoop-yarn-server-resourcemanager test log  https://builds.apache.org/job/PreCommit-YARN-Build/8154/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
        Test Results                                 https://builds.apache.org/job/PreCommit-YARN-Build/8154/testReport/
        Java                                         1.7.0_55
        uname                                        Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output                               https://builds.apache.org/job/PreCommit-YARN-Build/8154/console

        This message was automatically generated.

        jianhe Jian He added a comment -

        Updated the test case to cover the change too.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote  Subsystem        Runtime  Comment
        -1    pre-patch        14m 53s  Findbugs (version ) appears to be broken on trunk.
        +1    @author          0m 0s    The patch does not contain any @author tags.
        +1    tests included   0m 0s    The patch appears to include 1 new or modified test files.
        +1    javac            7m 31s   There were no new javac warning messages.
        +1    javadoc          9m 29s   There were no new javadoc warning messages.
        +1    release audit    0m 22s   The applied patch does not increase the total number of release audit warnings.
        +1    checkstyle       0m 25s   There were no new checkstyle issues.
        +1    whitespace       0m 0s    The patch has no lines that end in whitespace.
        +1    install          1m 33s   mvn install still works.
        +1    eclipse:eclipse  0m 33s   The patch built with eclipse:eclipse.
        +1    findbugs         1m 27s   The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1    yarn tests       50m 16s  Tests passed in hadoop-yarn-server-resourcemanager.
                               86m 33s

        Subsystem                                    Report/Notes
        Patch URL                                    http://issues.apache.org/jira/secure/attachment/12736741/YARN-3753.1.patch
        Optional Tests                               javadoc javac unit findbugs checkstyle
        git revision                                 trunk / 990078b
        hadoop-yarn-server-resourcemanager test log  https://builds.apache.org/job/PreCommit-YARN-Build/8161/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
        Test Results                                 https://builds.apache.org/job/PreCommit-YARN-Build/8161/testReport/
        Java                                         1.7.0_55
        uname                                        Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output                               https://builds.apache.org/job/PreCommit-YARN-Build/8161/console

        This message was automatically generated.

        hadoopqa Hadoop QA added a comment -



        +1 overall



        Vote  Subsystem        Runtime  Comment
        0     pre-patch        15m 53s  Pre-patch trunk compilation is healthy.
        +1    @author          0m 0s    The patch does not contain any @author tags.
        +1    tests included   0m 0s    The patch appears to include 1 new or modified test files.
        +1    javac            7m 33s   There were no new javac warning messages.
        +1    javadoc          9m 36s   There were no new javadoc warning messages.
        +1    release audit    0m 23s   The applied patch does not increase the total number of release audit warnings.
        +1    checkstyle       0m 48s   There were no new checkstyle issues.
        +1    whitespace       0m 0s    The patch has no lines that end in whitespace.
        +1    install          1m 34s   mvn install still works.
        +1    eclipse:eclipse  0m 33s   The patch built with eclipse:eclipse.
        +1    findbugs         1m 27s   The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1    yarn tests       50m 6s   Tests passed in hadoop-yarn-server-resourcemanager.
                               88m 7s

        Subsystem                                    Report/Notes
        Patch URL                                    http://issues.apache.org/jira/secure/attachment/12736776/YARN-3753.2.patch
        Optional Tests                               javadoc javac unit findbugs checkstyle
        git revision                                 trunk / 990078b
        hadoop-yarn-server-resourcemanager test log  https://builds.apache.org/job/PreCommit-YARN-Build/8164/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
        Test Results                                 https://builds.apache.org/job/PreCommit-YARN-Build/8164/testReport/
        Java                                         1.7.0_55
        uname                                        Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output                               https://builds.apache.org/job/PreCommit-YARN-Build/8164/console

        This message was automatically generated.

        kasha Karthik Kambatla added a comment -

        Fix looks reasonable to me.

        xgong Xuan Gong added a comment -

        +1, LGTM. Checking this in.

        xgong Xuan Gong added a comment -

        Committed to branch-2.7. Thanks, Jian.

        sumit.nigam Sumit Nigam added a comment -

        I had a question. Do I need to explicitly set some yarn-site parameter to control runWithRetries in such a case? If so, which parameter needs to be set?


          People

          • Assignee: Jian He
          • Reporter: Sumana Sathish
          • Votes: 0
          • Watchers: 11
