Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-2992

ZKRMStateStore crashes due to session expiry

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      We recently saw the RM crash with the following stacktrace. On session expiry, we should gracefully transition to standby.

      2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: 
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired 
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
      at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
      at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958) 
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687) 
      
      1. yarn-2992-1.patch
        1 kB
        Karthik Kambatla

        Activity

        Hide
        kasha Karthik Kambatla added a comment -

        Patch that retries on session changes as well, also added connection re-establishment.

        Again, in the longer term, a Curator-based implementation would be both more simple and robust.

        Show
        kasha Karthik Kambatla added a comment - Patch that retries on session changes as well, also added connection re-establishment. Again, in the longer term, a Curator-based implementation would be both more simple and robust.
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12689132/yarn-2992-1.patch
        against trunk revision a164ce2.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6192//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6192//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6192//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689132/yarn-2992-1.patch against trunk revision a164ce2. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. -1 findbugs . The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6192//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6192//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6192//console This message is automatically generated.
        Hide
        rohithsharma Rohith Sharma K S added a comment -

        It makes sense to me. +1 for the issue.
        And also I would like to bring up the scenario where ZK is not available during RM start up. I have observed that RM exits while starting if ZK is not available. Why RM can not be transit to standby?

        Show
        rohithsharma Rohith Sharma K S added a comment - It makes sense to me. +1 for the issue. And also I would like to bring up the scenario where ZK is not available during RM start up. I have observed that RM exits while starting if ZK is not available. Why RM can not be transit to standby?
        Hide
        jianhe Jian He added a comment -

        lgtm,

        I have observed that RM exits while starting if ZK is not available

        I think we have retry built in for this scenario ?

        Show
        jianhe Jian He added a comment - lgtm, I have observed that RM exits while starting if ZK is not available I think we have retry built in for this scenario ?
        Hide
        jianhe Jian He added a comment -

        one question: do we need to create a new zkClient object by calling createConnection, or is it OK to re-use the old one ?

        Show
        jianhe Jian He added a comment - one question: do we need to create a new zkClient object by calling createConnection, or is it OK to re-use the old one ?
        Hide
        kasha Karthik Kambatla added a comment -

        one question: do we need to create a new zkClient object by calling createConnection, or is it OK to re-use the old one ?

        Thought about it some at the time of working on the patch. We probably don't need the call to createConnection, as the watcher would probably go off before the next retry or the next. However, given the frequency of session expiries and lost connections, I felt it should be okay to explicitly createConnection. I don't think that will add a significant overhead or lead to inaccuracies.

        Show
        kasha Karthik Kambatla added a comment - one question: do we need to create a new zkClient object by calling createConnection, or is it OK to re-use the old one ? Thought about it some at the time of working on the patch. We probably don't need the call to createConnection, as the watcher would probably go off before the next retry or the next. However, given the frequency of session expiries and lost connections, I felt it should be okay to explicitly createConnection. I don't think that will add a significant overhead or lead to inaccuracies.
        Hide
        jianhe Jian He added a comment -

        sounds good. committing.

        Show
        jianhe Jian He added a comment - sounds good. committing.
        Hide
        jianhe Jian He added a comment -

        Committed to trunk and branch-2, thanks Karthik !

        Thanks Rohith for reviewing the patch !

        Show
        jianhe Jian He added a comment - Committed to trunk and branch-2, thanks Karthik ! Thanks Rohith for reviewing the patch !
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #6792 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6792/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #6792 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6792/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #54 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/54/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #54 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/54/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #788 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/788/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #788 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/788/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #1986 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1986/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #1986 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1986/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #51 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/51/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #51 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/51/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #55 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/55/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #55 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/55/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2005 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2005/)
        YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2005 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2005/ ) YARN-2992 . ZKRMStateStore crashes due to session expiry. Contributed by Karthik Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
        Hide
        rohithsharma Rohith Sharma K S added a comment -

        I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.

        Show
        rohithsharma Rohith Sharma K S added a comment - I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.
        Hide
        rohithsharma Rohith Sharma K S added a comment -

        I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.

        Show
        rohithsharma Rohith Sharma K S added a comment - I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.
        Hide
        rohithsharma Rohith Sharma K S added a comment -

        I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.

        Show
        rohithsharma Rohith Sharma K S added a comment - I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.
        Hide
        rohithsharma Rohith Sharma K S added a comment -

        I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.

        Show
        rohithsharma Rohith Sharma K S added a comment - I see, yes.. In my cluster, configured retry was very less, so Rm was exiting very soon.
        Hide
        chenchun Chun Chen added a comment -

        Karthik Kambatla Rohith Sharma K S Jian He, we are constantly facing the following error
        RM log

        2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.196.128.13/10.196.128.13:2181. Will not attempt to authenticate using SASL (unknown erro
        r)
        2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.196.128.13/10.196.128.13:2181, initiating session
        2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.196.128.13/10.196.128.13:2181, sessionid = 0x24ab193421e4812, negotiated timeout = 
        10000
        2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x24ab193421e4812 for server 10.196.128.13/10.196.128.13:2181, unexpected error, closing socket connection and attempti
        ng reconnect
        java.io.IOException: Broken pipe
                at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
                at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
                at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
                at sun.nio.ch.IOUtil.write(IOUtil.java:65)
                at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
                at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
                at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
                at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
        2015-01-27 00:13:19,517 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
        org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
                at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
                at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82)
                at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003)
        2015-01-27 00:13:19,518 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 934
        

        ZK log

        2015-01-27 00:13:19,300 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46464
        2015-01-27 00:13:19,302 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464
        2015-01-27 00:13:19,302 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812
        2015-01-27 00:13:19,303 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46464
        2015-01-27 00:13:19,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46464
        2015-01-27 00:13:19,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46464
        2015-01-27 00:13:19,320 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415
        2015-01-27 00:13:19,321 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812
        2015-01-27 00:13:23,093 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46477
        2015-01-27 00:13:23,159 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477
        2015-01-27 00:13:23,159 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812
        2015-01-27 00:13:23,160 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46477
        2015-01-27 00:13:23,160 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46477
        2015-01-27 00:13:23,160 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46477
        2015-01-27 00:13:23,170 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415
        2015-01-27 00:13:23,171 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46477 which had sessionid 0x24ab193421e4812
        

        It seems when zk receives a large packet ( Len error 1425415 > 1M) and then initiatively closed the socket connection due to the large packet error. Haven't tried the patch here which renews a zk connection each time it fails to execute a zk operation, but I can't figure out why this happens, since VerifyActiveStatusThread only creates and deletes the fencing node. The packet shouldn't be larger than 1M. Do you have any clues why this happens? Thanks.

        Show
        chenchun Chun Chen added a comment - Karthik Kambatla Rohith Sharma K S Jian He , we are constantly facing the following error RM log 2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.196.128.13/10.196.128.13:2181. Will not attempt to authenticate using SASL (unknown erro r) 2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.196.128.13/10.196.128.13:2181, initiating session 2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.196.128.13/10.196.128.13:2181, sessionid = 0x24ab193421e4812, negotiated timeout = 10000 2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x24ab193421e4812 for server 10.196.128.13/10.196.128.13:2181, unexpected error, closing socket connection and attempti ng reconnect java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068) 2015-01-27 00:13:19,517 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003) 2015-01-27 00:13:19,518 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 934 ZK log 2015-01-27 00:13:19,300 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46464 2015-01-27 00:13:19,302 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464 2015-01-27 00:13:19,302 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812 2015-01-27 00:13:19,303 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46464 2015-01-27 00:13:19,303 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46464 2015-01-27 00:13:19,303 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46464 2015-01-27 00:13:19,320 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415 2015-01-27 00:13:19,321 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812 2015-01-27 00:13:23,093 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46477 2015-01-27 00:13:23,159 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477 2015-01-27 00:13:23,159 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812 2015-01-27 00:13:23,160 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46477 2015-01-27 00:13:23,160 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46477 2015-01-27 00:13:23,160 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46477 2015-01-27 00:13:23,170 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415 2015-01-27 00:13:23,171 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46477 which had sessionid 0x24ab193421e4812 It seems when zk receives a large packet ( Len error 1425415 > 1M) and then initiatively closed the socket connection due to the large packet error. Haven't tried the patch here which renews a zk connection each time it fails to execute a zk operation, but I can't figure out why this happens, since VerifyActiveStatusThread only creates and deletes the fencing node. The packet shouldn't be larger than 1M. Do you have any clues why this happens? Thanks.
        Hide
        vinodkv Vinod Kumar Vavilapalli added a comment -

        Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.

        Show
        vinodkv Vinod Kumar Vavilapalli added a comment - Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.

          People

          • Assignee:
            kasha Karthik Kambatla
            Reporter:
            kasha Karthik Kambatla
          • Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development