Details

    • Type: Sub-task Sub-task
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.5
    • Fix Version/s: 3.4.7
    • Component/s: tests
    • Labels:
      None

      Description

      Following failure pattern has been observed many times in windows build. After creating the zookeeper client, the respective connection bean is not available in the jmx beans and is failing the tests.

          [junit] 2014-01-22 08:58:22,625 [myid:] - INFO  [main:ZKTestCase$1@65] - FAILED testInvalidVersion
          [junit] junit.framework.AssertionFailedError: expected [0x143b92b03330000] expected:<1> but was:<0>
          [junit] 	at junit.framework.Assert.fail(Assert.java:47)
          [junit] 	at junit.framework.Assert.failNotEquals(Assert.java:283)
          [junit] 	at junit.framework.Assert.assertEquals(Assert.java:64)
          [junit] 	at junit.framework.Assert.assertEquals(Assert.java:195)
          [junit] 	at org.apache.zookeeper.test.JMXEnv.ensureAll(JMXEnv.java:124)
          [junit] 	at org.apache.zookeeper.test.ClientBase.createClient(ClientBase.java:191)
          [junit] 	at org.apache.zookeeper.test.ClientBase.createClient(ClientBase.java:171)
          [junit] 	at org.apache.zookeeper.test.ClientBase.createClient(ClientBase.java:156)
          [junit] 	at org.apache.zookeeper.test.ClientBase.createClient(ClientBase.java:149)
          [junit] 	at org.apache.zookeeper.test.MultiTransactionTest.setUp(MultiTransactionTest.java:60)
      

        Issue Links

          Activity

          Rakesh R created issue -
          Rakesh R made changes -
          Field Original Value New Value
          Fix Version/s 3.4.6 [ 12323310 ]
          Hide
          Rakesh R added a comment -

          Reference : https://builds.apache.org/job/ZooKeeper-3.4-WinVS2008_java/416/

          I've done intial analysis and please see the life span of the session 0x143b92b03330000:

          Phase#1 : Here it has established the connection

              [junit] 2014-01-22 08:57:20,683 [myid:] - INFO  [main:ZooKeeper@438] - Initiating client connection, connectString=127.0.0.1:11232 sessionTimeout=30000 watcher=org.apache.zookeeper.test.ClientBase$CountdownWatcher@406be93e
              [junit] 2014-01-22 08:57:20,739 [myid:] - INFO  [main-SendThread(127.0.0.1:11232):ClientCnxn$SendThread@1228] - Session establishment complete on server 127.0.0.1/127.0.0.1:11232, sessionid = 0x143b92b03330000, negotiated timeout = 30000
          

          Phase#2 : Here it shows the client is not able to get the response from the zk server

              [junit] 2014-01-22 08:57:20,740 [myid:] - INFO  [main:JMXEnv@113] - expect:0x143b92b03330000
              [junit] 2014-01-22 08:57:20,841 [myid:] - INFO  [main:JMXEnv@113] - expect:0x143b92b03330000
              [junit] 2014-01-22 08:57:21,437 [myid:] - INFO  [main-SendThread(127.0.0.1:11232):ClientCnxn$SendThread@1091] - Unable to read additional data from server sessionid 0x143b92b03330000, likely server has closed socket, closing socket connection and attempting reconnect
          

          Phase#3 : Here the client is trying to renew the session 0x143b92b03330000, but sees an expiration.

              [junit] 2014-01-22 08:57:23,488 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11232:ZooKeeperServer@861] - Client attempting to renew session 0x143b92b03330000 at /127.0.0.1:56319
              [junit] 2014-01-22 08:57:23,488 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11232:ZooKeeperServer@610] - Invalid session 0x143b92b03330000 for client /127.0.0.1:56319, probably expired
              [junit] 2014-01-22 08:57:23,488 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11232:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:56319 which had sessionid 0x143b92b03330000
              [junit] 2014-01-22 08:57:23,489 [myid:] - INFO  [main-SendThread(127.0.0.1:11232):ClientCnxn$SendThread@1087] - Unable to reconnect to ZooKeeper service, session 0x143b92b03330000 has expired, closing socket connection
              [junit] 2014-01-22 08:57:23,489 [myid:] - INFO  [main:JMXEnv@113] - expect:0x143b92b03330000
              [junit] 2014-01-22 08:57:23,489 [myid:] - INFO  [main-EventThread:ClientCnxn$EventThread@509] - EventThread shut down
          

          I think, the server has closed the session 0x143b92b03330000 and removed the connection. From the logs, I couldn't see more information about expiration or session closure that has occured in the zk server.

          Show
          Rakesh R added a comment - Reference : https://builds.apache.org/job/ZooKeeper-3.4-WinVS2008_java/416/ I've done intial analysis and please see the life span of the session 0x143b92b03330000: Phase#1 : Here it has established the connection [junit] 2014-01-22 08:57:20,683 [myid:] - INFO [main:ZooKeeper@438] - Initiating client connection, connectString=127.0.0.1:11232 sessionTimeout=30000 watcher=org.apache.zookeeper.test.ClientBase$CountdownWatcher@406be93e [junit] 2014-01-22 08:57:20,739 [myid:] - INFO [main-SendThread(127.0.0.1:11232):ClientCnxn$SendThread@1228] - Session establishment complete on server 127.0.0.1/127.0.0.1:11232, sessionid = 0x143b92b03330000, negotiated timeout = 30000 Phase#2 : Here it shows the client is not able to get the response from the zk server [junit] 2014-01-22 08:57:20,740 [myid:] - INFO [main:JMXEnv@113] - expect:0x143b92b03330000 [junit] 2014-01-22 08:57:20,841 [myid:] - INFO [main:JMXEnv@113] - expect:0x143b92b03330000 [junit] 2014-01-22 08:57:21,437 [myid:] - INFO [main-SendThread(127.0.0.1:11232):ClientCnxn$SendThread@1091] - Unable to read additional data from server sessionid 0x143b92b03330000, likely server has closed socket, closing socket connection and attempting reconnect Phase#3 : Here the client is trying to renew the session 0x143b92b03330000, but sees an expiration. [junit] 2014-01-22 08:57:23,488 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11232:ZooKeeperServer@861] - Client attempting to renew session 0x143b92b03330000 at /127.0.0.1:56319 [junit] 2014-01-22 08:57:23,488 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11232:ZooKeeperServer@610] - Invalid session 0x143b92b03330000 for client /127.0.0.1:56319, probably expired [junit] 2014-01-22 08:57:23,488 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11232:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:56319 which had sessionid 0x143b92b03330000 [junit] 2014-01-22 08:57:23,489 [myid:] - INFO [main-SendThread(127.0.0.1:11232):ClientCnxn$SendThread@1087] - Unable to reconnect to ZooKeeper service, session 0x143b92b03330000 has expired, closing socket connection [junit] 2014-01-22 08:57:23,489 [myid:] - INFO [main:JMXEnv@113] - expect:0x143b92b03330000 [junit] 2014-01-22 08:57:23,489 [myid:] - INFO [main-EventThread:ClientCnxn$EventThread@509] - EventThread shut down I think, the server has closed the session 0x143b92b03330000 and removed the connection. From the logs, I couldn't see more information about expiration or session closure that has occured in the zk server.
          Hide
          Germán Blanco added a comment -

          I see a lot of "Refusing session request for client ... as it has seen zxid 0x100000003 our last zxid is 0x0 client must try another server" in that test case.

          Show
          Germán Blanco added a comment - I see a lot of "Refusing session request for client ... as it has seen zxid 0x100000003 our last zxid is 0x0 client must try another server" in that test case.
          Germán Blanco made changes -
          Assignee Germán Blanco [ abranzyck ]
          Hide
          Germán Blanco added a comment -

          One could fix createClient so that it checks JMXEnv and if the client is not there, then it closes the client and tries again.
          That might work, and I will prepare the patch in case anyone wants to give it a try.
          However, it is very strange that we have at least one case in which the client seems to see zxid=0x1000..0003 and the server only zxid=0x0. I think it would be worthwhile to look at how that happened, but for that we might need e.g. more traces ... and if we increase the debugging level a lot of things could happen since we have seen that many of these issues are related with overload in CPU already with the current debugging level.

          Show
          Germán Blanco added a comment - One could fix createClient so that it checks JMXEnv and if the client is not there, then it closes the client and tries again. That might work, and I will prepare the patch in case anyone wants to give it a try. However, it is very strange that we have at least one case in which the client seems to see zxid=0x1000..0003 and the server only zxid=0x0. I think it would be worthwhile to look at how that happened, but for that we might need e.g. more traces ... and if we increase the debugging level a lot of things could happen since we have seen that many of these issues are related with overload in CPU already with the current debugging level.
          Hide
          Germán Blanco added a comment -

          patch with a loop in createClient. Works for trunk and branch3.4.

          Show
          Germán Blanco added a comment - patch with a loop in createClient. Works for trunk and branch3.4.
          Germán Blanco made changes -
          Attachment ZOOKEEPER-1866.patch [ 12624341 ]
          Germán Blanco made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12624341/ZOOKEEPER-1866.patch
          against trunk revision 1560172.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1896//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1896//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1896//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12624341/ZOOKEEPER-1866.patch against trunk revision 1560172. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1896//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1896//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1896//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Thanks for proposing a patch, Germán Blanco. I think we shouldn´t take it without understanding a bit better what's going on here. The observation about the zxids is a bit odd, although possible. Unfortunately, I can't access builds.a.o. right now, but I'll try to check it later on. If you have any more insight in the meanwhile, please post.

          Show
          Flavio Junqueira added a comment - Thanks for proposing a patch, Germán Blanco . I think we shouldn´t take it without understanding a bit better what's going on here. The observation about the zxids is a bit odd, although possible. Unfortunately, I can't access builds.a.o. right now, but I'll try to check it later on. If you have any more insight in the meanwhile, please post.
          Flavio Junqueira made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Rakesh R added a comment -

          Hi German/Flavio,
          I'm suspecting a case where old server/quorum (previous test case is not done the cleanups) is still running. I've raised separate jira ZOOKEEPER-1872 to handle the proper cleanups independently, probably that could be one of the reason for this issue. Please have a look at it.

          Show
          Rakesh R added a comment - Hi German/Flavio, I'm suspecting a case where old server/quorum (previous test case is not done the cleanups) is still running. I've raised separate jira ZOOKEEPER-1872 to handle the proper cleanups independently, probably that could be one of the reason for this issue. Please have a look at it.
          Germán Blanco made changes -
          Link This issue relates to ZOOKEEPER-1872 [ ZOOKEEPER-1872 ]
          Hide
          Germán Blanco added a comment -

          ZOOKEEPER-1872 could solve this problem, so let's finish that one and see if this error happens again.

          Show
          Germán Blanco added a comment - ZOOKEEPER-1872 could solve this problem, so let's finish that one and see if this error happens again.
          Michi Mutsuzaki made changes -
          Fix Version/s 3.4.7 [ 12325149 ]
          Fix Version/s 3.4.6 [ 12323310 ]

            People

            • Assignee:
              Germán Blanco
              Reporter:
              Rakesh R
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development