ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-344

doIO in NioServerCnxn: Exception causing close of session : cause is "read error"

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: java client, server
    • Labels:
      None
    • Environment:

      jdk1.6.0_07
      Linux blade2 2.6.27.7-134.fc10.x86_64 #1 SMP Mon Dec 1 22:21:35 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

    • Hadoop Flags:
      Reviewed

      Description

      I have been having a problem with zookeeper 3.0.1 and now with 3.1.0 where I see a lot of expired sessions. I am using a 16 node cluster which is all on the same local network. There is a single zookeeper instance (these are benchmarking runs).
      The problem appears to be correlated with either run time or system load.\

      Personally I think that it is system load because I have session session expired events under a Windows platform running zookeeper and the application (i.e., everthing is local) when the application load suddenly spikes. To me this suggests that the client is not able to renew (ping) the zookeeper service in a timely manner and is expired. But the log messages below with the "read error" suggest that maybe there is something else going on?

      Zookeeper Configuration
      #Wed Mar 18 12:41:05 GMT-05:00 2009
      clientPort=2181
      dataDir=/var/bigdata/benchmark/zookeeper/1
      syncLimit=2
      dataLogDir=/var/bigdata/benchmark/zookeeper/1
      tickTime=2000

      Some representative log messages are below.

      Client side messages (from our app)
      ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-18 13:35:40,335 - Session expired: WatchedEvent: Server state change. New state: Expired : zpath=/benchmark/jobs/com.bigdata.service.jini.benchmark.ThroughputMaster/test_1/client1160/locknode
      ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-18 13:35:40,335 - Session expired: WatchedEvent: Server state change. New state: Expired : zpath=/benchmark/jobs/com.bigdata.service.jini.benchmark.ThroughputMaster/test_1/client1356/locknode

      Server side messages:
      WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-18 13:06:57,252 - Exception causing close of session 0x1201aac14300022 due to java.io.IOException: Read error
      WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-18 13:06:58,198 - Exception causing close of session 0x1201aac1430000f due to java.io.IOException: Read error

        Activity

        Patrick Hunt made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Patrick Hunt made changes -
        Assignee Patrick Hunt [ phunt ]
        Patrick Hunt made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Invalid [ 6 ]
        Hide
        Patrick Hunt added a comment -

        Closing the issue, user tracked this down to client swapping issues.

        Show
        Patrick Hunt added a comment - Closing the issue, user tracked this down to client swapping issues.
        Hide
        Patrick Hunt added a comment -

        Bryan, thanks for the update, I have updated the troubleshooting page with your feedback:
        http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

        Show
        Patrick Hunt added a comment - Bryan, thanks for the update, I have updated the troubleshooting page with your feedback: http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
        Hide
        bryan thompson added a comment -

        Update:

        This issue is clearly linked to heavy utilization or swapping on the clients. I find that if I keep the clients from swapping that this error
        materializes relatively infrequently, and when it does materialize it is linked to a sudden increase in load. For example, the concurrent
        start of 100 clients on 14 machines will sometimes trigger this issue. I believe that the issue can be closed at this point with the note
        that swapping will cause expired connections. I also observe similar problems with jini / river, including cases where DGC (distributed
        garbage collection) appears to fail. All in all, it is my sense that Java processes must avoid swapping if they want to have not just timely
        but also reliable behavior.

        Thanks,

        -bryan

        Show
        bryan thompson added a comment - Update: This issue is clearly linked to heavy utilization or swapping on the clients. I find that if I keep the clients from swapping that this error materializes relatively infrequently, and when it does materialize it is linked to a sudden increase in load. For example, the concurrent start of 100 clients on 14 machines will sometimes trigger this issue. I believe that the issue can be closed at this point with the note that swapping will cause expired connections. I also observe similar problems with jini / river, including cases where DGC (distributed garbage collection) appears to fail. All in all, it is my sense that Java processes must avoid swapping if they want to have not just timely but also reliable behavior. Thanks, -bryan
        Hide
        Mahadev konar added a comment -

        brayn,
        the one thing you can do is run with tracefile option in the config. Please take a look at
        http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html to see how to set up a tracefile. The tracefile has logs of all the transactions that go though the server and all
        the components of the server and helps in debugging problems such as these, where you can point out which transaction got delayed and at what time and sometimes can point out
        the reason why...

        Show
        Mahadev konar added a comment - brayn, the one thing you can do is run with tracefile option in the config. Please take a look at http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html to see how to set up a tracefile. The tracefile has logs of all the transactions that go though the server and all the components of the server and helps in debugging problems such as these, where you can point out which transaction got delayed and at what time and sometimes can point out the reason why...
        Hide
        bryan thompson added a comment -

        I am not sure how to boil this down into a problem which can be run on a single machine. This is a distributed database benchmark. The problem shows up when the cluster is under load. How would I go about isolating that further outside of writing stress tests for zookeeper?

        If this is indeed a zookeeper bug and you have some idea of the possible issues involved, then perhaps you can suggest some additional instrumentation of zookeeper and I could run against a version with more instrumentation which might reveal something?

        Thanks,

        -bryan

        Show
        bryan thompson added a comment - I am not sure how to boil this down into a problem which can be run on a single machine. This is a distributed database benchmark. The problem shows up when the cluster is under load. How would I go about isolating that further outside of writing stress tests for zookeeper? If this is indeed a zookeeper bug and you have some idea of the possible issues involved, then perhaps you can suggest some additional instrumentation of zookeeper and I could run against a version with more instrumentation which might reveal something? Thanks, -bryan
        Hide
        Mahadev konar added a comment -

        hmm... Can you create a sample test application that I could run on my machine and see if I reproduce this on my linux machines?

        Show
        Mahadev konar added a comment - hmm... Can you create a sample test application that I could run on my machine and see if I reproduce this on my linux machines?
        Hide
        bryan thompson added a comment -

        It is a Linux platform, which I describe above. it is a standalone instance however rather than an ensemble.

        Show
        bryan thompson added a comment - It is a Linux platform, which I describe above. it is a standalone instance however rather than an ensemble.
        Hide
        Mahadev konar added a comment -

        for the first trace, closed != expired is true. it seems like the client reconnected to the server and didnt expire in the first case.

        in the second case it did expire.

        You are on a windows machine right? and also its a single server ensemble? I dont usually use a windows box but I did notice once that creating a log file with size of 64MB initially took
        a lot of time (5-15 seconds and was repeatable) on my windows box with no other load . Just to give you background, zookeeper takes snaphots and has transaction logs. The transaction logs
        are created with an initialize size of 64MB so that we do not have to update the size of the file repeatedly. This operation did take a lot of time on my windows machine and this did cause huge
        latencies in my zookeeper clients and lots of timeouts and session expired.

        Ill take another look at your logs to see if i can find something that might explain the frequent session timeouts.

        Show
        Mahadev konar added a comment - for the first trace, closed != expired is true. it seems like the client reconnected to the server and didnt expire in the first case. in the second case it did expire. You are on a windows machine right? and also its a single server ensemble? I dont usually use a windows box but I did notice once that creating a log file with size of 64MB initially took a lot of time (5-15 seconds and was repeatable) on my windows box with no other load . Just to give you background, zookeeper takes snaphots and has transaction logs. The transaction logs are created with an initialize size of 64MB so that we do not have to update the size of the file repeatedly. This operation did take a lot of time on my windows machine and this did cause huge latencies in my zookeeper clients and lots of timeouts and session expired. Ill take another look at your logs to see if i can find something that might explain the frequent session timeouts.
        Hide
        bryan thompson added a comment -

        Here are some more stack traces with DEBUG on the server and the client for this issue. The configuration, etc. is the same. Logs were written onto an NFS share but the machines are synched with ntpd.

        There are two distinct periods reported here. One leads to a warning on the server but not to an expired session while the other issues the same warning on the server and leads to an expired session.

        Here is a "ping" for sessionid 0x120597b6137000b shortly before the warning.

        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 17:34:33,643 - Processing request:: sessionid:0x120597b6137000b type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a
        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:137) 2009-03-30 17:34:33,643 - sessionid:0x120597b6137000b type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a
        DEBUG [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:548) 2009-03-30 17:34:33,643 - Got ping response for sessionid:0x120597b6137000b after 1ms

        Here is the "Exception causing close of session".

        WARN [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:898) 2009-03-30 17:34:48,120 - Exception closing session 0x120597b6137000b to sun.nio.ch.SelectionKeyImpl@7eb1cc87
        java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:837)
        WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-30 17:34:48,166 - Exception causing close of session 0x120597b6137000b due to java.io.IOException: Read error
        DEBUG [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:420) 2009-03-30 17:34:48,166 - IOException stack trace
        java.io.IOException: Read error
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:752) 2009-03-30 17:34:48,172 - closing session:0x120597b6137000b NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.6.21:2181 remote=/192.168.6.28:60720]

        And here is appears that the closed session was re-initialized? Perhaps closed != expired?

        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.finishSessionInit(NIOServerCnxn.java:881) 2009-03-30 17:34:50,111 - Finished init of 0x120597b6137000b valid:true
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:531) 2009-03-30 17:34:50,111 - Renewing session 0x120597b6137000b
        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 17:34:50,112 - Processing request:: sessionid:0x120597b6137000b type:setWatches cxid:0xfffffffffffffff8 zxid:0xfffffffffffffffe txntype:unknown
        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:137) 2009-03-30 17:34:50,112 - sessionid:0x120597b6137000b type:setWatches cxid:0xfffffffffffffff8 zxid:0xfffffffffffffffe txntype:unknown
        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 17:34:50,121 - Processing request:: sessionid:0x120597b6137000b type:create cxid:0x8 zxid:0xb95 txntype:-1 n/a
        DEBUG [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:610) 2009-03-30 17:34:50,126 - Reading reply sessionid:0x120597b6137000b, packet:: path:null finished:false header:: -8,101 replyHeader:: -8,2964,0 request:: 2964,v{},v

        {'/benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000011/masterElection_INVALID}

        ,v{} response:: null

        ------------------------------------------------------------

        Another trace that leads to an expired session:

        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 18:35:39,478 - Processing request:: sessionid:0x120597b61370008 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a
        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:137) 2009-03-30 18:35:39,478 - sessionid:0x120597b61370008 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a
        DEBUG [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:548) 2009-03-30 18:35:39,478 - Got ping response for sessionid:0x120597b61370008 after 1ms

        WARN [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:898) 2009-03-30 18:35:58,082 - Exception closing session 0x120597b61370008 to sun.nio.ch.SelectionKeyImpl@6e731a5b
        java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:837)
        WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-30 18:35:58,084 - Exception causing close of session 0x120597b61370008 due to java.io.IOException: Read error
        DEBUG [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:420) 2009-03-30 18:35:58,084 - IOException stack trace
        java.io.IOException: Read error
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:752) 2009-03-30 18:35:58,085 - closing session:0x120597b61370008 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.6.21:2181 remote=/192.168.6.34:58347]

        INFO [SessionTracker] org.apache.zookeeper.server.SessionTrackerImpl.run(SessionTrackerImpl.java:132) 2009-03-30 18:36:00,000 - Expiring session 0x120597b61370008
        INFO [SessionTracker] org.apache.zookeeper.server.ZooKeeperServer.expire(ZooKeeperServer.java:317) 2009-03-30 18:36:00,001 - Expiring session 0x120597b61370008
        INFO [ProcessThread:-1] org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:360) 2009-03-30 18:36:00,003 - Processed session termination request for id: 0x120597b61370008
        DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 18:36:00,013 - Processing request:: sessionid:0x120597b61370008 type:closeSession cxid:0x0 zxid:0xbaf txntype:-11 n/a
        DEBUG [SyncThread:0] org.apache.zookeeper.server.DataTree.killSession(DataTree.java:743) 2009-03-30 18:36:00,014 - Deleting ephemeral node /benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000005/masterElection/lock0000000000 for session 0x120597b61370008
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:503) 2009-03-30 18:36:00,072 - Connected to /192.168.6.34:45222 lastZxid 2984
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.finishSessionInit(NIOServerCnxn.java:881) 2009-03-30 18:36:00,073 - Finished init of 0x120597b61370008 valid:false
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:531) 2009-03-30 18:36:00,073 - Renewing session 0x120597b61370008
        WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-30 18:36:00,074 - Exception causing close of session 0x120597b61370008 due to java.io.IOException: closing
        DEBUG [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:420) 2009-03-30 18:36:00,074 - IOException stack trace
        java.io.IOException: closing
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:373)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
        INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:752) 2009-03-30 18:36:00,074 - closing session:0x120597b61370008 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.6.21:2181 remote=/192.168.6.34:4522\
        2]

        INFO [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:800) 2009-03-30 18:36:00,069 - Attempting connection to server blade2/192.168.6.21:2181
        INFO [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.primeConnection(ClientCnxn.java:716) 2009-03-30 18:36:00,070 - Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.6.34:45222 remote=blade2/192.168.6.21:2181]
        INFO [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:868) 2009-03-30 18:36:00,070 - Server connection successful
        WARN [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:898) 2009-03-30 18:36:00,073 - Exception closing session 0x120597b61370008 to sun.nio.ch.SelectionKeyImpl@3d38013d
        java.io.IOException: Session Expired
        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:527)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:640)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:876)
        WARN [main-EventThread] com.bigdata.service.jini.AbstractServer$1.process(AbstractServer.java:759) 2009-03-30 18:36:00,074 - Zookeeper session expired: cancelled master election task: com.bigdata.service.jini.AbstractServer$1@79b5334c
        ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-30 18:36:00,074 - Session expired: WatchedEvent: Server state change. New state: Expired : zxid=120597b61370008, zpath=/benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000005/masterElection
        WARN [com.bigdata.service.jini.JiniFederation.executorService1] com.bigdata.zookeeper.ZLockImpl.unlock(ZLockImpl.java:1188) 2009-03-30 18:36:00,077 - Session expired: zpath=/benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000005/masterElection, child=lock0000000000 : org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired

        Show
        bryan thompson added a comment - Here are some more stack traces with DEBUG on the server and the client for this issue. The configuration, etc. is the same. Logs were written onto an NFS share but the machines are synched with ntpd. There are two distinct periods reported here. One leads to a warning on the server but not to an expired session while the other issues the same warning on the server and leads to an expired session. Here is a "ping" for sessionid 0x120597b6137000b shortly before the warning. DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 17:34:33,643 - Processing request:: sessionid:0x120597b6137000b type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:137) 2009-03-30 17:34:33,643 - sessionid:0x120597b6137000b type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a DEBUG [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:548) 2009-03-30 17:34:33,643 - Got ping response for sessionid:0x120597b6137000b after 1ms Here is the "Exception causing close of session". WARN [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:898) 2009-03-30 17:34:48,120 - Exception closing session 0x120597b6137000b to sun.nio.ch.SelectionKeyImpl@7eb1cc87 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:837) WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-30 17:34:48,166 - Exception causing close of session 0x120597b6137000b due to java.io.IOException: Read error DEBUG [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:420) 2009-03-30 17:34:48,166 - IOException stack trace java.io.IOException: Read error at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295) at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162) INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:752) 2009-03-30 17:34:48,172 - closing session:0x120597b6137000b NIOServerCnxn: java.nio.channels.SocketChannel [connected local=/192.168.6.21:2181 remote=/192.168.6.28:60720] And here is appears that the closed session was re-initialized? Perhaps closed != expired? INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.finishSessionInit(NIOServerCnxn.java:881) 2009-03-30 17:34:50,111 - Finished init of 0x120597b6137000b valid:true INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:531) 2009-03-30 17:34:50,111 - Renewing session 0x120597b6137000b DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 17:34:50,112 - Processing request:: sessionid:0x120597b6137000b type:setWatches cxid:0xfffffffffffffff8 zxid:0xfffffffffffffffe txntype:unknown DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:137) 2009-03-30 17:34:50,112 - sessionid:0x120597b6137000b type:setWatches cxid:0xfffffffffffffff8 zxid:0xfffffffffffffffe txntype:unknown DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 17:34:50,121 - Processing request:: sessionid:0x120597b6137000b type:create cxid:0x8 zxid:0xb95 txntype:-1 n/a DEBUG [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:610) 2009-03-30 17:34:50,126 - Reading reply sessionid:0x120597b6137000b, packet:: path:null finished:false header:: -8,101 replyHeader:: -8,2964,0 request:: 2964,v{},v {'/benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000011/masterElection_INVALID} ,v{} response:: null ------------------------------------------------------------ Another trace that leads to an expired session: DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 18:35:39,478 - Processing request:: sessionid:0x120597b61370008 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:137) 2009-03-30 18:35:39,478 - sessionid:0x120597b61370008 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown n/a DEBUG [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:548) 2009-03-30 18:35:39,478 - Got ping response for sessionid:0x120597b61370008 after 1ms WARN [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:898) 2009-03-30 18:35:58,082 - Exception closing session 0x120597b61370008 to sun.nio.ch.SelectionKeyImpl@6e731a5b java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:837) WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-30 18:35:58,084 - Exception causing close of session 0x120597b61370008 due to java.io.IOException: Read error DEBUG [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:420) 2009-03-30 18:35:58,084 - IOException stack trace java.io.IOException: Read error at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295) at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162) INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:752) 2009-03-30 18:35:58,085 - closing session:0x120597b61370008 NIOServerCnxn: java.nio.channels.SocketChannel [connected local=/192.168.6.21:2181 remote=/192.168.6.34:58347] INFO [SessionTracker] org.apache.zookeeper.server.SessionTrackerImpl.run(SessionTrackerImpl.java:132) 2009-03-30 18:36:00,000 - Expiring session 0x120597b61370008 INFO [SessionTracker] org.apache.zookeeper.server.ZooKeeperServer.expire(ZooKeeperServer.java:317) 2009-03-30 18:36:00,001 - Expiring session 0x120597b61370008 INFO [ProcessThread:-1] org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:360) 2009-03-30 18:36:00,003 - Processed session termination request for id: 0x120597b61370008 DEBUG [SyncThread:0] org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) 2009-03-30 18:36:00,013 - Processing request:: sessionid:0x120597b61370008 type:closeSession cxid:0x0 zxid:0xbaf txntype:-11 n/a DEBUG [SyncThread:0] org.apache.zookeeper.server.DataTree.killSession(DataTree.java:743) 2009-03-30 18:36:00,014 - Deleting ephemeral node /benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000005/masterElection/lock0000000000 for session 0x120597b61370008 INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:503) 2009-03-30 18:36:00,072 - Connected to /192.168.6.34:45222 lastZxid 2984 INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.finishSessionInit(NIOServerCnxn.java:881) 2009-03-30 18:36:00,073 - Finished init of 0x120597b61370008 valid:false INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:531) 2009-03-30 18:36:00,073 - Renewing session 0x120597b61370008 WARN [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:417) 2009-03-30 18:36:00,074 - Exception causing close of session 0x120597b61370008 due to java.io.IOException: closing DEBUG [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:420) 2009-03-30 18:36:00,074 - IOException stack trace java.io.IOException: closing at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:373) at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162) INFO [NIOServerCxn.Factory:2181] org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:752) 2009-03-30 18:36:00,074 - closing session:0x120597b61370008 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.6.21:2181 remote=/192.168.6.34:4522\ 2] INFO [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:800) 2009-03-30 18:36:00,069 - Attempting connection to server blade2/192.168.6.21:2181 INFO [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.primeConnection(ClientCnxn.java:716) 2009-03-30 18:36:00,070 - Priming connection to java.nio.channels.SocketChannel [connected local=/192.168.6.34:45222 remote=blade2/192.168.6.21:2181] INFO [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:868) 2009-03-30 18:36:00,070 - Server connection successful WARN [main-SendThread] org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:898) 2009-03-30 18:36:00,073 - Exception closing session 0x120597b61370008 to sun.nio.ch.SelectionKeyImpl@3d38013d java.io.IOException: Session Expired at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:527) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:640) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:876) WARN [main-EventThread] com.bigdata.service.jini.AbstractServer$1.process(AbstractServer.java:759) 2009-03-30 18:36:00,074 - Zookeeper session expired: cancelled master election task: com.bigdata.service.jini.AbstractServer$1@79b5334c ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-30 18:36:00,074 - Session expired: WatchedEvent: Server state change. New state: Expired : zxid=120597b61370008, zpath=/benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000005/masterElection WARN [com.bigdata.service.jini.JiniFederation.executorService1] com.bigdata.zookeeper.ZLockImpl.unlock(ZLockImpl.java:1188) 2009-03-30 18:36:00,077 - Session expired: zpath=/benchmark/config/com.bigdata.service.jini.DataServer/logicalService0000000005/masterElection, child=lock0000000000 : org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        Hide
        Patrick Hunt added a comment -

        That's fine. I guess what I mean is that it would be interesting to see the debug logs
        for both the server and client at the time that the issue(s) start. We might get
        more insight into what is happening if we can do that.

        Show
        Patrick Hunt added a comment - That's fine. I guess what I mean is that it would be interesting to see the debug logs for both the server and client at the time that the issue(s) start. We might get more insight into what is happening if we can do that.
        Hide
        bryan thompson added a comment -

        Patrick, I did not try to coordinate the client and server logs but rather drew representative samples from each. As far as I can tell it is more of the same in both logs. However, I will correlate the events and the zxids and see if I can get that debug trace you suggested. -bryan

        Show
        bryan thompson added a comment - Patrick, I did not try to coordinate the client and server logs but rather drew representative samples from each. As far as I can tell it is more of the same in both logs. However, I will correlate the events and the zxids and see if I can get that debug trace you suggested. -bryan
        Hide
        Patrick Hunt added a comment -

        Bryan, that's good info. It doesn't sound like zk server latency is the issue then, you have an excess
        of cpu/memory based on the tests you are running, however it will be good to verify using jmx or the
        stat command.

        If you can run with DEBUG logging enabled (server and client) it might give you more insight. Also running
        at DEBUG level will cause the stack of the "read error" you are seeing to be printed to the server log (zk
        version 3.1). If you can share all/part of the logs please feel free to attach them to this JIRA.

        It's probably this code in server doIO though that's causing the server side "read error" exception you are seeing:

        int rc = sock.read(incomingBuffer);
        if (rc < 0)

        { throw new IOException("Read error"); }

        read returns "The number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream"

        this indicates to me that the client has closed the connection.

        Also, looking at your logs the client log is from 13:35 while the server log is from 13:06, assuming that the
        clocks are even fairly close this is almost 30min difference, if true it's unlikely the events are correlated?

        My guess is that the client is closing the connection for some reason, but it would be interesting to see
        the debug logs (with clocks that are fairly close on server/client so it would be easier to correlate the log
        events).

        Hope this helps.

        Show
        Patrick Hunt added a comment - Bryan, that's good info. It doesn't sound like zk server latency is the issue then, you have an excess of cpu/memory based on the tests you are running, however it will be good to verify using jmx or the stat command. If you can run with DEBUG logging enabled (server and client) it might give you more insight. Also running at DEBUG level will cause the stack of the "read error" you are seeing to be printed to the server log (zk version 3.1). If you can share all/part of the logs please feel free to attach them to this JIRA. It's probably this code in server doIO though that's causing the server side "read error" exception you are seeing: int rc = sock.read(incomingBuffer); if (rc < 0) { throw new IOException("Read error"); } read returns "The number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream" this indicates to me that the client has closed the connection. Also, looking at your logs the client log is from 13:35 while the server log is from 13:06, assuming that the clocks are even fairly close this is almost 30min difference, if true it's unlikely the events are correlated? My guess is that the client is closing the connection for some reason, but it would be interesting to see the debug logs (with clocks that are fairly close on server/client so it would be easier to correlate the log events). Hope this helps.
        Hide
        bryan thompson added a comment -

        I missed the question about the zk server. It is an 8 core (2 quad core Opterons) 4x512k cache, 2.3Ghz clock with 32G ram.

        Show
        bryan thompson added a comment - I missed the question about the zk server. It is an 8 core (2 quad core Opterons) 4x512k cache, 2.3Ghz clock with 32G ram.
        Hide
        bryan thompson added a comment -

        Let me clarify a few things based on the other comments:

        1. The sessionTimeout for the client was set to 20000.

        2. The zookeeper server is running on a host with very little total load (very little CPU utilization and very low disk write rates). There is only one disk available for the zookeeper transaction log. It is a SAS 10k spindle with a 16M cache.

        3. The zookeeper server process has 4G of RAM.

        4. The benchmark is not a zookeeper benchmark, but a database benchmark. Zookeeper is being used for distributed locks and master elections. There is relatively little activity for the zookeeper server.

        I will modify the logged message to record the zxid and report back some correlated events.

        I will also report the output of the stat command from the server for several times during the run / JXM, which I've enabled.

        Show
        bryan thompson added a comment - Let me clarify a few things based on the other comments: 1. The sessionTimeout for the client was set to 20000. 2. The zookeeper server is running on a host with very little total load (very little CPU utilization and very low disk write rates). There is only one disk available for the zookeeper transaction log. It is a SAS 10k spindle with a 16M cache. 3. The zookeeper server process has 4G of RAM. 4. The benchmark is not a zookeeper benchmark, but a database benchmark. Zookeeper is being used for distributed locks and master elections. There is relatively little activity for the zookeeper server. I will modify the logged message to record the zxid and report back some correlated events. I will also report the output of the stat command from the server for several times during the run / JXM, which I've enabled.
        Patrick Hunt made changes -
        Field Original Value New Value
        Fix Version/s 3.2.0 [ 12313491 ]
        Component/s server [ 12312382 ]
        Hide
        Patrick Hunt added a comment -

        Hi Bryan, you might also try looking at some of the statistics using the "stat" command:
        http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
        this will give you insight on the min/max/avg latency of requests. You could also use JMX if that works for you:
        http://hadoop.apache.org/zookeeper/docs/current/zookeeperJMX.html

        What is the timeout value you are using for your ZK clients? If your max latency is exceeding your client
        timeouts then you will definitely see expirations.

        Secondly review this section, specifically related to tranaction log placement and jdk memory (swapping) issues:
        http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems
        Either of these issues can cause performance to dip, and latencies to increase.

        This information, along with a bit more detail on your benchmark would help you/us identify what's causing
        these issues. Re your benchmark, how many operations/sec are you running? What's the read/write split?

        Your zk server is a single quad-core x86_64 cpu, correct?

        Show
        Patrick Hunt added a comment - Hi Bryan, you might also try looking at some of the statistics using the "stat" command: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands this will give you insight on the min/max/avg latency of requests. You could also use JMX if that works for you: http://hadoop.apache.org/zookeeper/docs/current/zookeeperJMX.html What is the timeout value you are using for your ZK clients? If your max latency is exceeding your client timeouts then you will definitely see expirations. Secondly review this section, specifically related to tranaction log placement and jdk memory (swapping) issues: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems Either of these issues can cause performance to dip, and latencies to increase. This information, along with a bit more detail on your benchmark would help you/us identify what's causing these issues. Re your benchmark, how many operations/sec are you running? What's the read/write split? Your zk server is a single quad-core x86_64 cpu, correct?
        Hide
        Mahadev konar added a comment - - edited
        ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-18 13:35:40,335 - Session expired: WatchedEvent: Server state change. New state: Expired : zpath=/benchmark/jobs/com.bigdata.service.jini.benchmark.ThroughputMaster/test_1/client1160/locknode
        ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-18 13:35:40,335 - Session expired: WatchedEvent: Server state change. New state: Expired : zpath=/benchmark/jobs/com.bigdata.service.jini.benchmark.ThroughputMaster/test_1/client1356/locknode
        

        can you post corresponding session id's with these ?

        and also the logs related to their session closing with the timestamps (on the server side).

        Show
        Mahadev konar added a comment - - edited ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-18 13:35:40,335 - Session expired: WatchedEvent: Server state change. New state: Expired : zpath=/benchmark/jobs/com.bigdata.service.jini.benchmark.ThroughputMaster/test_1/client1160/locknode ERROR [main-EventThread] com.bigdata.zookeeper.ZLockImpl$ZLockWatcher.process(ZLockImpl.java:400) 2009-03-18 13:35:40,335 - Session expired: WatchedEvent: Server state change. New state: Expired : zpath=/benchmark/jobs/com.bigdata.service.jini.benchmark.ThroughputMaster/test_1/client1356/locknode can you post corresponding session id's with these ? and also the logs related to their session closing with the timestamps (on the server side).
        bryan thompson created issue -

          People

          • Assignee:
            Patrick Hunt
            Reporter:
            bryan thompson
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development