ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-597

ASyncHammerTest is failing intermittently on hudson trunk

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.1
    • Fix Version/s: 3.1.2, 3.2.2, 3.3.0
    • Component/s: tests
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      ASyncHammerTest is failing intermittently on hudson trunk. There is no clear reason why this is happening, but
      it seems from the logs that a session connection to a follower is failing during session establishment - the
      failure seems to be a problem either on the follower or leader. The server gets the session create request, but
      it stalls in the request processor pipeline. (we see it go in, but we do not see it com eout)

      unfortunately all efforts to reproduce this on non-hudson trunk have failed. Even trying to reproduce by
      running on hudson host itself (manually) has failed.

      We need to instrument the client session creation code in the test to dump the thread stack if the
      session creation fails.

      1. ZOOKEEPER-597-3.2.patch
        6 kB
        Mahadev konar
      2. ZOOKEEPER-597-3.1.patch
        4 kB
        Mahadev konar
      3. ZOOKEEPER-597_4.patch
        3 kB
        Mahadev konar
      4. ZOOKEEPER-597_3.patch
        2 kB
        Benjamin Reed
      5. ZOOKEEPER-597_3.patch
        0.7 kB
        Benjamin Reed
      6. ZOOKEEPER-597_2.patch
        0.7 kB
        Patrick Hunt
      7. ZOOKEEPER-597.patch
        4 kB
        Benjamin Reed
      8. ZOOKEEPER-597.patch
        4 kB
        Benjamin Reed
      9. ZOOKEEPER-597.patch
        3 kB
        Patrick Hunt

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          this patch updates the test to log the threads/stacks if an error occurrs during session est.

          Show
          Patrick Hunt added a comment - this patch updates the test to log the threads/stacks if an error occurrs during session est.
          Hide
          Mahadev konar added a comment -

          I just committed this. thanks pat.

          Show
          Mahadev konar added a comment - I just committed this. thanks pat.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #558 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/558/)

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #558 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/558/ )
          Hide
          Patrick Hunt added a comment -

          still an issue, reopening

          Show
          Patrick Hunt added a comment - still an issue, reopening
          Hide
          Flavio Junqueira added a comment -

          I have added some more log messages on my own to track the session that causes a run of the test to fail, and here is what I got for the culprit:

          2009-12-01 13:04:07,883 - INFO  [FollowerRequestProcessor:1:CommitProcessor@167] - Adding request to queue (CP): 1254a2022040016
          2009-12-01 13:04:07,884 - INFO  [ProcessThread:-1:PrepRequestProcessor@353] - Processing create session in PRP: 1254a2022040016
          2009-12-01 13:04:07,884 - INFO  [ProcessThread:-1:CommitProcessor@167] - Adding request to queue (CP): 1254a2022040016
          2009-12-01 13:04:07,886 - INFO  [SyncThread:2:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 1254a2022040016
          2009-12-01 13:04:07,886 - INFO  [SyncThread:1:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 1254a2022040016
          2009-12-01 13:04:07,886 - WARN  [LeanerHandler-/127.0.0.1:57817:Leader@470] - Processing ack (Leader): 1254a2022040016, 1
          2009-12-01 13:04:07,886 - WARN  [SyncThread:0:Leader@470] - Processing ack (Leader): 1254a2022040016, 2
          2009-12-01 13:04:07,887 - WARN  [SyncThread:0:Leader@481] - Going to apply (Leader): 1254a2022040016, 2
          2009-12-01 13:04:07,887 - WARN  [CommitProcessor:0:Leader$ToBeAppliedRequestProcessor@542] - Applying (TBARP): 1254a2022040016
          2009-12-01 13:04:40,000 - INFO  [SessionTracker:ZooKeeperServer@327] - Expiring session 0x1254a2022040016, timeout of 30000ms exceeded
          2009-12-01 13:04:40,000 - INFO  [ProcessThread:-1:PrepRequestProcessor@386] - Processed session termination for sessionid: 0x1254a2022040016
          

          For a session that has been correctly established, we can see that there is an extra message for FinalRequestProcessor:

          2009-12-01 13:04:37,924 - INFO  [FollowerRequestProcessor:2:CommitProcessor@167] - Adding request to queue (CP): 2254a2022070017
          2009-12-01 13:04:37,924 - INFO  [ProcessThread:-1:PrepRequestProcessor@353] - Processing create session in PRP: 2254a2022070017
          2009-12-01 13:04:37,925 - INFO  [ProcessThread:-1:CommitProcessor@167] - Adding request to queue (CP): 2254a2022070017
          2009-12-01 13:04:37,925 - WARN  [SyncThread:0:Leader@470] - Processing ack (Leader): 2254a2022070017, 1
          2009-12-01 13:04:37,925 - INFO  [SyncThread:2:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 2254a2022070017
          2009-12-01 13:04:37,925 - WARN  [LeanerHandler-/127.0.0.1:57817:Leader@470] - Processing ack (Leader): 2254a2022070017, 2
          2009-12-01 13:04:37,926 - WARN  [LeanerHandler-/127.0.0.1:57817:Leader@481] - Going to apply (Leader): 2254a2022070017, 2
          2009-12-01 13:04:37,926 - WARN  [CommitProcessor:0:Leader$ToBeAppliedRequestProcessor@542] - Applying (TBARP): 2254a2022070017
          2009-12-01 13:04:37,926 - INFO  [SyncThread:1:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 2254a2022070017
          2009-12-01 13:04:37,926 - INFO  [CommitProcessor:2:FinalRequestProcessor@175] - Processing create session in FRP: 2254a2022070017
          

          It sounds like the createSession request goes as far as ToBeAppliedProcessor, but it doesn't make it to FinalRequestProcessor. If my observation is correct, I think it is getting lost between the two. Is that possible?

          Show
          Flavio Junqueira added a comment - I have added some more log messages on my own to track the session that causes a run of the test to fail, and here is what I got for the culprit: 2009-12-01 13:04:07,883 - INFO [FollowerRequestProcessor:1:CommitProcessor@167] - Adding request to queue (CP): 1254a2022040016 2009-12-01 13:04:07,884 - INFO [ProcessThread:-1:PrepRequestProcessor@353] - Processing create session in PRP: 1254a2022040016 2009-12-01 13:04:07,884 - INFO [ProcessThread:-1:CommitProcessor@167] - Adding request to queue (CP): 1254a2022040016 2009-12-01 13:04:07,886 - INFO [SyncThread:2:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 1254a2022040016 2009-12-01 13:04:07,886 - INFO [SyncThread:1:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 1254a2022040016 2009-12-01 13:04:07,886 - WARN [LeanerHandler-/127.0.0.1:57817:Leader@470] - Processing ack (Leader): 1254a2022040016, 1 2009-12-01 13:04:07,886 - WARN [SyncThread:0:Leader@470] - Processing ack (Leader): 1254a2022040016, 2 2009-12-01 13:04:07,887 - WARN [SyncThread:0:Leader@481] - Going to apply (Leader): 1254a2022040016, 2 2009-12-01 13:04:07,887 - WARN [CommitProcessor:0:Leader$ToBeAppliedRequestProcessor@542] - Applying (TBARP): 1254a2022040016 2009-12-01 13:04:40,000 - INFO [SessionTracker:ZooKeeperServer@327] - Expiring session 0x1254a2022040016, timeout of 30000ms exceeded 2009-12-01 13:04:40,000 - INFO [ProcessThread:-1:PrepRequestProcessor@386] - Processed session termination for sessionid: 0x1254a2022040016 For a session that has been correctly established, we can see that there is an extra message for FinalRequestProcessor: 2009-12-01 13:04:37,924 - INFO [FollowerRequestProcessor:2:CommitProcessor@167] - Adding request to queue (CP): 2254a2022070017 2009-12-01 13:04:37,924 - INFO [ProcessThread:-1:PrepRequestProcessor@353] - Processing create session in PRP: 2254a2022070017 2009-12-01 13:04:37,925 - INFO [ProcessThread:-1:CommitProcessor@167] - Adding request to queue (CP): 2254a2022070017 2009-12-01 13:04:37,925 - WARN [SyncThread:0:Leader@470] - Processing ack (Leader): 2254a2022070017, 1 2009-12-01 13:04:37,925 - INFO [SyncThread:2:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 2254a2022070017 2009-12-01 13:04:37,925 - WARN [LeanerHandler-/127.0.0.1:57817:Leader@470] - Processing ack (Leader): 2254a2022070017, 2 2009-12-01 13:04:37,926 - WARN [LeanerHandler-/127.0.0.1:57817:Leader@481] - Going to apply (Leader): 2254a2022070017, 2 2009-12-01 13:04:37,926 - WARN [CommitProcessor:0:Leader$ToBeAppliedRequestProcessor@542] - Applying (TBARP): 2254a2022070017 2009-12-01 13:04:37,926 - INFO [SyncThread:1:SendAckRequestProcessor@41] - Send ack is processing create session (SARP): 2254a2022070017 2009-12-01 13:04:37,926 - INFO [CommitProcessor:2:FinalRequestProcessor@175] - Processing create session in FRP: 2254a2022070017 It sounds like the createSession request goes as far as ToBeAppliedProcessor, but it doesn't make it to FinalRequestProcessor. If my observation is correct, I think it is getting lost between the two. Is that possible?
          Hide
          Flavio Junqueira added a comment -

          I have also been able to verify that createSession operations that do not complete are exiting FinalRequestProcessor.processRequest here (around line 138):

          if (request.cnxn == null) {
                      return;
          }
          

          which is executed before the switch/case block that would finalize the operation.

          Show
          Flavio Junqueira added a comment - I have also been able to verify that createSession operations that do not complete are exiting FinalRequestProcessor.processRequest here (around line 138): if (request.cnxn == null) { return; } which is executed before the switch/case block that would finalize the operation.
          Hide
          Patrick Hunt added a comment -

          according to the latest log the commit processor thread is exiting. I notice that we are not logging exceptions from that thread. We should include logging the exception as part of this fix. Really we need to add to the ThreadGroup – handle uncaught exceptions – log them at error level

          Show
          Patrick Hunt added a comment - according to the latest log the commit processor thread is exiting. I notice that we are not logging exceptions from that thread. We should include logging the exception as part of this fix. Really we need to add to the ThreadGroup – handle uncaught exceptions – log them at error level
          Hide
          Flavio Junqueira added a comment -

          It is ok to have cnxn = null in FinalRequestProcessor. For example, if a follower is forwarding a request, cnxn will be null for the leader. The problem, as Pat points out, seems to be that CommitProcessor is exiting at the follower that was supposed to finalize it.

          Here is the stack trace from a faulty run:

          java.nio.channels.CancelledKeyException
          	at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
          	at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
          	at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:350)
          	at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1065)
          	at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:360)
          	at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
          
          Show
          Flavio Junqueira added a comment - It is ok to have cnxn = null in FinalRequestProcessor. For example, if a follower is forwarding a request, cnxn will be null for the leader. The problem, as Pat points out, seems to be that CommitProcessor is exiting at the follower that was supposed to finalize it. Here is the stack trace from a faulty run: java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59) at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:350) at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1065) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:360) at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
          Hide
          Patrick Hunt added a comment -

          added link to 602 for logging uncaught exception

          Show
          Patrick Hunt added a comment - added link to 602 for logging uncaught exception
          Hide
          Benjamin Reed added a comment -

          A runtime exception was killing the CommitProcessor.

          Show
          Benjamin Reed added a comment - A runtime exception was killing the CommitProcessor.
          Hide
          Mahadev konar added a comment -

          +1 ... one minor nit .. can you change Throwable to Exception?

          Show
          Mahadev konar added a comment - +1 ... one minor nit .. can you change Throwable to Exception?
          Hide
          Benjamin Reed added a comment -

          changes to cache Exception not Throwable.

          Show
          Benjamin Reed added a comment - changes to cache Exception not Throwable.
          Hide
          Benjamin Reed added a comment -

          Committed revision 886241.
          (waiting for a couple of successful runs before i close)

          Show
          Benjamin Reed added a comment - Committed revision 886241. (waiting for a couple of successful runs before i close)
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #565 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/565/)
          . ASyncHammerTest is failing intermittently on hudson trunk

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #565 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/565/ ) . ASyncHammerTest is failing intermittently on hudson trunk
          Hide
          Patrick Hunt added a comment -

          log the throwable if commit rp exits

          Show
          Patrick Hunt added a comment - log the throwable if commit rp exits
          Hide
          Patrick Hunt added a comment -

          please commit 597_2 to trunk this will allow us to see why commit rp is exiting

          Show
          Patrick Hunt added a comment - please commit 597_2 to trunk this will allow us to see why commit rp is exiting
          Hide
          Mahadev konar added a comment -

          I just committed 597_2.

          Show
          Mahadev konar added a comment - I just committed 597_2.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #569 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/569/)

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #569 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/569/ )
          Hide
          Benjamin Reed added a comment -

          this installs an uncaught exception handler for all threads.

          Show
          Benjamin Reed added a comment - this installs an uncaught exception handler for all threads.
          Hide
          Benjamin Reed added a comment -

          this patch removes the deadlock and the useless flag that was causing the deadlock and installs a default uncaught exception handler.

          Show
          Benjamin Reed added a comment - this patch removes the deadlock and the useless flag that was causing the deadlock and installs a default uncaught exception handler.
          Hide
          Mahadev konar added a comment -

          +1 for the patch. i just committed this.

          Show
          Mahadev konar added a comment - +1 for the patch. i just committed this.
          Hide
          Mahadev konar added a comment -

          looks like the build is normal now.

          http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/571/

          ill let the hudson trunk build run a few more times and then close this jira.

          Show
          Mahadev konar added a comment - looks like the build is normal now. http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/571/ ill let the hudson trunk build run a few more times and then close this jira.
          Hide
          Patrick Hunt added a comment -

          this is still happening, but in another area:

          http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/577/testReport/org.apache.zookeeper.test/AsyncHammerTest/testObserversHammer/

          It is good to have the handler here!

          2009-12-04 13:46:13,493 - ERROR [CommitProcessor:0:CommitProcessor@146] - Unexpected exception causing CommitProcessor to exit
          java.nio.channels.CancelledKeyException
          at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
          at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:64)
          at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:378)
          at org.apache.zookeeper.server.NIOServerCnxn.sendCloseSession(NIOServerCnxn.java:349)
          at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:356)
          at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:532)
          at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

          Show
          Patrick Hunt added a comment - this is still happening, but in another area: http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/577/testReport/org.apache.zookeeper.test/AsyncHammerTest/testObserversHammer/ It is good to have the handler here! 2009-12-04 13:46:13,493 - ERROR [CommitProcessor:0:CommitProcessor@146] - Unexpected exception causing CommitProcessor to exit java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:64) at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:378) at org.apache.zookeeper.server.NIOServerCnxn.sendCloseSession(NIOServerCnxn.java:349) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:356) at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:532) at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
          Hide
          Mahadev konar added a comment -

          this patch fixes the 2 latest failuires we saw:

          • the other cancelled key exception is fixed by wrapping sendbuffer with try catch for exceptions.

          i looked for any other calls that can cause commitprocessor to fail in this way and could not find anything else. hopefully this is the last of such problems.

          Show
          Mahadev konar added a comment - this patch fixes the 2 latest failuires we saw: http://bugs.sun.com/view_bug.do?bug_id=6427854 , a bug in the jvm which is fixed by a selector.open.close() call in a static block in NIOServerCnxn (thanks ben for the idea) the other cancelled key exception is fixed by wrapping sendbuffer with try catch for exceptions. i looked for any other calls that can cause commitprocessor to fail in this way and could not find anything else. hopefully this is the last of such problems.
          Hide
          Benjamin Reed added a comment -

          +1 looks great

          Show
          Benjamin Reed added a comment - +1 looks great
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #578 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/578/)
          . ASyncHammerTest is failing intermittently on hudson trunk (take 5) (mahadev)

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #578 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/578/ ) . ASyncHammerTest is failing intermittently on hudson trunk (take 5) (mahadev)
          Hide
          Mahadev konar added a comment -

          patch for 3.1 release.

          Show
          Mahadev konar added a comment - patch for 3.1 release.
          Hide
          Mahadev konar added a comment -

          patch for 3.2 release.

          Show
          Mahadev konar added a comment - patch for 3.2 release.
          Hide
          Benjamin Reed added a comment -

          +1 for 3.1 and 3.2

          Show
          Benjamin Reed added a comment - +1 for 3.1 and 3.2
          Hide
          Mahadev konar added a comment -

          ant test passes for both 3.1 and 3.2 branch.

          Show
          Mahadev konar added a comment - ant test passes for both 3.1 and 3.2 branch.
          Hide
          Mahadev konar added a comment -

          I just committed this to 3.1 and 3.2 branches.

          Show
          Mahadev konar added a comment - I just committed this to 3.1 and 3.2 branches.

            People

            • Assignee:
              Benjamin Reed
              Reporter:
              Patrick Hunt
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development