ActiveMQ
  1. ActiveMQ
  2. AMQ-1116

deadlock when shutting down client that is configured with failover=true and is presently disconnected from broker

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.2
    • Fix Version/s: 5.1.0
    • Component/s: Transport
    • Labels:
      None

      Description

      Sounds like a great idea to me Please go ahead and submit a jira with the patch and we'll get it integrated ASAP

      On 1/4/07, Chris Hofstaedter <chrish@nmwco.com> wrote:
      > I ran into this issue as well on <= 4.0.2. I never tested it against
      > 4.1. The FailoverTransport doesn't shutdown if it is disconnected at
      > the time of the shutdown.
      >
      > I believe that the root cause is in the
      > FailoverTransport.oneway(Command
      > command) function which will hold onto the command and keep trying to
      > reconnect until the message is sent or the transport is disposed.
      >
      > I applied a local patch that seems to solve the problem. Note that
      > I've only tested the patch against 4.0.2.
      >
      > Specifically, I early return from the oneway function if the command
      > is ShutdownInfo and the connectedTransport is null. This seems to
      > solve the problem in that I get clean shutdowns on the client when
      > failover is true and the client is presently disconnected.
      >
      > Does this seem like a reasonable patch? Should I create a JIRA with
      > this info?
      >
      >
      >
      > ----Original Message----
      > From: James Strachan james.strachan@gmail.com
      > Sent: Tuesday, December 12, 2006 4:46 AM
      > To: activemq-users@geronimo.apache.org
      > Subject: Re: failover mode and client shutdown
      >
      > There could be a bug relating to closes with the failover transport
      > possibly, but the ActiveMQConnection does wait up to the closeTimeout
      > for a close to succeed before shutting down - so you could try reduce
      > the timeout.
      >
      > http://incubator.apache.org/activemq/maven/activemq-core/apidocs/org/a
      > pa
      > che/activemq/ActiveMQConnection.html#setCloseTimeout(int)
      >
      >
      > On 12/12/06, Keith Irwin <keith.irwin@gmail.com> wrote:
      > > Folks--
      > >
      > > When we have clients running and we take down AMQ (<= 4.1.0), then
      > > attempt to shutdown the clients with Control-C (rather than kill the
      > > JVM with a -9), the clients won't shut down. It's as if a "close"
      > > on the failover connection never reaches the amq client classes.
      > >
      > > I note that in the 4.1.0 release notes, this issue is referenced,
      > > and the advice is to set the maxReconnectAttempts (or similar)
      > > property to something non-zero.
      > >
      > > The problem is that we don't want there to be a max number of
      > > attempts. Unless we specifically want to take down the client (say,
      > > for an apt-get package upgrade), we want it to keep on trying
      > > forever.
      > >
      > > SO, my question: Is there an architectural reason for not being able
      > > to close a failover connection if AMQ is down?
      > >
      > > If it isn't impossible due to tradeoffs elsewhere in the code base,
      > > we might be willing to submit a patch to fix the issue.
      > >
      > > Our only other recourse is to attempt to close the connections in
      > > separate threads, then timeout those threads after a while and fall
      > > out the end of the java process.
      > >
      > > For instance:
      > >
      > > Thread th = new Thread(new Runnable() {
      > > public void run()

      { > > connection.close(); > > }

      > > });
      > > th.start();
      > >
      > > // give up after 2 seconds
      > > Thread.currentThread().join(2000);
      > >
      > > I guess this is do-able, but it seems, you know, some how, well,
      > wrong.
      > >
      > > So, is it worth investigating a patch to AMQ?
      > >
      > > Keith
      > >

      1. FailoverTransport.java
        20 kB
        Chris Hofstaedter
      2. FailoverTransport.java
        27 kB
        Jason Rosenberg
      3. FailoverTransport_AMQ-1116.patch
        3 kB
        Adam Lewandowski
      4. DemandForwardingBridgeSupport.java
        38 kB
        Chris Hofstaedter
      5. ConnectionPool.java
        8 kB
        Jason Rosenberg

        Activity

        Hide
        Cameron Braid added a comment -

        Is there any chance that this could be fixed for 4.1.1 ?

        Show
        Cameron Braid added a comment - Is there any chance that this could be fixed for 4.1.1 ?
        Hide
        Chris Hofstaedter added a comment -

        Cameron's recent comment made me realize that I never attached the files with my local patches in them for this issue. So here they are. These are 4.0.2 files. Also there are patches for multiple issues in here. You can find the blocks of code that patch this issue (for me anyway) by searching for "AMQ-1116 fix"

        Show
        Chris Hofstaedter added a comment - Cameron's recent comment made me realize that I never attached the files with my local patches in them for this issue. So here they are. These are 4.0.2 files. Also there are patches for multiple issues in here. You can find the blocks of code that patch this issue (for me anyway) by searching for " AMQ-1116 fix"
        Hide
        Adam Lewandowski added a comment -

        Patch to prevent ShutdownInfo or RemoveInfo commands from being sent when transport is not connected

        Show
        Adam Lewandowski added a comment - Patch to prevent ShutdownInfo or RemoveInfo commands from being sent when transport is not connected
        Hide
        Adam Lewandowski added a comment -

        Just a confirmation that this is still an issue. I've submitted a patch that does basically the same thing that Chris's does, but only affects FailoverTransport. In addition to ignoring ShutdownInfo commands when not connected, it also short-circuits a reply to RemoveInfo, which was also causing a problem.

        Show
        Adam Lewandowski added a comment - Just a confirmation that this is still an issue. I've submitted a patch that does basically the same thing that Chris's does, but only affects FailoverTransport. In addition to ignoring ShutdownInfo commands when not connected, it also short-circuits a reply to RemoveInfo, which was also causing a problem.
        Hide
        Jason Rosenberg added a comment -

        I can confirm that I am seeing this issue, under the latest 5.1-SNAPSHOT.....

        The record shows the fixed version is 5.2.0....Is that being actively developed now, or is that just a placeholder to indicate this issue is being scheduled for work after the current 5.1 cycle?

        Jason

        Show
        Jason Rosenberg added a comment - I can confirm that I am seeing this issue, under the latest 5.1-SNAPSHOT..... The record shows the fixed version is 5.2.0....Is that being actively developed now, or is that just a placeholder to indicate this issue is being scheduled for work after the current 5.1 cycle? Jason
        Hide
        Jason Rosenberg added a comment -

        Wanted to add, I applied Adam's patch to the latest 5.1-SNAPSHOT version, and it appears to be working great.

        Jason

        Show
        Jason Rosenberg added a comment - Wanted to add, I applied Adam's patch to the latest 5.1-SNAPSHOT version, and it appears to be working great. Jason
        Hide
        Jason Rosenberg added a comment -

        I've attached a patch that I created for several issues, against a recent 5.1-SNAPSHOT revision (svn rev 618082).

        The file FailoverTransport.java, includes fixes for 4 issues, including Adam's fix for this issue.

        The other issues that this patch includes fixes for are:

        AMQ-1116
        AMQ-1575
        AMQ-1577
        AMQ-1581

        The file ConnectionPool.java is also needed for the fix for AMQ-1581

        I've added comments in the code that clearly indicate which sections apply to which issue, so you can easily take edit to include only the fixes you want specifically.

        Jason

        Show
        Jason Rosenberg added a comment - I've attached a patch that I created for several issues, against a recent 5.1-SNAPSHOT revision (svn rev 618082). The file FailoverTransport.java, includes fixes for 4 issues, including Adam's fix for this issue. The other issues that this patch includes fixes for are: AMQ-1116 AMQ-1575 AMQ-1577 AMQ-1581 The file ConnectionPool.java is also needed for the fix for AMQ-1581 I've added comments in the code that clearly indicate which sections apply to which issue, so you can easily take edit to include only the fixes you want specifically. Jason
        Hide
        Rob Davies added a comment -

        Fixed by svn revision 619336

        Show
        Rob Davies added a comment - Fixed by svn revision 619336
        Hide
        Jay Repko added a comment -

        I'm not sure that this is actually fixed. I am currently trying it out with a broker in one VM and a client in another. If I kill the broker forcibly and then try to execute the close method of ActiveMQConnection, I get stuck in the failover transport until the reconnect tries expires after 10 minutes.

        It gets stuck in the close when it calls doSyncSendPacket()
        <code>
        if (isConnectionInfoSentToBroker)

        { // If we announced ourselfs to the broker.. Try to let // the broker // know that the connection is being shutdown. doSyncSendPacket(info.createRemoveCommand(), closeTimeout); doAsyncSendPacket(new ShutdownInfo()); }

        </code>

        This is because another message ack is waiting for the transport to reconnect to be sent and so is holding the writerMutex in MutexTransport.

        Show
        Jay Repko added a comment - I'm not sure that this is actually fixed. I am currently trying it out with a broker in one VM and a client in another. If I kill the broker forcibly and then try to execute the close method of ActiveMQConnection, I get stuck in the failover transport until the reconnect tries expires after 10 minutes. It gets stuck in the close when it calls doSyncSendPacket() <code> if (isConnectionInfoSentToBroker) { // If we announced ourselfs to the broker.. Try to let // the broker // know that the connection is being shutdown. doSyncSendPacket(info.createRemoveCommand(), closeTimeout); doAsyncSendPacket(new ShutdownInfo()); } </code> This is because another message ack is waiting for the transport to reconnect to be sent and so is holding the writerMutex in MutexTransport.
        Hide
        Brett Humphreys added a comment -

        I agree with Jay Repko. I'm using AMQ 5.1.0, I've got the following JUnit test to try to reproduce what I'm experiencing:

          @Test(timeout=15000)
          public void testRetryServerConnectionURLWithNoServer() throws Exception
          {
              final String url="tcp://localhost:61620";
              final String clientUrl = "failover://(" + url + ")?maxReconnectAttempts=0";
              ActiveMQConnectionFactory activeMQConnectionFactory = new ActiveMQConnectionFactory(clientUrl);
              activeMQConnectionFactory.setTransportListener(new PrintingTransportListener());
              final Connection conn = activeMQConnectionFactory.createConnection();
              ((ActiveMQConnection)conn).setCloseTimeout(5);
              Thread clientThread=new Thread("AMQTestClientThread")
              {
                  public void run()
                  {
                      try
                      {
                          conn.start();
                          Thread.sleep(2000);
                          conn.stop();
                          conn.close();
                      }
                      catch(Exception e)
                      {
                          thrownException = e;
                      }
                  }
              };
        
              clientThread.start();
              Thread.sleep(2000);
              System.out.println("Attempting to shutdown");
              conn.close();
              System.out.println("Shutdown, joining.");
              clientThread.join();
              if(thrownException !=null)
              {
                  throw thrownException;
              }
          }
        
        

        This test always times out after 15 seconds, setCloseTimeout doesn't seem to change any behavior.

        Further, when I kill -3 this, I get the following three thread stack traces (The client thread, the reconnecting thread & the stopping thread:

         
        "AMQTestClientThread" prio=1 tid=0x00002aaafc0069f0 nid=0x6f69 in Object.wait() [0x000000004101d000..0x000000004101dc80]
        	at java.lang.Object.wait(Native Method)
        	- waiting on <0x00002aaadf2bc5f8> (a java.lang.Object)
        	at org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:407)
        	- locked <0x00002aaadf2bc5f8> (a java.lang.Object)
        	at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:40)
        	- locked <0x00002aaadf2bc158> (a java.lang.Object)
        	at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:74)
        	at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:79)
        	at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1195)
        	at org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1289)
        	- locked <0x00002aaadf2ab250> (a org.apache.activemq.ActiveMQConnection)
        	at org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:456)
        	at com.kentrox.objectproxy.jms.test.JMSConnectionStateTesting$2.run(JMSConnectionStateTesting.java:235)
        
        "ActiveMQ Task" daemon prio=1 tid=0x00002aaafc02ab80 nid=0x6f68 in Object.wait() [0x0000000040f1c000..0x0000000040f1ce00]
        	at java.lang.Object.wait(Native Method)
        	- waiting on <0x00002aaadf2bc618> (a java.lang.Object)
        	at org.apache.activemq.transport.failover.FailoverTransport.doReconnect(FailoverTransport.java:735)
        	- locked <0x00002aaadf2bc618> (a java.lang.Object)
        	at org.apache.activemq.transport.failover.FailoverTransport$2.iterate(FailoverTransport.java:114)
        	- locked <0x00002aaadf2bc608> (a java.lang.Object)
        	at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:122)
        	at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:43)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        	at java.lang.Thread.run(Thread.java:595)
        
        "pool-1-thread-1" prio=1 tid=0x00002aaafc001670 nid=0x6f67 waiting for monitor entry [0x0000000040e1b000..0x0000000040e1bd80]
        	at org.apache.activemq.ActiveMQConnection.checkClosedOrFailed(ActiveMQConnection.java:1258)
        	- waiting to lock <0x00002aaadf2ab250> (a org.apache.activemq.ActiveMQConnection)
        	at org.apache.activemq.ActiveMQConnection.stop(ActiveMQConnection.java:497)
        	at org.apache.activemq.ActiveMQConnection.close(ActiveMQConnection.java:552)
        	at com.kentrox.objectproxy.jms.test.JMSConnectionStateTesting.testRetryServerConnectionURLWithNoServer(JMSConnectionStateTesting.java:250)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:585)
        	at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59)
        	at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98)
        	at org.junit.internal.runners.MethodRoadie$1$1.call(MethodRoadie.java:55)
        	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        	at java.lang.Thread.run(Thread.java:595)
        

        This may be updated in 5.2.0 but this issue suggests it is fixed in 5.1.0 and I don't believe that to be the case.

        Show
        Brett Humphreys added a comment - I agree with Jay Repko. I'm using AMQ 5.1.0, I've got the following JUnit test to try to reproduce what I'm experiencing: @Test(timeout=15000) public void testRetryServerConnectionURLWithNoServer() throws Exception { final String url="tcp://localhost:61620"; final String clientUrl = "failover://(" + url + ")?maxReconnectAttempts=0"; ActiveMQConnectionFactory activeMQConnectionFactory = new ActiveMQConnectionFactory(clientUrl); activeMQConnectionFactory.setTransportListener(new PrintingTransportListener()); final Connection conn = activeMQConnectionFactory.createConnection(); ((ActiveMQConnection)conn).setCloseTimeout(5); Thread clientThread=new Thread("AMQTestClientThread") { public void run() { try { conn.start(); Thread.sleep(2000); conn.stop(); conn.close(); } catch(Exception e) { thrownException = e; } } }; clientThread.start(); Thread.sleep(2000); System.out.println("Attempting to shutdown"); conn.close(); System.out.println("Shutdown, joining."); clientThread.join(); if(thrownException !=null) { throw thrownException; } } This test always times out after 15 seconds, setCloseTimeout doesn't seem to change any behavior. Further, when I kill -3 this, I get the following three thread stack traces (The client thread, the reconnecting thread & the stopping thread: "AMQTestClientThread" prio=1 tid=0x00002aaafc0069f0 nid=0x6f69 in Object.wait() [0x000000004101d000..0x000000004101dc80] at java.lang.Object.wait(Native Method) - waiting on <0x00002aaadf2bc5f8> (a java.lang.Object) at org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:407) - locked <0x00002aaadf2bc5f8> (a java.lang.Object) at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:40) - locked <0x00002aaadf2bc158> (a java.lang.Object) at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:74) at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:79) at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1195) at org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1289) - locked <0x00002aaadf2ab250> (a org.apache.activemq.ActiveMQConnection) at org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:456) at com.kentrox.objectproxy.jms.test.JMSConnectionStateTesting$2.run(JMSConnectionStateTesting.java:235) "ActiveMQ Task" daemon prio=1 tid=0x00002aaafc02ab80 nid=0x6f68 in Object.wait() [0x0000000040f1c000..0x0000000040f1ce00] at java.lang.Object.wait(Native Method) - waiting on <0x00002aaadf2bc618> (a java.lang.Object) at org.apache.activemq.transport.failover.FailoverTransport.doReconnect(FailoverTransport.java:735) - locked <0x00002aaadf2bc618> (a java.lang.Object) at org.apache.activemq.transport.failover.FailoverTransport$2.iterate(FailoverTransport.java:114) - locked <0x00002aaadf2bc608> (a java.lang.Object) at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:122) at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:43) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675) at java.lang.Thread.run(Thread.java:595) "pool-1-thread-1" prio=1 tid=0x00002aaafc001670 nid=0x6f67 waiting for monitor entry [0x0000000040e1b000..0x0000000040e1bd80] at org.apache.activemq.ActiveMQConnection.checkClosedOrFailed(ActiveMQConnection.java:1258) - waiting to lock <0x00002aaadf2ab250> (a org.apache.activemq.ActiveMQConnection) at org.apache.activemq.ActiveMQConnection.stop(ActiveMQConnection.java:497) at org.apache.activemq.ActiveMQConnection.close(ActiveMQConnection.java:552) at com.kentrox.objectproxy.jms.test.JMSConnectionStateTesting.testRetryServerConnectionURLWithNoServer(JMSConnectionStateTesting.java:250) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59) at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98) at org.junit.internal.runners.MethodRoadie$1$1.call(MethodRoadie.java:55) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269) at java.util.concurrent.FutureTask.run(FutureTask.java:123) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675) at java.lang.Thread.run(Thread.java:595) This may be updated in 5.2.0 but this issue suggests it is fixed in 5.1.0 and I don't believe that to be the case.
        Hide
        Brett Humphreys added a comment -

        Running the same JUnit test as above, 5.2.0 seems to fix this deadlock issue.

        Show
        Brett Humphreys added a comment - Running the same JUnit test as above, 5.2.0 seems to fix this deadlock issue.

          People

          • Assignee:
            Rob Davies
            Reporter:
            Chris Hofstaedter
          • Votes:
            2 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development