Uploaded image for project: 'ActiveMQ Classic'
  1. ActiveMQ Classic
  2. AMQ-2595

NFS problems can cause broker to hang on shutdown requiring a hard kill

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 5.3.0
    • 5.4.3
    • Broker
    • None

    Description

      When the ActiveMQ data directory exists on a NFS mounted volume and that volume experiences some type of failure, it can cause the broker to hang and only a manual kill will shut it down. Below is a sequence of errors demonstrating this:

       
      ERROR | Failed to fill batch
      java.io.IOException: I/O error
            at java.io.RandomAccessFile.readBytes(Native Method)
            at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
            at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
            at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
            at org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:90)
            at org.apache.kahadb.journal.Journal.read(Journal.java:573)
            at org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:656)
            at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:534)
            at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:229)
            at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
            at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:222)
            at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
            at org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
            at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1363)
            at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1503)
            at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1178)
            at org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
            at org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
            at java.lang.Thread.run(Thread.java:619)
      ERROR | Failed to fill batch
      java.lang.RuntimeException: java.io.IOException: I/O error
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:230)
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
            at org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
            at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1363)
            at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1503)
            at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1178)
            at org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
            at org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
            at java.lang.Thread.run(Thread.java:619)
      Caused by: java.io.IOException: I/O error
            at java.io.RandomAccessFile.readBytes(Native Method)
            at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
            at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
            at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
            at org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:90)
            at org.apache.kahadb.journal.Journal.read(Journal.java:573)
            at org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:656)
            at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:534)
            at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:229)
            at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
            at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:222)
            at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
            ... 10 more
      ERROR | Failed to page in more queue messages
      java.lang.RuntimeException: java.lang.RuntimeException:
      java.io.IOException: I/O error
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:102)
            at org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
            at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1363)
            at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1503)
            at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1178)
            at org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
            at org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
            at java.lang.Thread.run(Thread.java:619)
      Caused by: java.lang.RuntimeException: java.io.IOException: I/O error
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:230)
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
            ... 9 more
      Caused by: java.io.IOException: I/O error
            at java.io.RandomAccessFile.readBytes(Native Method)
            at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
            at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
            at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
            at org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:90)
            at org.apache.kahadb.journal.Journal.read(Journal.java:573)
            at org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:656)
            at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:534)
            at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:229)
            at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
            at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:222)
            at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
            at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
            ... 10 more
      java.io.IOException: I/O error
            at java.io.RandomAccessFile.write(Native Method)
            at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:967)
            at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:974)
            at org.apache.kahadb.page.PageFile.flush(PageFile.java:443)
            at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:947)
            at org.apache.activemq.store.kahadb.MessageDatabase.access$200(MessageDatabase.java:79)
            at org.apache.activemq.store.kahadb.MessageDatabase$9.execute(MessageDatabase.java:587)
            at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
            at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:585)
            at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:256)
      Exception in thread "ActiveMQ Journal Checkpoint Worker"
      java.lang.NullPointerException
            at java.util.zip.Adler32.update(Adler32.java:44)
            at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:962)
            at org.apache.kahadb.page.PageFile.flush(PageFile.java:443)
            at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:947)
            at org.apache.activemq.store.kahadb.MessageDatabase.access$200(MessageDatabase.java:79)
            at org.apache.activemq.store.kahadb.MessageDatabase$9.execute(MessageDatabase.java:587)
            at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
            at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:585)
            at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:256)
      ^C INFO | ActiveMQ Message Broker (dynamic-broker1,
      ID:tssol10s-32974-1265131581698-0:0) is shutting down
      INFO | Connector vm://dynamic-broker1 Stopped
      INFO | dynamic-broker1 bridge to dynamic-broker2 stopped
      INFO | Network Connector localhost Stopped
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
      

      In this case, a problem occurred with the NFS mount point where the data directory lives which resulted in the I/O error. After sending a ctrl-c to shut down the broker, it would not shut down on it's own so it had to be killed. Only then could failover take place allowing another broker could grab the lock.

      Given that the broker cannot recover from such a problem, it should probably force itself to shut down so that failover can take place. Also, such a failure should be a FATAL and not an ERROR in the logging.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              bsnyder Bruce Snyder
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: