ActiveMQ / AMQ-3725

KahaDB error during SAN failover delayed write - allow KahaDB to recover in a similar manner to the JDBC store using the IOExceptionHandler

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.1
    • Fix Version/s: 5.9.1, 5.10.0
    • Component/s: Message Store
    • Labels:
      None

      Description

      An issue can arise that causes the broker to terminate when using KahaDB with a SAN and the SAN fails over. The failover itself is seamless; however, on failback there is a 2-3 second window where writes are blocked and the broker terminates. With the JDBC store a similar situation can be handled by using the IOExceptionHandler. With KahaDB, however, adding the same IOExceptionHandler prevents the broker from terminating but leaves KahaDB with an invalid index.
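
      For reference, the handler is wired into the broker configuration (activemq.xml). The following is only a minimal sketch: the ioExceptionHandler/defaultIOExceptionHandler elements are the ones discussed in the comments below, while the broker name and kahaDB directory are taken from the log output and are purely illustrative.

       <broker xmlns="http://activemq.apache.org/schema/core" brokerName="Broker1">
         <persistenceAdapter>
           <!-- SAN-backed store, as seen in the log below; adjust for your environment -->
           <kahaDB directory="/Volumes/NAS-01/data/kahadb"/>
         </persistenceAdapter>
         <!-- decides whether the broker shuts down or tries to recover when the store hits an IO error -->
         <ioExceptionHandler>
           <defaultIOExceptionHandler stopStartConnectors="true"/>
         </ioExceptionHandler>
       </broker>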

       INFO | ActiveMQ JMS Message Broker (Broker1, ID:macbookpro-251a.home-56915-1328715089252-0:1) started
       INFO | jetty-7.1.6.v20100715
       INFO | ActiveMQ WebConsole initialized.
       INFO | Initializing Spring FrameworkServlet 'dispatcher'
       INFO | ActiveMQ Console at http://0.0.0.0:8161/admin
       INFO | ActiveMQ Web Demos at http://0.0.0.0:8161/demo
       INFO | RESTful file access application at http://0.0.0.0:8161/fileserver
       INFO | FUSE Web Console at http://0.0.0.0:8161/console
       INFO | Started SelectChannelConnector@0.0.0.0:8161
      ERROR | KahaDB failed to store to Journal
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.SyncFailedException: sync failed
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
      ERROR | Checkpoint failed
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.SyncFailedException: sync failed
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
      ERROR | KahaDB failed to store to Journal
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
      ERROR | KahaDB failed to store to Journal
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       WARN | Transport failed: java.io.EOFException
       WARN | Transport failed: java.io.EOFException
       INFO | KahaDB: Recovering checkpoint thread after death
      ERROR | Checkpoint failed
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       INFO | Ignoring IO exception, java.io.IOException: Input/output error
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       INFO | KahaDB: Recovering checkpoint thread after death
      ERROR | Checkpoint failed
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       INFO | Ignoring IO exception, java.io.IOException: Input/output error
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       WARN | Transport failed: java.io.EOFException
      

        Activity

        Jason Sherman created issue -
        SuoNayi added a comment -

        Jason, we are going to set up master-slave brokers with a SAN but have run into trouble with synchronization.
        We mount the SAN to the same directory path on two servers, but operations such as creating or deleting files in the shared directory do not appear to be synchronous.
        One server cannot see changes made by the other if not remounting manually.
        I know that even NFS can work well, but it's not recommended as the shared storage with AMQ.
        Can you share some experience about setting up shared storage with a SAN?
        Thanks.

        Rob Davies made changes -
        Field Original Value New Value
        Fix Version/s 5.6.1 [ 12320766 ]
        Fix Version/s 5.x [ 12316331 ]
        Fix Version/s 5.6.0 [ 12317974 ]
        Dejan Bosanac made changes -
        Fix Version/s 5.7.0 [ 12321258 ]
        Fix Version/s 5.6.1 [ 12320766 ]
        Sree Panchajanyam D added a comment -

        We have implemented a GFS-based solution with I/O fencing (also called disk fencing). I/O fencing is used to avoid the "split brain" scenario where both the active and the passive server write to the disk.

        "One server cannot see changes made by the other if not remounting manually." Could you please make this sentence clearer?

        Claus Ibsen made changes -
        Fix Version/s 5.8.0 [ 12323282 ]
        Fix Version/s 5.7.0 [ 12321258 ]
        Gary Tully made changes -
        Fix Version/s 5.9.0 [ 12323932 ]
        Fix Version/s 5.8.0 [ 12323282 ]
        Rob Waite added a comment -

        We ran into a similar issue, but we are missing some of the exceptions shown in the description. We are running 5.7.0.

         
        KahaDB failed to store to Journal | org.apache.activemq.store.kahadb.MessageDatabase | ConcurrentQueueStoreAndDispatch
        java.io.SyncFailedException: sync failed
                at java.io.FileDescriptor.sync(Native Method)
                at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:367)
                at org.apache.kahadb.journal.DataFileAppender$1.run(DataFileAppender.java:188)
        2013-02-13 17:07:57,427 | INFO  | Stopping the broker due to exception, java.io.SyncFailedException: sync failed | org.apache.activemq.util.DefaultIOExceptionHandler | ConcurrentQueueStoreAndDispatch
        java.io.SyncFailedException: sync failed
                at java.io.FileDescriptor.sync(Native Method)
                at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:367)
                at org.apache.kahadb.journal.DataFileAppender$1.run(DataFileAppender.java:188)
        2013-02-13 17:07:57,517 | INFO  | Apache ActiveMQ 5.7.0 (habroker, ID:HOSTNAME-35138-1360038011880-0:1) is shutting down | org.apache.activemq.broker.BrokerService | Stopping the broker due to IO exception
        
        

        After this, AMQ immediately went into shutdown mode. We did not see the FileNotFoundException, the messages about checkpoint recovery with the IOException, or the transport failures from EOFException. We did see the transport failures on the clients. The Java client also had an EOFException. Our C# client, which uses NMS, gave the message "Unable to read beyond the end of the stream".

        When we restarted AMQ, it would immediately shut down due to the missing journal files. We set the ignoreMissingJournalfiles=true parameter on KahaDB in activemq.xml. This time AMQ came up and said that it had lost 33 messages.
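
        (As a minimal sketch, assuming the standard kahaDB persistence adapter element, that attribute sits in activemq.xml roughly as follows; the directory shown is only a placeholder.)

        <persistenceAdapter>
          <!-- ignoreMissingJournalfiles lets the index load even when journal files it references are missing -->
          <kahaDB directory="${activemq.data}/kahadb" ignoreMissingJournalfiles="true"/>
        </persistenceAdapter>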

        We then saw a number of errors (they actually show up several times, with maybe three variations, almost as if multiple loggers are kicking off):

        2013-02-13 17:46:14,892 | ERROR | Problem retrieving message for browse | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[habroker] Scheduler
        java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Invalid location: 131:3886117, : java.lang.NegativeArraySizeException
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:113)
        	at org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
        	at org.apache.activemq.broker.region.Queue.doBrowse(Queue.java:1066)
        	at org.apache.activemq.broker.region.Queue.expireMessages(Queue.java:832)
        	at org.apache.activemq.broker.region.Queue.access$100(Queue.java:98)
        	at org.apache.activemq.broker.region.Queue$2.run(Queue.java:138)
        	at org.apache.activemq.thread.SchedulerTimerTask.run(SchedulerTimerTask.java:33)
        	at java.util.TimerThread.mainLoop(Timer.java:534)
        	at java.util.TimerThread.run(Timer.java:484)
        Caused by: java.lang.RuntimeException: java.io.IOException: Invalid location: 131:3886117, : java.lang.NegativeArraySizeException
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:277)
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:110)
        	... 8 more
        Caused by: java.io.IOException: Invalid location: 131:3886117, : java.lang.NegativeArraySizeException
        	at org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:92)
        	at org.apache.kahadb.journal.Journal.read(Journal.java:604)
        	at org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:879)
        	at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:1030)
        	at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:558)
        	at org.apache.kahadb.page.Transaction.execute(Transaction.java:769)
        	at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:547)
        	at org.apache.activemq.store.ProxyMessageStore.recoverNextMessages(ProxyMessageStore.java:106)
        	at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:97)
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:274)
        	... 9 more
        

        This all happens during an 8 second period. Then... about 13 minutes later we see some more:

        2013-02-13 17:59:55,562 | ERROR | Failed to page in more queue messages  | org.apache.activemq.broker.region.Queue | ActiveMQ BrokerService[habroker] Task-38
        java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Invalid location: 125:11971180, : java.lang.NegativeArraySizeException
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.hasNext(AbstractStoreCursor.java:150)
        	at org.apache.activemq.broker.region.cursors.StoreQueueCursor.hasNext(StoreQueueCursor.java:131)
        	at org.apache.activemq.broker.region.Queue.doPageInForDispatch(Queue.java:1766)
        	at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1993)
        	at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1486)
        	at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:129)
        	at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:47)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        	at java.lang.Thread.run(Thread.java:679)
        Caused by: java.lang.RuntimeException: java.io.IOException: Invalid location: 125:11971180, : java.lang.NegativeArraySizeException
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:277)
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.hasNext(AbstractStoreCursor.java:147)
        	... 9 more
        Caused by: java.io.IOException: Invalid location: 125:11971180, : java.lang.NegativeArraySizeException
        	at org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:92)
        	at org.apache.kahadb.journal.Journal.read(Journal.java:604)
        	at org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:879)
        	at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:1030)
        	at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:558)
        	at org.apache.kahadb.page.Transaction.execute(Transaction.java:769)
        	at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:547)
        	at org.apache.activemq.store.ProxyMessageStore.recoverNextMessages(ProxyMessageStore.java:106)
        	at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:97)
        	at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:274)
        	... 10 more
        

        We haven't seen any more errors since the one above, about 6 hours ago. I just pushed 1000 messages through and there were no errors; things seem normal.

        We are not currently sure what may have happened to our SAN or connectivity during this time. We are still analyzing. Will attempt to recreate to understand the best way to mitigate this.

        Rob Waite added a comment -

        I was looking at the source for JDBCIOExceptionHandler and it calls setStopStartConnectors(true) from the base class DefaultIOExceptionHandler.

        It seems like the DefaultIOExceptionHandler already has code that would attempt to restart all the connectors. Don't know if this would resolve the issue but seems like we might want to either set stopStartConnectors to true by default or allow it to get set in activemq.xml.

        Haven't pulled the repo yet... hopefully the history of this file doesn't indicate that setting this to true is buggy..

        Rob Waite added a comment -

        We made a build of 5.7.0 with the stopStartConnectors flag set to true in DefaultIOExceptionHandler. Locally, I used a thumb drive as the data directory, got producers and consumers working, and then yanked it. We reproduced a stack trace a bit more similar to the one above, but still a bit different.

        2013-02-14 11:16:16,032 | ERROR | KahaDB failed to store to Journal | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Transport: tcp:///0:0:0:0:0:0:0:1:54836@61616
        java.io.SyncFailedException: sync failed
                        at java.io.FileDescriptor.sync(Native Method)
                        at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:367)
                        at org.apache.kahadb.journal.DataFileAppender$1.run(DataFileAppender.java:188)
        2013-02-14 11:16:16,035 | INFO  | Initiating stop/restart of broker transport due to IO exception, java.io.SyncFailedException: sync failed | org.apache.activemq.util.DefaultIOExceptionHandler | ActiveMQ Transport: tcp:///0:0:0:0:0:0:0:1:54836@61616
        java.io.SyncFailedException: sync failed
                        at java.io.FileDescriptor.sync(Native Method)
                        at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:367)
                        at org.apache.kahadb.journal.DataFileAppender$1.run(DataFileAppender.java:188)
        2013-02-14 11:16:16,041 | INFO  | waiting for broker persistence adapter checkpoint to succeed before restarting transports | org.apache.activemq.util.DefaultIOExceptionHandler | restart transport connectors post IO exception
        2013-02-14 11:16:16,062 | WARN  | Transport Connection to: tcp://0:0:0:0:0:0:0:1:54836 failed: java.io.EOFException | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///0:0:0:0:0:0:0:1:54836@61616
        2013-02-14 11:16:16,268 | ERROR | KahaDB failed to store to Journal | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Transport: tcp:///0:0:0:0:0:0:0:1:54848@61616
        java.io.IOException: The volume for a file has been externally altered so that the opened file is no longer valid
                        at java.io.RandomAccessFile.readBytes(Native Method)
                        at java.io.RandomAccessFile.read(RandomAccessFile.java:338)
                        at java.io.RandomAccessFile.readFully(RandomAccessFile.java:397)
                        at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
                        at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
        
        

        Then we got a loop trying to get a checkpoint before restarting transports:

        2013-02-14 11:16:21,046 | INFO  | KahaDB: Recovering checkpoint thread after death | org.apache.activemq.store.kahadb.MessageDatabase | restart transport connectors post IO exception
        2013-02-14 11:16:21,049 | INFO  | waiting for broker persistence adapter checkpoint to succeed before restarting transports | org.apache.activemq.util.DefaultIOExceptionHandler | restart transport connectors post IO exception
        2013-02-14 11:16:26,055 | ERROR | Checkpoint failed | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal Checkpoint Worker
        java.io.IOException: The volume for a file has been externally altered so that the opened file is no longer valid
                        at java.io.RandomAccessFile.write(Native Method)
                        at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
                        at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1077)
                        at org.apache.kahadb.page.PageFile.flush(PageFile.java:515)
                        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1313)
                        at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:769)
                        at org.apache.kahadb.page.Transaction.execute(Transaction.java:769)
                        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:767)
                        at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:323)
        2013-02-14 11:16:26,064 | INFO  | waiting for broker persistence adapter checkpoint to succeed before restarting transports | org.apache.activemq.util.DefaultIOExceptionHandler | restart transport connectors post IO exception
        2013-02-14 11:16:31,069 | INFO  | KahaDB: Recovering checkpoint thread after death | org.apache.activemq.store.kahadb.MessageDatabase | restart transport connectors post IO exception
        
        

        So it would not automatically come back... I then restarted AMQ and got the following:

        2013-02-14 11:33:46,398 | INFO  | Using Persistence Adapter: KahaDBPersistenceAdapter[E:\kahadb] | org.apache.activemq.broker.BrokerService | main
        2013-02-14 11:33:47,892 | INFO  | Corrupt journal records found in 'E:\kahadb\db-2.log' between offsets: 5576149..6343375 | org.apache.kahadb.journal.Journal | main
        2013-02-14 11:33:47,905 | INFO  | Corrupt journal records found in 'E:\kahadb\db-2.log' between offsets: 6351482..6356885 | org.apache.kahadb.journal.Journal | main
        2013-02-14 11:33:56,253 | INFO  | KahaDB is version 4 | org.apache.activemq.store.kahadb.MessageDatabase | main
        2013-02-14 11:33:56,373 | INFO  | Recovering from the journal ... | org.apache.activemq.store.kahadb.MessageDatabase | main
        2013-02-14 11:33:56,643 | INFO  | Recovery replayed 478 operations from the journal in 0.292 seconds. | org.apache.activemq.store.kahadb.MessageDatabase | main
        2013-02-14 11:33:56,702 | INFO  | Scheduler using directory: E:\habroker\scheduler | org.apache.activemq.broker.scheduler.SchedulerBroker | main
        2013-02-14 11:33:56,726 | INFO  | Apache ActiveMQ 5.7.0 (habroker, ID:kpc-eng-d27-55891-1360870436389-0:1) is starting | org.apache.activemq.broker.BrokerService | main
        2013-02-14 11:33:56,748 | INFO  | ignoring zero length, partially initialised journal data file: db-1.log number = 1 , length = 0 | org.apache.kahadb.journal.Journal | main
        2013-02-14 11:33:57,343 | INFO  | JobSchedulerStore:E:\habroker\scheduler started | org.apache.activemq.broker.scheduler.JobSchedulerStore | main
        

        This is a bit different than what we saw in our QA environment where it said it lost 33 messages.

        I looked at the history of the file and in 5.5.1 stopStartConnectors is still false. Not really sure what kicked off the retry in 5.5.1 vs. the broker stop in 5.7.0.

        Rob Waite added a comment -

        Also, today, which is about 18 hours later, we still see some errors popping up in our AMQ instance that seem related to this issue. Perhaps our data store is still corrupted/odd and we only see it for certain queues. Things in general do still seem to be working, but I don't trust the state it is in:

        2013-02-14 11:02:39,710 | ERROR | Failed to page in more queue messages  | org.apache.activemq.broker.region.Queue | ActiveMQ BrokerService[habroker] Task-789
        java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Invalid location: 129:8716740, : java.lang.NegativeArraySizeException
                        at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.hasNext(AbstractStoreCursor.java:150)
                        at org.apache.activemq.broker.region.cursors.StoreQueueCursor.hasNext(StoreQueueCursor.java:131)
                        at org.apache.activemq.broker.region.Queue.doPageInForDispatch(Queue.java:1766)
                        at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1993)
                        at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1486)
                        at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:129)
                        at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:47)
                        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
                        at java.lang.Thread.run(Thread.java:679)
        Caused by: java.lang.RuntimeException: java.io.IOException: Invalid location: 129:8716740, : java.lang.NegativeArraySizeException
                        at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:277)
                        at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.hasNext(AbstractStoreCursor.java:147)
                        ... 9 more
        Caused by: java.io.IOException: Invalid location: 129:8716740, : java.lang.NegativeArraySizeException
                        at org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:92)
                        at org.apache.kahadb.journal.Journal.read(Journal.java:604)
                        at org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:879)
                        at org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:1030)
                        at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:558)
                        at org.apache.kahadb.page.Transaction.execute(Transaction.java:769)
                        at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:547)
                        at org.apache.activemq.store.ProxyMessageStore.recoverNextMessages(ProxyMessageStore.java:106)
                        at org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:97)
                        at org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:274)
                        ... 10 more
        
        
        Claus Ibsen added a comment -

        Any update on this ticket? There have been many fixes since the 5.5.1 release and we are closing in on 5.9.
        So we need to know if this is resolved already or should be moved to 5.10.

        Rob Waite added a comment -

        The most recent stack traces occurred in 5.7.0. I have been watching commits and have not seen anything that seemed like it would have addressed this.

        I am pretty sure that losing the SAN or ability to write to disk is not handled correctly by the KahaDB-related code. I have seen LevelDB commits. Is there an opinion among the AMQ devs on whether KahaDB code will be deprecated in favor of another store?

        Claus Ibsen added a comment -

        Yeah, over time LevelDB is intended to be the preferred choice over KahaDB, though it takes time for it to become feature complete and to stabilize in the broker.

        Claus Ibsen made changes -
        Fix Version/s 5.10.0 [ 12324950 ]
        Fix Version/s 5.9.0 [ 12323932 ]
        Gary Tully added a comment -

        @Rob Waite - do you have a scenario that is known to fail, a little test case? It is possible to configure the IOExceptionHandler:

        <ioExceptionHandler>
          <defaultIOExceptionHandler stopStartConnectors="true" />
        </ioExceptionHandler>

        There may be an easy fix if there is a simple test case. The shutdown logic is better on trunk (5.9 snapshot) and we auto-recreate a corrupt index on restart, so things could well be better. The original test case for the IOExceptionHandler revolved around disk-full and disk-not-full recovery. Your scenario seems a little more random. But if we can narrow it down or break it down, there may be some simple tweaks we can do to improve.

        Rob Waite made changes -
        Attachment AMQ-3725-10112013.txt [ 12608043 ]
        Rob Waite added a comment (edited) -

        So I reattempted with the latest snapshot (apache-activemq-5.9-20131011.032953-115-bin) and it did seem a lot better. My test is really just to make a producer that sends messages in a tight loop and to use a USB thumbdrive for the data directory. Then, while the producer is writing as fast as it can, I yank out the thumbdrive, wait a bit, and put it back in and wait for AMQ to handle it.

        We have seen issues with our SAN becoming unavailable and seeing very similar stack traces.

        With the new snapshot... it seems like it recovers most of the time. Twice (in maybe 15 minutes of testing) it would not start back up. The first time it seemed like the USB drive "data" folder got corrupted... I couldn't read the data folder from Windows (it would say it was corrupted). I reformatted it and tried again, and the most recent error seemed like it would recover, but then a very vague null pointer error happened.

        Attached is the output (AMQ-3725-10112013.txt)... at the end you can see it tried to restart and then a NullPointerException occurred and it halted (line 501).

        I know a thumbdrive would not be used in prod... but the exceptions seemed very similar to what we have seen with our NFS store.

        Overall, it is much more resilient than 5.7.0. Of the maybe 20 times I tried... 18 times it would restart automatically.

        Dejan Bosanac added a comment -

        Hi,

        I just pushed a change that should help with this scenario. I also tested it with the USB drive and it seems that now KahaDB recovers properly. I also used a bit longer resume period (30 sec):

        <ioExceptionHandler>
          <defaultIOExceptionHandler stopStartConnectors="true" resumeCheckSleepPeriod="30000" />
        </ioExceptionHandler>

        so I can have a bit more time to unplug the drive and plug it back in before the broker tries to recreate the folder.

        I started a new snapshot build, but I'm not sure if it's gonna be built soon. You can build it locally anyways. Test it out and let me know if the problem is still there.

        Dejan Bosanac added a comment -

        Here's a snapshot containing a fix, to be tested: https://repository.apache.org/content/repositories/snapshots/org/apache/activemq/apache-activemq/5.10-SNAPSHOT/apache-activemq-5.10-20131101.033855-13-bin.tar.gz

        Hadrian Zbarcea made changes -
        Fix Version/s 5.9.1 [ 12326455 ]
        Claus Ibsen added a comment -

        Did anyone test the SNAPSHOT with the fix?

        Claus Ibsen made changes -
        Assignee Dejan Bosanac [ dejanb ]
        Claus Ibsen added a comment -

        Dejan Bosanac should we assume this is fixed and close the ticket?

        Hadrian Zbarcea made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Dejan Bosanac
          • Reporter:
            Jason Sherman
          • Votes:
            3
          • Watchers:
            11
