ActiveMQ
  1. ActiveMQ
  2. AMQ-3725

Kahadb error during SAN failover delayed write - Allow kahaDB to recover in a similar manner as the JDBC store using the IOExceptionHandler

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.1
    • Fix Version/s: 5.9.1, 5.10.0
    • Component/s: Message Store
    • Labels:
      None

      Description

      An issue can arise that causes the broker to terminate when using kahaDB with a SAN, when the SAN fails over. In this case the failover process is seamless however, on fail back there is a 2-3 sec delay where writes are blocked and the broker terminates. With the JDBC datastore a similar situation can be handled by using the IOExceptionHandler. However with kahaDB, when this same IOExceptionHandler is added it prevents the broker from terminating but kahaDB retains an invalid index.

       INFO | ActiveMQ JMS Message Broker (Broker1, ID:macbookpro-251a.home-56915-1328715089252-0:1) started
       INFO | jetty-7.1.6.v20100715
       INFO | ActiveMQ WebConsole initialized.
       INFO | Initializing Spring FrameworkServlet 'dispatcher'
       INFO | ActiveMQ Console at http://0.0.0.0:8161/admin
       INFO | ActiveMQ Web Demos at http://0.0.0.0:8161/demo
       INFO | RESTful file access application at http://0.0.0.0:8161/fileserver
       INFO | FUSE Web Console at http://0.0.0.0:8161/console
       INFO | Started SelectChannelConnector@0.0.0.0:8161
      ERROR | KahaDB failed to store to Journal
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.SyncFailedException: sync failed
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
      ERROR | Checkpoint failed
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.SyncFailedException: sync failed
      java.io.SyncFailedException: sync failed
      	at java.io.FileDescriptor.sync(Native Method)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:382)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
      ERROR | KahaDB failed to store to Journal
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
      ERROR | KahaDB failed to store to Journal
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       INFO | Ignoring IO exception, java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      java.io.FileNotFoundException: /Volumes/NAS-01/data/kahadb/db-1.log (No such file or directory)
      	at java.io.RandomAccessFile.open(Native Method)
      	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
      	at org.apache.kahadb.journal.DataFile.openRandomAccessFile(DataFile.java:70)
      	at org.apache.kahadb.journal.DataFileAppender.processQueue(DataFileAppender.java:324)
      	at org.apache.kahadb.journal.DataFileAppender$2.run(DataFileAppender.java:203)
       WARN | Transport failed: java.io.EOFException
       WARN | Transport failed: java.io.EOFException
       INFO | KahaDB: Recovering checkpoint thread after death
      ERROR | Checkpoint failed
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       INFO | Ignoring IO exception, java.io.IOException: Input/output error
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       INFO | KahaDB: Recovering checkpoint thread after death
      ERROR | Checkpoint failed
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       INFO | Ignoring IO exception, java.io.IOException: Input/output error
      java.io.IOException: Input/output error
      	at java.io.RandomAccessFile.write(Native Method)
      	at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)
      	at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:1006)
      	at org.apache.kahadb.page.PageFile.flush(PageFile.java:484)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1290)
      	at org.apache.activemq.store.kahadb.MessageDatabase$10.execute(MessageDatabase.java:768)
      	at org.apache.kahadb.page.Transaction.execute(Transaction.java:760)
      	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:766)
      	at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:315)
       WARN | Transport failed: java.io.EOFException
      

        Activity

        Jason Sherman created issue -
        Rob Davies made changes -
        Field Original Value New Value
        Fix Version/s 5.6.1 [ 12320766 ]
        Fix Version/s 5.x [ 12316331 ]
        Fix Version/s 5.6.0 [ 12317974 ]
        Dejan Bosanac made changes -
        Fix Version/s 5.7.0 [ 12321258 ]
        Fix Version/s 5.6.1 [ 12320766 ]
        Claus Ibsen made changes -
        Fix Version/s 5.8.0 [ 12323282 ]
        Fix Version/s 5.7.0 [ 12321258 ]
        Gary Tully made changes -
        Fix Version/s 5.9.0 [ 12323932 ]
        Fix Version/s 5.8.0 [ 12323282 ]
        Claus Ibsen made changes -
        Fix Version/s 5.10.0 [ 12324950 ]
        Fix Version/s 5.9.0 [ 12323932 ]
        Rob Waite made changes -
        Attachment AMQ-3725-10112013.txt [ 12608043 ]
        Rob Waite made changes -
        Comment [ So I reattempted with the latest snapshot (apache-activemq-5.9-20131011.032953-115-bin) and it did seem a lot better. My test is really just to make a producer that sends messages in a tight loop and to use a USB thumbdrive for the data directory. Then, while the producer is writing as fast as it can, I yank out the thumbdrive.

        We have seen issues with our SAN becoming unavailable and seeing very similar stack traces.

        With the new snapshot... it seems like it recovers most of the time. Twice (in maybe 15 minutes of testing) it would not start back up. The first time it seemed like the USB drive "data" folder got corrupted... I couldnt read from the data folder from windows (it woudl say it was corrupted). I reformatted it and tried again and the most recent error seemed like it would recover but then a very vague null pointer error happened.

        Attached is the output... at the end you can see it tried to restart and then a nullpointerexception occurred and it halted.

        I know a thumbdrive would not be used in prod... but the exceptions seemed very similar to what we have seen with our NFS store. ]
        Hadrian Zbarcea made changes -
        Fix Version/s 5.9.1 [ 12326455 ]
        Claus Ibsen made changes -
        Assignee Dejan Bosanac [ dejanb ]
        Hadrian Zbarcea made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Dejan Bosanac
            Reporter:
            Jason Sherman
          • Votes:
            3 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development