Uploaded image for project: 'Qpid'
  1. Qpid
  2. QPID-7685

[Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock potentially bringing down the broker

    Details

      Description

      This is the scenario:
      BDB message store.
      AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being recovered.
      An transacted application sends a message to Q1 but does not commit the transaction. This ends up creating an entry in the DeliveryDb through the following call chain:
      > AMQChannel#deliverCurrentMessageIfComplete
      --> AbstractExchange#send
      ----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage, ServerTransaction.Action)
      ------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
      --------> AbstractBDBMessageStore#enqueueMessage
      ----------> getDeliveryDb().put
      The transaction will hold a write lock on this new DB entry until the transaction is committed.
      If it now happens that this entry appears immediately after the last entry of Q2 in the database the recovery code will try to read that entry in AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource, MessageInstanceHandler):
      cursor.getNext(key, value, LockMode.DEFAULT)
      Acquisition of the read lock will timeout after je.lock.timeout (default: 500ms) and after exhaustion of the retries (hard coded to 5) a StoreException will be thrown bringing down the broker with the following stack trace (from a broker derived from 0.32):

      ########################################################################
      #
      # Unhandled Exception com.google.common.util.concurrent.UncheckedExecutionException: org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread pool-2-thread-1
      #
      # Exiting
      #
      ########################################################################
      com.google.common.util.concurrent.UncheckedExecutionException: org.apache.qpid.server.store.StoreException: Cannot visit messages
      	at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
      	at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
      	at org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
      	at org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
      	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
      	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
      	at org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
      	at org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
      	at org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	... 3 more
      Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired. Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c type=READ grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798 endTime=1488294680298
      Owners: [<LockInfo locker="879351915 243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
      Waiters: []
      
      	at com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
      	at com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
      	at com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
      	at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
      	at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
      	at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
      	at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
      	at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
      	at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
      	at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
      	at com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
      	at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
      	at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
      	at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
      	at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
      	at com.sleepycat.je.Cursor.search(Cursor.java:2553)
      	at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
      	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
      	... 8 more
      

      I believe the call to cursor.getSearchKeyRange(key,value, LockMode.DEFAULT) has the same problem just under different circumstances.

      I have not checked whether other similar defects exist.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              lorenz.quack Lorenz Quack
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: