Description
While doing long-term performance testing of our application (~10,000 messages/second), an exception is thrown after a few hours of operation. Here are the details of the exception:
IllegalMonitorStateException: attempt to unlock read lock, not locked by current thread
Stack trace:
`anonymous namespace'::Sync::tryReleaseShared(int unused=1) Line 205
decaf::util::concurrent::locks::AbstractQueuedSynchronizer::releaseShared(int arg=1) Line 1630 + 0x11 bytes
`anonymous namespace'::ReadLock::unlock() Line 660
activemq::core::kernels::ActiveMQSessionKernel::lookupConsumerKernel(...) Line 1336
activemq::core::ActiveMQSessionExecutor::dispatch(...) Line 151 + 0x47 bytes
activemq::core::ActiveMQSessionExecutor::iterate() Line 182
activemq::threads::DedicatedTaskRunner::run() Line 141 + 0x13 bytes
decaf::lang::Thread::run() Line 143
After some debugging, I identified a code defect that appears to be the cause of our problem. In class decaf::util::concurrent::locks::ReentrantReadWriteLock, the class member "cachedHoldCounter" is used to optimize performance. However, that member is accessed concurrently by multiple threads, and modifications to it are not atomic, so a thread can read a partially updated member (i.e. the count of thread #2 with the pointer to thread #1). When that happens, the lock bookkeeping becomes inconsistent and we end up with strange behavior (e.g. waiting indefinitely for the lock).
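To make the suspected interleaving concrete, here is a minimal sketch that replays it deterministically in a single thread. It is not the Decaf code: the HoldCounter struct, its ownerId field (standing in for the thread pointer), and the functions replayTornUpdate and cacheLooksMine are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the cached hold counter: two fields that are
// written one after the other, not as a single atomic unit.
struct HoldCounter {
    int count;    // number of read holds
    int ownerId;  // stands in for the pointer to the owning thread
};

// Replays, in one thread, an interleaving that two real threads could
// produce. Returns the value that the reader (thread #1) observes.
HoldCounter replayTornUpdate() {
    HoldCounter cached = {1, 1};  // thread #1 holds one read lock

    // Thread #2 starts caching its own counter (count = 3, owner = 2) ...
    cached.count = 3;
    // ... and is preempted before it writes the owner field.

    // Thread #1 reads the cache now: its own id, but thread #2's count.
    HoldCounter seen = cached;

    // Thread #2 resumes and finishes the update.
    cached.ownerId = 2;
    return seen;
}

// Reader-side check: does the cached entry appear to belong to this thread?
bool cacheLooksMine(const HoldCounter& hc, int myId) {
    return hc.ownerId == myId;
}
```

In this replay, thread #1 accepts the cached entry as its own because the owner still matches, yet the count it then works with belongs to thread #2. Any decrement or zero check based on that count would be wrong, which is consistent with the exception above.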
I wrote a unit test to reproduce the problem (see attachment). However, since this is a race condition, it may take a few runs to reproduce.
When I commented out the cachedHoldCounter-related code in ReentrantReadWriteLock (i.e. always go through the ThreadLocal), the problem appears to go away.
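The idea behind that workaround can be sketched as follows: keep the read-hold count only in thread-local storage, so no thread ever reads a counter written by another thread. This is an illustrative sketch, not the Decaf implementation. It assumes a C++11 compiler (thread_local, std::thread) and tracks a single lock only; ReadHoldTracker and runWorkers are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical per-thread bookkeeping of read holds for ONE lock.
// There is no shared cache, so there is nothing to tear.
class ReadHoldTracker {
public:
    void acquired() { ++holds(); }

    void released() {
        int& h = holds();
        if (h == 0) {
            throw std::logic_error(
                "attempt to unlock read lock, not locked by current thread");
        }
        --h;
    }

    int current() { return holds(); }

private:
    // Each thread gets its own counter, created on first use.
    static int& holds() {
        thread_local int count = 0;
        return count;
    }
};

// Runs several threads that each record and release holds repeatedly.
// Returns how many threads saw the "not locked by current thread" error.
int runWorkers(int threadCount, int iterations) {
    std::atomic<int> failures(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&failures, iterations]() {
            ReadHoldTracker tracker;
            try {
                for (int i = 0; i < iterations; ++i) {
                    tracker.acquired();
                    tracker.released();
                }
            } catch (const std::logic_error&) {
                ++failures;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return failures.load();
}
```

Calling runWorkers(4, 10000) should report zero failures, since every thread only ever touches its own counter. The trade-off is the one the cache was meant to avoid: every lock and unlock pays for a thread-local lookup.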