|
Deadlock after suggested patch, generated by testConcurrency unit test
Deadlock generated by testConcurrency
Unit test generating deadlock using dbcp testDriver
Like
I have attached a test case that will generate the deadlock using the dbcp test drivers. The loop is bounded, so deadlock may not always happen. Use ctrl-break (I think) under Windows or ctrl \ on unix command line to generate thread dump once the test case hangs. I am moving this out to 1.3 fix version, at which time we will have to change either a) the pool impl or b) the way we dbcp uses the pool (see http://issues.apache.org/jira/browse/DBCP-65#action_12451390 After looking at the deadlock traces and analyzing the code some more, I think that the root of the problem in this case is contention for the DriverManager. The DriverManager methods in the traces are static synchronized. What is going on is that direct client calls to DriverManager.getConnection immediately acquire the static lock on this class and then use PoolingDriver, which needs to lock the pool. The pool Timer locks the pool and tries to use the DriverManager to access the the jdbc driver to create a new concrete connection, but can't acquire the static lock on DriverManager. The order of locks here is beyond control of dbcp. So, unless and until the Timer mechanism of pool is reworked somehow, I don't think this can be fixed.
The workaround is to either disable the pool Timer or avoid using the DriverManager as a client with PoolingDriver. Should we make a Pool issue for this?
This is an architectural problem in commons-pool. GenericObjectPool.borrowObject() is synchronized so when it needs to make a new object, that method is called within a synchronized block. Here are the stack traces of the deal lock in our case:
main@1, priority=5, in group 'main', status: 'MONITOR' Timer-0@fc daemon, priority=5, in group 'main', status: 'MONITOR' This is a classic two lock dead lock. The main thread acquires the locks on DriveManager then GenericObjectPool, and at the same time the timer thread attempts to acquire them in the opposite order GenericObjectPool and then DriveManger. There are really only two ways to fix this 1) guarantee locks are always acquired in the same order or 2) eliminate one of the locks. The first one is not really possible as both objects are generally publicly accessible, and the second fix will require a complex change to the GenericObjectPool code. Attached is a patch that adds the Deadlock class, which has a main method that causes the dead lock. The patch also fixes the synchronization in PoolableConnectionFactory. I also added property to TesterDriver to cause the thread to sleep in getConnection so the dead lock can be reproduced. I'm going to attempt to write a new pool using the Java5 concurrent code which doesn't call makeObject inside of a synchronized block. Hi Dain, I agree that the evictor thread is problematic (it should go away and/or be replaced by a custom evictor thingy that gets scheduled externally), but it is not clear to me how you could make makeObject() non-synchronized. If you want to go for the usual create-and-CAS pattern you risk a potential double allocation, and pool is definitely used for situations where this would be very bad.
A multi-staged locking hierarchy (implemented with the new ReentrantReadWrite locks) could help us to the effect that you outlined wrt. the lock ordering. Entering readers (pool consumers) all would have to pass the global writer lock entrance; once someone is inside, the periodic evictor has to enqueue at the fron t door until he gets the write lock, and can proceed. Completely concurrent (lock-free) eviction would be much harder. A mutli-staged locking hierarchy would also enable us to properly decouple some of the other paths inside borrow/returnObject. For returnObject I have already submitted such a simple patch. This will take experimentation, but off the top of my head, I'd do somthing like this:
while(true) { Future future = this.allocationFutureAtomic.get(); The AllocationFuture would create new connection(s), add them to the idle pool, and then clear the future atomic. Of course this can be done with primitives, but it is more painful to code. It would be cleaner to use an executor to process the future, so it can potentially be done on another thread which isn't holding locks, and if the makeObject() hangs you, don't have to interrupt an application thread. All of this becomes more difficult when you try to implement a keyed pool due to double book keeping of a global count and per key count. Anyway, I'll experiment with some implementations and if I come up with anything good, I'll post it. BTW it isn't the evictor thread that is problematic per say (it could have other issues
Ok one more comment before I get lunch.... My sample code above is bad. It is a hard loop (duh). Instead of hard looping, the second check on the idleCollection could wait for a item to become available (easy with a concurrent queue). Which leave the possibility that someone sneaks in and steals the new connection we were waiting for
We've seen a deaklock in our env, when thread 1 holds lock A then wants lock B, thread 2 holds lock B then wants lock A. As other threads tried to obtain lock B they started to hang, until almost all threads were hung. Is this the same issue.
2LKDEADLOCKTHR Thread "Thread-5351" (0x7E9B46F0) I've encountered what appears to be the same deadlock when returning a connection to the pool.
commons-dbcp-1.2.2 Is there any work being done on this? This bug has been open for quite some time. Are there any workarounds to avoid this issue? Googling simply reveals other deadlocks with similar stack traces. Here is my stack trace: "Timer-0": (Sorry for all the edits!) It's funny, I've been using DBCP for years without having run across this issue once, and now I've had two separate applications run across the same deadlock within a week.
I have another deadlock which is identical to the one above, except that the thread closing the connection is a Servlet running in Tomcat, but the cause is the same. The Evictor thread runs, locks the GenericObjectPool, enters validateConnection and then attempts to lock a Poolable Connection. I've been analyzing the deadlock and it appears that the initial suggested workaround of setting minIdle to 0. In my previous comment, the application did have a minIdle setting of 1. However, in my latest deadlock, minIdle was set to zero. However, in both cases, the testWhileIdle is to true which appears to cause the Evictor to run. So to really avoid this issue, it appears that you should set minIdle to 0 AND turn off testWhileIdle. Or is minIdle really enough (and I am missing something?) I've been commenting on the wrong bug it appears, my deadlock case appears to be the same as
This is a new issue, though the basic problem of contention between the Evictor and client threads is the same. Turning off testWhileIdle or just setting timeBetweenEvictionRuns to the default (-1) so the evictor does not run should prevent this.
Looks to me like what is going on here is 1) Client thread ("SocketRequest-8") calls close on a connection from the pool. This locks the PoolableConnection. This is new because pool 1.4 reduced synchronization scope in returnObject, which used to be fully synchronized (opening the door for 3). We could revert to full synchronization of returnObject in pool, but that was a source of performance problems and I think the root cause here is really how DBCP uses the GOP Evictor without controlling client locks and we need to find a solution for that problem. Reverting to pool 1.3 will also eliminate this particular deadlock. Phil, thanks for the comment. Should I open a new bug with this particular deadlock scenario? Pool 1.4 could explain why I'm hitting it recently, I've only started using Pool 1.4 in the past month or so.
The new deadlock is definitely due to pool 1.4. Lets keep tracking the DBCP issue here, since the basic problem is the same. I will open a separate issue against pool.
The last deadlock above should be resolved by the fix applied in r 672097 for
I have attached a possible solution to this issue that removes the sync from addObject()
The upside is it should no longer be possible for the evcitor to be the source of a deadlock. The downside is a pool where (idle==0 && (active + 1)==maxActive && minIdle>0) may appear to be exhausted to a client if the evictor thread is in the process of creating the last object when the client calls borrowObject(). This is really only an issue if the pool is configured with WHEN_EXHAUSTED_FAIL. Assuming my patch is correct, the question becomes is it better to have the risk of a deadlock or better to have the risk of the edge case described above. My own view is that the deadlock is worse. Thoughts? Agree that if the patch works as designed, it should be applied. I am out of pocket for next few days but will analyze and test to verify. Need to make sure removing synch from addObject does not create possibility of maxActive exceeded.
If the above patch is viewed as OK then similar patches will be required for the other pools to fully resolve this issue.
I have reviewed and tested the patch and am +1 for applying it and resolving this issue, with closure pending pool 1.5 release.
The change in AbandonedTrace synchronization scope made to address I agree that GenericKeyedObjectPool needs to be similarly patched. I agree with Dain's comment on "lock holding open outcall..bound to cause deadlocks" and opened Just discovered that my proposed patch also fixes
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
bug is only when using PoolingDriver and a setting of minIdle > 0 in the
GenericObjectPool configuration.
Suggested interim workarounds:
a) set minIdle = 0 in your configuration for the GenericObjectPool
b) if you know you are connection pooling within the application code, retrieve
a connection as follows:
ObjectPool connectionPool = poolingDriver.getConnectionPool(name);
Connection conn = (Connection)connectionPool.borrowObject();
This second method will allow you to have minIdle > 0 while avoiding the
deadlock code path between threads.