|
Attaching code that reproduces the assert failure on my machine (dual core). It creates a table without committing in one transaction and starts 100 threads that prepare a query against that table in other transactions. The preparing of the query fails with a timeout exception, which leads to GenericLanguageConnectionContext.removeStatement() and CacheManager.remove() being called, and there seems to be a race condition between the threads that time out at approximately the same time.
I also tried to run it on trunk in insane mode. Then the assert error didn't occur, of course, but more and more threads became idle, until all the threads were blocked. Tried the repro with JDK 1.4 to see if we get the same behaviour with the old cache manager (Clock) as with the new one (ConcurrentCache). I then got this assert failure:
java.sql.SQLException: Java exception: 'ASSERT FAILED: org.apache.derby.shared.common.sanity.AssertFailure'. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45) at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:87) at org.apache.derby.impl.jdbc.Util.javaException(Util.java:244) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:403) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2192) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(EmbedPreparedStatement.java:146) at org.apache.derby.impl.jdbc.EmbedPreparedStatement20.<init>(EmbedPreparedStatement20.java:82) at org.apache.derby.impl.jdbc.EmbedPreparedStatement30.<init>(EmbedPreparedStatement30.java:63) at org.apache.derby.jdbc.Driver30.newEmbedPreparedStatement(Driver30.java:99) at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(EmbedConnection.java:1533) at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(EmbedConnection.java:1361) at d3786$1.run_(d3786.java:26) at d3786$1.run(d3786.java:15) at java.lang.Thread.run(Thread.java:534) Caused by: org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:98) at org.apache.derby.impl.services.cache.CachedItem.unkeep(CachedItem.java:144) at org.apache.derby.impl.services.cache.Clock.remove(Clock.java:603) at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.removeStatement(GenericLanguageConnectionContext.java:898) at org.apache.derby.impl.sql.GenericStatement.prepMinion(GenericStatement.java:516) at org.apache.derby.impl.sql.GenericStatement.prepare(GenericStatement.java:88) at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.prepareInternalStatement(GenericLanguageConnectionContext.java:794) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(EmbedPreparedStatement.java:128) ... 8 more The failing assert is this one in CachedItem.unkeep(): if (SanityManager.DEBUG) { SanityManager.ASSERT(keepCount >= 0); } Looking at the stack trace from JDK 1.4, I think this issue is the same as
The code that calls removeStatement() uses a synchronized flag and
wait()+notifyAll() to prevent multiple threads from recompiling the same statement concurrently. The code synchronizes on the GenericPreparedStatement instance that is being compiled. What happens, is basically: 1. fetch the prepared statement from the statement cache, possibly creating a new one if it isn't in the cache 2. release the statement (which means others can removed it from the cache) 3. if someone else is recompiling the prepared statement, wait until they have finished 4. recompile the statement and block others until finished 5. if an error happens during recompile a) check if it exists in the cache b) if it is in the cache, remove it If two threads do this concurrently, they'll both get a reference to the same prepared statement in the cache in step (1). Then one of the threads is allowed to recompile the statement, whereas the other thread is blocked in (3). In the repro, the thread that is recompiling the statement fails with a lock timeout, so it removes the statement from the cache. The other thread is allowed to continue, and also fails while recompiling. The statement is no longer in the cache, so it doesn't try to remove it, and everything is fine. Now, consider the possibility that a third thread tries to prepare the same SQL text right after the first thread has removed the prepared statement from the cache. This means that thread 3 doesn't find the statement in the cache and therefore creates a new prepared statement. Since this is another instance of GenericPreparedStatement than the one thread 1 and thread 2 are working on, thread 3 won't detect that others are recompiling the same statement. If the compilation succeeds, it is no problem that two different GenericPreparedStatement instances are used for the same SQL text. However, since also the third thread fails while compiling the statement, and it is not synchronized with the second thread since they don't have the same prepared statement, thread 2 and thread 3 may end up calling removeStatement() concurrently. Although their prepared statements are different, they'll have the same key in the statement cache, so now thread 2 will find that the statement in fact is in the cache and try to remove it. Thread 3 also finds it in the cache and invokes CacheManager.remove() at the same time. This breaks the contract of CacheManager.remove(), and bad things start happening (assert failures and hangs). I think there are two ways to fix this: 1) In GenericLanguageConnectionContext.removeStatement(), only call remove() if the prepared statement in the cache is exactly the same as the prepared statement we tried to recompile. The patch remove.diff shows this approach. I haven't run the regression tests, but it seems to fix this problem (I don't see assert failures or hangs with the repro anymore). 2) It may also be possible to fix it by not releasing the statement in step (2). Then it's not possible to remove the statement from the cache until it has been recompiled, and all the threads that try to recompile it are guaranteed to work on the same GenericPreparedStatement instance. Some extra logic would be needed to prevent deadlocks, as the call to remove() would block until no one holds that statement, and every other thread that is waiting to recompile the same statement will be holding it. Derbyall and suites.All ran cleanly with remove.diff, so unless there are any objections to the approach, I'll create a new patch with some comments added that we can consider for commit.
Here's a new version of the patch with comments added. Derbyall and suites.All still run cleanly.
Committed revision 688027.
Since this is also a problem in older versions, I'm leaving the issue open until the fix has been back-ported to the 10.4 branch. Merged to 10.4 with revision 688275.
I ran stress.multi successfully 80 times on the same machine where I first saw this problem. I also ran the repro (d3786) for a while withouth failures. Thanks for the fix, Knut Anders.
I'm closing this issue and the duplicate. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
/**
Delete and remove an object from the cache. It is up to the user of the cache
to provide synchronization of some form that ensures that only one caller
executes remove() on a cached object.
<BR>