Issue Details (XML | Word | Printable)

Key: DERBY-3786
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Knut Anders Hatlen
Reporter: Kristian Waagan
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Assert failure in CacheEntry.unkeepForRemove when running stress.multi

Created: 21/Jul/08 08:15 AM   Updated: 04/May/09 06:22 PM
Return to search
Component/s: Services
Affects Version/s: 10.5.1.1
Fix Version/s: 10.4.2.0, 10.5.1.1

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works d3786.java 2008-07-22 11:52 AM Knut Anders Hatlen 2 kB
File Licensed for inclusion in ASF works remove.diff 2008-07-22 03:17 PM Knut Anders Hatlen 2 kB
File Licensed for inclusion in ASF works remove_v2.diff 2008-08-20 08:17 PM Knut Anders Hatlen 4 kB
Environment: 32 hardware execution threads
Issue Links:
Duplicate
 

Resolution Date: 23/Aug/08 07:48 AM


 Description  « Hide
When running stress.multi on a machine with 32 hardware execution threads, I observed the following assert failure in two independent runs:
-----
2008-07-18 02:15:14.415 GMT Thread[Thread-8,5,workers] (XID = 94699), (SESSIONID = 16923), (DATABASE = mydb), (DRDAID = null), Failed Statement is: insert into a values (1)
org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED
        at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:98)
        at org.apache.derby.impl.services.cache.CacheEntry.unkeepForRemove(CacheEntry.java:217)
        at org.apache.derby.impl.services.cache.ConcurrentCache.remove(ConcurrentCache.java:446)
        at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.removeStatement(GenericLanguageConnectionContext.java:898)
        at org.apache.derby.impl.sql.GenericStatement.prepMinion(GenericStatement.java:516)
        at org.apache.derby.impl.sql.GenericStatement.prepare(GenericStatement.java:88)
        at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.prepareInternalStatement(GenericLanguageConnectionContext.java:794)
        at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:606)
        at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:555)
        at org.apache.derby.impl.tools.ij.ij.executeImmediate(ij.java:329)
        at org.apache.derby.impl.tools.ij.utilMain.doCatch(utilMain.java:508)
        at org.apache.derby.impl.tools.ij.utilMain.runScriptGuts(utilMain.java:350)
        at org.apache.derby.impl.tools.ij.utilMain.go(utilMain.java:248)
        at org.apache.derby.impl.tools.ij.mtTestCase.runMe(mtTestCase.java:246)
        at org.apache.derby.impl.tools.ij.mtTester.run(mtTester.java:91)
        at java.lang.Thread.run(Thread.java:619)
Cleanup action completed
-----


In stress.log:
-----
Tester8: insert2 Fri Jul 18 04:15:12 CEST 2008
Tester6: TERMINATING due to unexpected error:
FatalException: XJ001: Java exception: 'ASSERT FAILED: org.apache.derby.shared.common.sanity.AssertFailure'.
Tester1: SELECT2 Fri Jul 18 04:15:13 CEST 2008
Tester8: stopping on request after 820 iterations
Tester10: stopping on request after 859 iterations
Tester1: stopping on request after 847 iterations
Tester7: TERMINATING due to unexpected error:
FatalException: XJ001: Java exception: 'ASSERT FAILED: org.apache.derby.shared.common.sanity.AssertFailure'.
Tester9: stopping on request after 722 iterations
Tester3: stopping on request after 880 iterations
Tester5: stopping on request after 858 iterations
Tester4: stopping on request after 839 iterations
WARNING: testers didn't die willingly, so I'm going to kill 'em.
        This may result in connection resources that aren't cleaned up
        (e.g. you may see problems in the final script run with deadlocks).
-----

A few runs on a similar but slightly slower machine didn't experience the same failure, so the bug is likely timing dependent.
I'll perform some more runs and see how hard it is to reproduce.

I haven't investigated what will happen in an insane build.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Knut Anders Hatlen added a comment - 21/Jul/08 10:38 AM
It looks like one thread calls ConcurrentCache.remove() while another thread is blocked inside the same method. According to the javadoc for CacheManager.remove(), callers of remove() must synchronize so that no two threads try to remove the same object concurrently:

/**
Delete and remove an object from the cache. It is up to the user of the cache
to provide synchronization of some form that ensures that only one caller
executes remove() on a cached object.
<BR>

Knut Anders Hatlen added a comment - 22/Jul/08 11:52 AM
Attaching code that reproduces the assert failure on my machine (dual core). It creates a table without committing in one transaction and starts 100 threads that prepare a query against that table in other transactions. The preparing of the query fails with a timeout exception, which leads to GenericLanguageConnectionContext.removeStatement() and CacheManager.remove() being called, and there seems to be a race condition between the threads that time out at approximately the same time.

I also tried to run it on trunk in insane mode. Then the assert error didn't occur, of course, but more and more threads became idle, until all the threads were blocked.

Knut Anders Hatlen added a comment - 22/Jul/08 11:56 AM
Tried the repro with JDK 1.4 to see if we get the same behaviour with the old cache manager (Clock) as with the new one (ConcurrentCache). I then got this assert failure:

java.sql.SQLException: Java exception: 'ASSERT FAILED: org.apache.derby.shared.common.sanity.AssertFailure'.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:87)
        at org.apache.derby.impl.jdbc.Util.javaException(Util.java:244)
        at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:403)
        at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346)
        at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2192)
        at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(EmbedPreparedStatement.java:146)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement20.<init>(EmbedPreparedStatement20.java:82)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement30.<init>(EmbedPreparedStatement30.java:63)
        at org.apache.derby.jdbc.Driver30.newEmbedPreparedStatement(Driver30.java:99)
        at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(EmbedConnection.java:1533)
        at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(EmbedConnection.java:1361)
        at d3786$1.run_(d3786.java:26)
        at d3786$1.run(d3786.java:15)
        at java.lang.Thread.run(Thread.java:534)
Caused by: org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED
        at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:98)
        at org.apache.derby.impl.services.cache.CachedItem.unkeep(CachedItem.java:144)
        at org.apache.derby.impl.services.cache.Clock.remove(Clock.java:603)
        at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.removeStatement(GenericLanguageConnectionContext.java:898)
        at org.apache.derby.impl.sql.GenericStatement.prepMinion(GenericStatement.java:516)
        at org.apache.derby.impl.sql.GenericStatement.prepare(GenericStatement.java:88)
        at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.prepareInternalStatement(GenericLanguageConnectionContext.java:794)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(EmbedPreparedStatement.java:128)
        ... 8 more


The failing assert is this one in CachedItem.unkeep():

if (SanityManager.DEBUG) {
SanityManager.ASSERT(keepCount >= 0);
}

Knut Anders Hatlen added a comment - 22/Jul/08 12:02 PM
Looking at the stack trace from JDK 1.4, I think this issue is the same as DERBY-1053.

Knut Anders Hatlen added a comment - 22/Jul/08 03:17 PM
The code that calls removeStatement() uses a synchronized flag and
wait()+notifyAll() to prevent multiple threads from recompiling the
same statement concurrently. The code synchronizes on the
GenericPreparedStatement instance that is being compiled. What
happens, is basically:

  1. fetch the prepared statement from the statement cache, possibly
  creating a new one if it isn't in the cache

  2. release the statement (which means others can removed it from the
  cache)

  3. if someone else is recompiling the prepared statement, wait until
  they have finished

  4. recompile the statement and block others until finished

  5. if an error happens during recompile

    a) check if it exists in the cache

    b) if it is in the cache, remove it

If two threads do this concurrently, they'll both get a reference to
the same prepared statement in the cache in step (1). Then one of the
threads is allowed to recompile the statement, whereas the other
thread is blocked in (3). In the repro, the thread that is recompiling
the statement fails with a lock timeout, so it removes the statement
from the cache. The other thread is allowed to continue, and also
fails while recompiling. The statement is no longer in the cache, so
it doesn't try to remove it, and everything is fine.

Now, consider the possibility that a third thread tries to prepare the
same SQL text right after the first thread has removed the prepared
statement from the cache. This means that thread 3 doesn't find the
statement in the cache and therefore creates a new prepared
statement. Since this is another instance of GenericPreparedStatement
than the one thread 1 and thread 2 are working on, thread 3 won't
detect that others are recompiling the same statement.

If the compilation succeeds, it is no problem that two different
GenericPreparedStatement instances are used for the same SQL
text. However, since also the third thread fails while compiling the
statement, and it is not synchronized with the second thread since
they don't have the same prepared statement, thread 2 and thread 3 may
end up calling removeStatement() concurrently. Although their prepared
statements are different, they'll have the same key in the statement
cache, so now thread 2 will find that the statement in fact is in the
cache and try to remove it. Thread 3 also finds it in the cache and
invokes CacheManager.remove() at the same time. This breaks the
contract of CacheManager.remove(), and bad things start happening
(assert failures and hangs).

I think there are two ways to fix this:

  1) In GenericLanguageConnectionContext.removeStatement(), only call
  remove() if the prepared statement in the cache is exactly the same
  as the prepared statement we tried to recompile. The patch
  remove.diff shows this approach. I haven't run the regression tests,
  but it seems to fix this problem (I don't see assert failures or
  hangs with the repro anymore).

  2) It may also be possible to fix it by not releasing the statement
  in step (2). Then it's not possible to remove the statement from the
  cache until it has been recompiled, and all the threads that try to
  recompile it are guaranteed to work on the same
  GenericPreparedStatement instance. Some extra logic would be needed
  to prevent deadlocks, as the call to remove() would block until no
  one holds that statement, and every other thread that is waiting to
  recompile the same statement will be holding it.

Knut Anders Hatlen added a comment - 22/Jul/08 08:30 PM
Derbyall and suites.All ran cleanly with remove.diff, so unless there are any objections to the approach, I'll create a new patch with some comments added that we can consider for commit.

Knut Anders Hatlen added a comment - 20/Aug/08 08:17 PM
Here's a new version of the patch with comments added. Derbyall and suites.All still run cleanly.

Knut Anders Hatlen added a comment - 22/Aug/08 08:58 AM
Committed revision 688027.

Since this is also a problem in older versions, I'm leaving the issue open until the fix has been back-ported to the 10.4 branch.

Knut Anders Hatlen added a comment - 23/Aug/08 07:48 AM
Merged to 10.4 with revision 688275.

Kristian Waagan added a comment - 27/Aug/08 08:27 AM
I ran stress.multi successfully 80 times on the same machine where I first saw this problem. I also ran the repro (d3786) for a while withouth failures. Thanks for the fix, Knut Anders.

I'm closing this issue and the duplicate.