|
Knut Anders Hatlen made changes - 16/Mar/09 04:51 PM
Good insight on the getCause() interaction with JUnit -- it would be great if there was
something simple we could do to make the client exceptions more automatically visible to generic software such as JUnit that knows how to unwind getCause but doesn't know about the DBMS-specific exception chaining techniques. Here's a patch that makes the client driver use initCause() on BatchUpdateExceptions, like the embedded driver already does. It also adds a test case which checks that the output from printStackTrace() on a BUE contains the string "duplicate key" if the batch job caused a violation of a unique constraint (without the patch, the test passed with embedded and failed with the client driver). No other tests have been run yet.
Knut Anders Hatlen made changes - 16/Mar/09 09:43 PM
All tests ran cleanly with the patch. Committed revision 755147. So now we just have to wait until the failure happens again and see what the underlying exception is.
I'm planning to backport this change to 10.5 since it improves the error reporting and has very low risk.
Committed the improved error reporting to 10.5 with revision 755156.
More detailed stack trace is available here:
http://dbtg.thresher.com/derby/test/tinderbox_trunk16/jvm1.6/testing/testlog/SunOS-5.10_i86pc-i386/755167-org.apache.derbyTesting.functionTests.suites.All_diff.txt Caused by: org.apache.derby.client.am.SqlException: Error for batch element #67: Java exception: 'org.apache.derby.impl.store.access.btree.WaitError: '. at org.apache.derby.client.am.Statement.completeExecute(Unknown Source) at org.apache.derby.client.net.NetStatementReply.parseEXCSQLSTTreply(Unknown Source) at org.apache.derby.client.net.NetStatementReply.readExecute(Unknown Source) at org.apache.derby.client.net.StatementReply.readExecute(Unknown Source) at org.apache.derby.client.net.NetPreparedStatement.readExecute_(Unknown Source) at org.apache.derby.client.am.PreparedStatement.readExecute(Unknown Source) ... 44 more The WaitError is probably thrown by BTreeController.comparePreviousRecord() when it tries to walk backwards in the B-tree and cannot latch the left sibling page without waiting (probably because post commit work is done on that page by a different thread simultaneously). This is an intended feature of the current implementation, I think. There are a couple of alternatives: * comparePreviousRecord() could release its latches and return RESCAN_REQUIRED when it detects a WaitError * A reimplementation of the nullable unique constraint insert code has been suggested [1] and it would probably fix this issue. Then the test would fail intermittently for a while, but it may be possible to rewrite the test to prevent it. (I think we could turn off auto-commit and run the test in one transaction. Then there won't be post-commit work done in parallel with the execution of the test, and I think it will still expose the bug that it was supposed to expose, but I would have to test that to be sure.) [1] <URL:http://mail-archives.apache.org/mod_mbox/db-derby-dev/200903.mbox/%3C49AEBA96.5010609@sbcglobal.net%3E> > I think we could turn off auto-commit and run the test in one
> transaction. Then there won't be post-commit work done in parallel > with the execution of the test, and I think it will still expose the > bug that it was supposed to expose, but I would have to test that to > be sure. I commented out the fix for the test and didn't see a failure, so if we turn off auto-commit the test will no longer expose the bug that it's supposed to expose. Increasing the number of iterations in testMixedInsertDelete() from 10 to 100 reliably reproduces the WaitError both with in embedded and client/server in my environment.
The attached patch d4097-1a-rescanOnWaitError.diff increases the number of iterations in the test case to more reliably reproduce the WaitError, and it modifies BTreeController.comparePreviousRecord() so that it releases all latches and returns RESCAN_REQUIRED when a WaitError is detected. This is not an ideal solution, as the thread that gets the WaitError may need many tries before it manages to latch the left sibling, and it will be constantly spending CPU until it manages to acquire the latch. Anyways, it should not happen very frequently, and I think a spin-wait loop is better than failing in this situation. (This is yet another reason to rewrite the nullable unique insert code as suggested by Mike in the thread mentioned in an earlier comment.)
NullableUniqueConstraintTest passed with this patch. Will start the rest of the regression test suite now.
Knut Anders Hatlen made changes - 17/Apr/09 12:53 PM
Knut Anders Hatlen made changes - 17/Apr/09 09:29 PM
Derbyall and suites.All ran cleanly. The patch is ready for review.
Knut Anders Hatlen made changes - 17/Apr/09 09:31 PM
Committed to trunk with revision 767143.
I guess this fix should be backported to 10.5, since it was seen a couple of times during release testing, and possibly also to 10.4, so I'm leaving the issue open.
Knut Anders Hatlen made changes - 21/Apr/09 01:45 PM
Committed to 10.5 with revision 767475.
Knut Anders Hatlen made changes - 22/Apr/09 11:55 AM
Committed to 10.4 with revision 767861.
Knut Anders Hatlen made changes - 23/Apr/09 09:03 AM
Myrna van Lunteren made changes - 04/May/09 06:21 PM
Ole Solberg made changes - 11/May/09 09:31 AM
Dag H. Wanvik made changes - 29/Jun/09 10:43 PM
Dag H. Wanvik made changes - 29/Jun/09 11:21 PM
Kathey Marsden made changes - 16/Jul/09 09:24 PM
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Removing the use of batch update is one alternative to expose the real error. However, that will change the timing and may hide this failure if it is timing-sensitive. Another alternative is to add a wrapper around executeBatch() and call getException() in the test. I have also noticed that in embedded mode, the underlying exception is available with getCause() and is therefore reported by JUnit, whereas the client driver (where we see this problem now) does not do that. So a third option is to make the client driver report these errors the same way as the embedded driver.