Ole Solberg made changes - 11/Jun/08 06:53 AM
Ole Solberg added a comment - 11/Jun/08 06:55 AM
Full test run log.
Ole Solberg made changes - 11/Jun/08 06:55 AM
I saw a similar issue in 10.4.2.0 -> 10.5.1.0 hard upgrade testing, which I am assuming is another manifestation of this bug.
1) testReplication_Local_StateTest_part1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1)junit.framework.ComparisonFailure: Unexpected SQL state. expected:<...20> but was:<...07>
    at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:760)
    at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:809)
    at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1381)
    at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver(ReplicationRun.java:1302)
    at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1.testReplication_Local_StateTest_part1(ReplicationRun_Local_StateTest_part1.java:160)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
    at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
    at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
    at junit.extensions.TestSetup.run(TestSetup.java:23)
Caused by: java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.
    at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(Unknown Source)
    at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
    at org.apache.derby.jdbc.ClientDriver.connect(Unknown Source)
    at java.sql.DriverManager.getConnection(DriverManager.java:316)
    at java.sql.DriverManager.getConnection(DriverManager.java:273)
    at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1368)
    ... 28 more
Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.
    at org.apache.derby.client.am.Connection.completeSqlca(Unknown Source)
    at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(Unknown Source)
    at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(Unknown Source)
    at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(Unknown Source)
    at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(Unknown Source)
    at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(Unknown Source)
    at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(Unknown Source)
    at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(Unknown Source)
    at org.apache.derby.client.net.NetConnection.flowConnect(Unknown Source)
    at org.apache.derby.client.net.NetConnection.<init>(Unknown Source)
    at org.apache.derby.client.net.NetConnection40.<init>(Unknown Source)
    at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(Unknown Source)
    ... 32 more
Myrna van Lunteren made changes - 04/May/09 06:22 PM
Ole Solberg made changes - 15/May/09 09:36 AM
Dag H. Wanvik made changes - 15/May/09 01:57 PM
This log from the master's log file shows what happens. The output is produced by the attached patch (traceLogShipping):

@1242410514203 Sending done
@1242410514205 >= FI_HIGH
@1242410514206 >= FI_HIGH
@1242410514204 Sending
@1242410514208 >= FI_HIGH
@1242410514211 >= FI_HIGH
@1242410514216 log buffer full, try to force flush
@1242410514216 forceflush
@1242410514265 Sending done
@1242410514266 log buffer full, force failed
---- BEGIN REPLICATION ERROR MESSAGE (15.05.09 20:01) ----
@1242410514267 Sending
@1242410514286 Sending done
@1242410514286 Sending
Exception occurred during log shipping.
org.apache.derby.impl.store.replication.buffer.LogBufferFullException
    at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.switchDirtyBuffer(ReplicationLogBuffer.java:357)

The asynchronous log shipper basically runs this loop:

    while (true) {
        ship a log chunk
        if ! <things are busy>
            wait(shippingInterval)
        fi
    }

From derby.log, we see that the sending of a chunk starts at instant 4204 and completes at instant 4265. In the meantime, the user thread is busy writing log using ReplicationLogBuffer.appendLog.

The buffer is getting full, as seen at "instant 4208: >= FI_HIGH" (ReplicationLogBuffer calls switchDirtyBuffer, which returns a free buffer if there is one, and then calls MasterController.workToDo to make sure the shipping thread knows). Notice that the send is still waiting to complete. Then another log write happens, and again we see the indication that we are close to having 0 free buffers (instant 4211: >= FI_HIGH). This is the second time workToDo is called.

In both cases, AsynchronousLogShipper.workToDo does a notify to try to wake up the shipping thread, on the assumption that it may be sleeping in the "wait(shippingInterval)" seen above. But in reality, the shipping thread is still waiting for its currently active send to finish, so the notify has no effect.

Now, have a look at MasterController.appendLog (which calls ReplicationLogBuffer.appendLog). Notice that if it gets a LogBufferFullException, it tries to force the log shipper to flush and then retries the append, on the assumption that at least one free buffer has been returned to the pool.

Next, have a look at the code of AsynchronousLogShipper.forceFlush (called at instant 4216). What this code does is try to wake up the sleeping shipper thread with a call to notify(). Sadly, the shipping thread has still not finished its write (it takes 4265 - 4204 = 61 ms), so forceFlush just returns and lets MasterController.appendLog fail for the second time. This time the LogBufferFullException is the kiss of death (4266 log buffer full, force failed).

Notice how the shipping thread still thinks all is hunky-dory: it starts a new ship at instant 4267. But, alas, it is now too late, since the master thread has given up and started to tear down.

So, in conclusion, the logic to force log shipping before we attempt a retry of the log append is flawed.
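The race described above can be reproduced in a small, deterministic model. The sketch below is not Derby's code: the class FlawedForceFlushSketch, its fields, and its methods are hypothetical names chosen for illustration, and the "send" is simulated with a latch. It only assumes what the analysis states, namely that the send runs outside the monitor and that forceFlush merely calls notify() and returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified, hypothetical model of the pre-patch behaviour; not Derby's actual code. */
final class FlawedForceFlushSketch {
    private final Object forceFlushSemaphore = new Object();
    private final AtomicInteger freeBuffers = new AtomicInteger(0); // pool is already empty
    private final CountDownLatch sendStarted = new CountDownLatch(1);
    private final CountDownLatch sendMayFinish = new CountDownLatch(1);

    /** Shipper thread: the send runs outside the monitor. */
    private void shipOneChunk() throws InterruptedException {
        sendStarted.countDown();
        sendMayFinish.await();         // stands in for a slow network send
        freeBuffers.incrementAndGet(); // the buffer is freed only after the send
    }

    /** Pre-patch style forceFlush: notify and return immediately. */
    private void forceFlush() {
        synchronized (forceFlushSemaphore) {
            forceFlushSemaphore.notify(); // nobody is waiting here, so this does nothing
        }
    }

    /** Returns false where the real code would throw LogBufferFullException. */
    private boolean appendLog() {
        // Only one user thread exists in this model, so check-then-decrement is safe.
        if (freeBuffers.get() == 0) {
            return false;
        }
        freeBuffers.decrementAndGet();
        return true;
    }

    /** Runs the failure scenario and reports whether the retried append succeeded. */
    static boolean retryAfterForceFlushSucceeds() {
        FlawedForceFlushSketch m = new FlawedForceFlushSketch();
        Thread shipper = new Thread(() -> {
            try {
                m.shipOneChunk();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        shipper.start();
        try {
            m.sendStarted.await();         // a send is now in flight
            m.appendLog();                 // first attempt fails: no free buffer
            m.forceFlush();                // returns at once, send still in flight
            boolean retry = m.appendLog(); // second attempt fails as well
            m.sendMayFinish.countDown();   // only now does the send complete
            shipper.join();
            return retry;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("retry succeeded: " + retryAfterForceFlushSucceeds());
    }
}
```

In this model the retry cannot succeed, because nothing makes forceFlush wait for the in-flight send to return its buffer.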
Dag H. Wanvik made changes - 15/May/09 11:12 PM
Uploading patch derby-3719-1. This moves the actual log send to inside the forceFlushSemaphore monitor.

The effect of this in the failure scenario (that is, if the user thread calls forceFlush after a send has been initiated) is to hold back the user thread doing forceFlush until the log shipper thread has finished its send. That way, forceFlush will not return until at least (*) ONE sending operation has been initiated and completed, ensuring that at least one buffer has been returned to the free pool. This in turn allows the second attempt (after receiving a LogBufferFullException) to append the log in MasterController.appendLog.

(*) The log shipper thread could possibly race past the user thread and send more than once, but that would not be harmful, because the sending thread would ultimately call notify again, allowing the user thread to continue and find free buffers.

This trace fragment from db_master/derby.log (with the trace patch applied) of the master in a tight spot (when running ReplicationRun_Local_StateTest_part1) shows how the sequence of events changes with the patch:

@1242436667589 Sending
@1242436667590 Sending done
@1242436667590 ship sleep 100
@1242436667667 >= FI_HIGH
@1242436667668 Sending
@1242436667668 >= FI_HIGH
@1242436667670 >= FI_HIGH
@1242436667673 >= FI_HIGH
@1242436667676 >= FI_HIGH
@1242436667679 log buffer full, try to force flush
@1242436667679 forceflush
@1242436667695 Sending done
@1242436667696 Sending
@1242436667696 Sending done
@1242436667696 Sending

Sending takes somewhat long here (7695 - 7668 = 27 ms), and the user thread finds few free buffers left, then finally none, and goes on to force a flush. But with the patch, the call to forceflush at instant 7679 must wait until the shipper thread's send is done, at instant 7695. Since the shipper thread has held the monitor since before the instant it released a free buffer (implying that the user thread cannot have been able to grab it yet!), by the time the user thread gets the monitor on forceFlushSemaphore, the sender is done and a free buffer is guaranteed to have been returned to the pool.

I have now run ReplicationRun_Local_StateTest_part1 for 24 hours without seeing a problem with it (ca. 350 runs). Running full regressions now. Ready for review.
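The effect of moving the send inside the monitor can be illustrated with a small model. This is a sketch under stated assumptions, not the patch itself: PatchedForceFlushSketch and its members are hypothetical names, the send is simulated with a short sleep, and forceFlush is reduced to acquiring the forceFlushSemaphore monitor, leaving out the wait/notify handshake mentioned in the comment.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified, hypothetical model of the patched behaviour; not Derby's actual code. */
final class PatchedForceFlushSketch {
    private final Object forceFlushSemaphore = new Object();
    private final AtomicInteger freeBuffers = new AtomicInteger(0); // pool is already empty
    private final CountDownLatch sendStarted = new CountDownLatch(1);

    /** Shipper thread: the whole send now runs inside the monitor. */
    private void shipOneChunk() throws InterruptedException {
        synchronized (forceFlushSemaphore) {
            sendStarted.countDown();
            Thread.sleep(27);              // simulated slow send; sleep keeps the monitor
            freeBuffers.incrementAndGet(); // buffer is freed before the monitor is released
            forceFlushSemaphore.notifyAll();
        }
    }

    /** Cannot enter the monitor, and so cannot return, while a send is in flight. */
    private void forceFlush() {
        synchronized (forceFlushSemaphore) {
            // Entering the monitor is the whole point in this model: by the time we
            // get here, the in-flight send has completed and returned its buffer.
        }
    }

    /** Returns false where the real code would throw LogBufferFullException. */
    private boolean appendLog() {
        // Only one user thread exists in this model, so check-then-decrement is safe.
        if (freeBuffers.get() == 0) {
            return false;
        }
        freeBuffers.decrementAndGet();
        return true;
    }

    /** Runs the same scenario as the failure case and reports whether the append succeeded. */
    static boolean retryAfterForceFlushSucceeds() {
        PatchedForceFlushSketch m = new PatchedForceFlushSketch();
        Thread shipper = new Thread(() -> {
            try {
                m.shipOneChunk();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        shipper.start();
        try {
            m.sendStarted.await();        // the shipper holds the monitor from here on
            boolean ok = m.appendLog();   // normally fails: the send is still in flight
            if (!ok) {
                m.forceFlush();           // blocks until the shipper leaves the monitor
                ok = m.appendLog();       // a free buffer is now guaranteed
            }
            shipper.join();
            return ok;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("retry succeeded: " + retryAfterForceFlushSucceeds());
    }
}
```

Because the latch is released while the shipper already holds the monitor, the user thread can only get past forceFlush after the buffer has been returned, which is the guarantee the comment argues for.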
Dag H. Wanvik made changes - 16/May/09 06:36 PM
Dag H. Wanvik made changes - 16/May/09 06:36 PM
Dag H. Wanvik made changes - 16/May/09 06:38 PM
Committed patch derby-3719-1 as svn 782991.
Dag H. Wanvik made changes - 09/Jun/09 01:10 PM
Backported this fix to 10.5 branch; committed as svn 782997.
Resolving.
Dag H. Wanvik made changes - 09/Jun/09 01:25 PM
Not seen since commit, so closing.
Ole Solberg made changes - 04/Jul/09 07:42 AM
Kathey Marsden made changes - 16/Jul/09 09:24 PM