
Key: DERBY-3719
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Dag H. Wanvik
Reporter: Ole Solberg
Votes: 0
Watchers: 0
Derby

'...replication.buffer.LogBufferFullException' causes failover to fail w/ 'XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.'

Created: 11/Jun/08 06:52 AM   Updated: 16/Jul/09 09:24 PM
Component/s: Replication
Affects Version/s: 10.4.2.0, 10.5.1.1
Fix Version/s: 10.5.2.0, 10.6.0.0

Time Tracking:
Not Specified

File Attachments:
12.tar.gz (GZip archive), 2008-06-11 06:55 AM, Ole Solberg, 646 kB
derby-3719-1.diff (licensed for inclusion in ASF works), 2009-05-16 06:36 PM, Dag H. Wanvik, 0.7 kB
traceLogShipping.diff (licensed for inclusion in ASF works), 2009-05-15 11:12 PM, Dag H. Wanvik, 8 kB
traceLogShipping.stat (licensed for inclusion in ASF works), 2009-05-15 11:12 PM, Dag H. Wanvik, 0.4 kB
Environment:
HW: 2 X i86pc i386 (AMD Opteron(tm) Processor 252): 2593 MHz, unknown cache. 3968 Megabytes Total Memory.
OS: Solaris 10 5/08 s10x_u5wos_10 X86 64bits - SunOS 5.10 Generic_127128-11
JVM: Sun Microsystems Inc.
    java version "1.6.0_06"
    Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
    Java HotSpot(TM) Client VM (build 10.0-b22, mixed mode)
Issue Links:
Duplicate: is duplicated by DERBY-4231
Reference: relates to DERBY-3709, DERBY-3632

Resolution Date: 09/Jun/09 01:25 PM
Labels:


Description
With the patch for DERBY-3709, derby-3709_p1-v2.diff.txt, I was able to provoke this error twice in 30 test runs on this platform (On another platform I saw none in 100 test runs.)

I will upload the full test run log dir.

"Summary":

1) testReplication_Local_StateTest_part2(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part2)junit.framework.ComparisonFailure: Unexpected SQL state. expected:<XRE[20]> but was:<XRE[07]>



Master derby.log:
-----------------------------------------
---- BEGIN REPLICATION ERROR MESSAGE (6/10/08 4:08 PM) ----
Exception occurred during log shipping.
org.apache.derby.impl.store.replication.buffer.LogBufferFullException
at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.switchDirtyBuffer(ReplicationLogBuffer.java:357)
at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.appendLog(ReplicationLogBuffer.java:146)
at org.apache.derby.impl.store.replication.master.MasterController.appendLog(MasterController.java:428)
at org.apache.derby.impl.store.raw.log.LogAccessFile.writeToLog(LogAccessFile.java:787)
at org.apache.derby.impl.store.raw.log.LogAccessFile.flushDirtyBuffers(LogAccessFile.java:534)
at org.apache.derby.impl.store.raw.log.LogAccessFile.flushLogAccessFile(LogAccessFile.java:574)
at org.apache.derby.impl.store.raw.log.LogAccessFile.writeLogRecord(LogAccessFile.java:332)
at org.apache.derby.impl.store.raw.log.LogToFile.appendLogRecord(LogToFile.java:3759)
at org.apache.derby.impl.store.raw.log.FileLogger.logAndDo(FileLogger.java:370)
at org.apache.derby.impl.store.raw.xact.Xact.logAndDo(Xact.java:1193)
at org.apache.derby.impl.store.raw.data.LoggableActions.doAction(LoggableActions.java:221)
at org.apache.derby.impl.store.raw.data.LoggableActions.actionUpdate(LoggableActions.java:85)
at org.apache.derby.impl.store.raw.data.StoredPage.doUpdateAtSlot(StoredPage.java:8463)
at org.apache.derby.impl.store.raw.data.StoredPage.updateOverflowDetails(StoredPage.java:8336)
at org.apache.derby.impl.store.raw.data.StoredPage.updateOverflowDetails(StoredPage.java:8319)
at org.apache.derby.impl.store.raw.data.BasePage.insertAllowOverflow(BasePage.java:808)
at org.apache.derby.impl.store.raw.data.BasePage.insert(BasePage.java:653)
at org.apache.derby.impl.store.access.heap.HeapController.doInsert(HeapController.java:307)
at org.apache.derby.impl.store.access.heap.HeapController.insert(HeapController.java:575)
at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(RowChangerImpl.java:457)
at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(InsertResultSet.java:1011)
at org.apache.derby.impl.sql.execute.InsertResultSet.open(InsertResultSet.java:487)
at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:384)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1235)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1652)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1307)
at org.apache.derby.impl.drda.DRDAStatement.execute(DRDAStatement.java:672)
at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTTobjects(DRDAConnThread.java:4197)
at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTT(DRDAConnThread.java:4001)
at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:991)
at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:278)

-------------------- END REPLICATION ERROR MESSAGE ---------------------


Slave derby.log:
-------------------------------------------------------------------------------------------
2008-06-10 14:05:56.408 GMT Thread[DRDAConnThread_3,5,main] (DATABASE = /export/home/tmp/os136789/testingInMyDerbySandbox/12/db_slave/wombat), (DRDAID = {2}), Replication slave mode started successfully for database '/export/home/tmp/os136789/testingInMyDerbySandbox/12/db_slave/wombat'. Connection refused because the database is in replication slave mode.
Replication slave role was stopped for database '/export/home/tmp/os136789/testingInMyDerbySandbox/12/db_slave/wombat'.

------------ BEGIN SHUTDOWN ERROR STACK -------------

ERROR XSLA7: Cannot redo operation null in the log.
at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:296)
at org.apache.derby.impl.store.raw.log.FileLogger.redo(FileLogger.java:1525)
at org.apache.derby.impl.store.raw.log.LogToFile.recover(LogToFile.java:920)
at org.apache.derby.impl.store.raw.RawStore.boot(RawStore.java:334)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1999)
at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:553)
at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
at org.apache.derby.impl.store.access.RAMAccessManager.boot(RAMAccessManager.java:1019)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1999)
at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:553)
at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
at org.apache.derby.impl.db.BasicDatabase.bootStore(BasicDatabase.java:780)
at org.apache.derby.impl.db.BasicDatabase.boot(BasicDatabase.java:196)
at org.apache.derby.impl.db.SlaveDatabase.bootBasicDatabase(SlaveDatabase.java:424)
at org.apache.derby.impl.db.SlaveDatabase.access$000(SlaveDatabase.java:70)
at org.apache.derby.impl.db.SlaveDatabase$SlaveDatabaseBootThread.run(SlaveDatabase.java:311)
at java.lang.Thread.run(Thread.java:619)
Caused by: ERROR 08006: Database '{0}' shutdown.
at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:276)
at org.apache.derby.impl.store.raw.log.LogToFile.stopReplicationSlaveRole(LogToFile.java:5142)
at org.apache.derby.impl.store.replication.slave.SlaveController.stopSlave(SlaveController.java:266)
at org.apache.derby.impl.store.replication.slave.SlaveController.access$500(SlaveController.java:64)
at org.apache.derby.impl.store.replication.slave.SlaveController$SlaveLogReceiverThread.run(SlaveController.java:531)
============= begin nested exception, level (1) ===========
ERROR 08006: Database '{0}' shutdown.
at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:276)
at org.apache.derby.impl.store.raw.log.LogToFile.stopReplicationSlaveRole(LogToFile.java:5142)
at org.apache.derby.impl.store.replication.slave.SlaveController.stopSlave(SlaveController.java:266)
at org.apache.derby.impl.store.replication.slave.SlaveController.access$500(SlaveController.java:64)
at org.apache.derby.impl.store.replication.slave.SlaveController$SlaveLogReceiverThread.run(SlaveController.java:531)
============= end nested exception, level (1) ===========


------------ END SHUTDOWN ERROR STACK -------------


Ole Solberg made changes - 11/Jun/08 06:53 AM
Field Original Value New Value
Link This issue relates to DERBY-3709 [ DERBY-3709 ]
Ole Solberg added a comment - 11/Jun/08 06:55 AM
Full test run log.

Ole Solberg made changes - 11/Jun/08 06:55 AM
Attachment 12.tar.gz [ 12383807 ]
Kathey Marsden added a comment - 06/Apr/09 05:22 PM
I saw a similar issue in 10.4.2.0 -> 10.5.1.0 hard upgrade testing, which I am assuming is another manifestation of this bug.
1) testReplication_Local_StateTest_part1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1)junit.framework.ComparisonFailure: Unexpected SQL state. expected:<...20> but was:<...07>
at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:760)
at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:809)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1381)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver(ReplicationRun.java:1302)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1.testReplication_Local_StateTest_part1(ReplicationRun_Local_StateTest_part1.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
at junit.extensions.TestSetup.run(TestSetup.java:23)
Caused by: java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.
at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
at org.apache.derby.jdbc.ClientDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:316)
at java.sql.DriverManager.getConnection(DriverManager.java:273)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1368)
... 28 more
Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.
at org.apache.derby.client.am.Connection.completeSqlca(Unknown Source)
at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(Unknown Source)
at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(Unknown Source)
at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(Unknown Source)
at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(Unknown Source)
at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(Unknown Source)
at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(Unknown Source)
at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(Unknown Source)
at org.apache.derby.client.net.NetConnection.flowConnect(Unknown Source)
at org.apache.derby.client.net.NetConnection.<init>(Unknown Source)
at org.apache.derby.client.net.NetConnection40.<init>(Unknown Source)
at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(Unknown Source)
... 32 more


Myrna van Lunteren made changes - 04/May/09 06:22 PM
Affects Version/s 10.5.1.1 [ 12313771 ]
Affects Version/s 10.5.0.0 [ 12313010 ]
Ole Solberg made changes - 15/May/09 09:36 AM
Link This issue relates to DERBY-3632 [ DERBY-3632 ]
Dag H. Wanvik made changes - 15/May/09 01:57 PM
Link This issue is duplicated by DERBY-4231 [ DERBY-4231 ]
Dag H. Wanvik added a comment - 15/May/09 11:12 PM
This log from master's log file shows what happens. The output is
produced by the patch attached (traceLogShipping):

@1242410514203 Sending done
@1242410514205 >= FI_HIGH
@1242410514206 >= FI_HIGH
@1242410514204 Sending
@1242410514208 >= FI_HIGH
@1242410514211 >= FI_HIGH
@1242410514216 log buffer full, try to force flush
@1242410514216 forceflush
@1242410514265 Sending done
@1242410514266 log buffer full, force failed
---- BEGIN REPLICATION ERROR MESSAGE (15.05.09 20:01) ----
@1242410514267 Sending
@1242410514286 Sending done
@1242410514286 Sending
Exception occurred during log shipping.
org.apache.derby.impl.store.replication.buffer.LogBufferFullException
at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.switchDirtyBuffer(ReplicationLogBuffer.java:357)

The asynchronous log shipper basically runs this loop:

       while (true) {
            ship a log chunk

            if (!<things are busy>) {
                wait(shippingInterval)
            }
       }

From derby.log, we see that the sending of a chunk starts at instant
4204, and sending is complete at 4265.

In the meantime, the user thread is busy writing log using
ReplicationLogBuffer.appendLog. Now, the buffer is getting full as
seen in the "instant 4208: >= FI_HIGH" (ReplicationLogBuffer calls
switchDirtyBuffer which will return a free buffer if there is one,
then it calls MasterController.workToDo to make sure the shipping
thread knows).

Notice the send is still waiting to complete. Now, another log write
happens, and again we see the indication that we are close to having
0 free buffers (instant 4211: >= FI_HIGH). This is the second time workToDo is
called. In both cases, AsynchronousLogShipper.workToDo does a notify
to try to wake up the thread, on the assumption that it may be sleeping in
the "wait(shippingInterval)" seen above. But in reality, the shipping
thread is still waiting for its currently active send to finish,
so this has no effect.

Now, have a look at MasterController.appendLog (which calls
ReplicationLogBuffer.appendLog). Notice that if it gets a
LogBufferFullException it will try to force the log shipper to flush,
and then retry the append on the assumption that at least one free
buffer has been returned to the pool. Now, have a look at the code of
ASL.forceFlush (called at instant 4216). What this code does is to try
to wake up the sleeping shipper thread with the call to
notify(). Sadly, the shipping thread is still not finished with its
write (it takes 4265 - 4204 = 61 ms), so the forceflush just returns and
allows MasterController.appendLog to fail for the second time. And
this time the LogBufferFullException is the kiss of death (4266 log
buffer full, force failed).

Notice how the shipping thread still thinks all is hunky dory: it
starts a new send at instant 4267, but alas, it is now too late, since
the master thread has given up and started to tear down.

So, in conclusion, the logic to force log shipping before we attempt a
retry of the log append is flawed.
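
The append-and-retry path described above can be sketched in Java as
follows. This is a minimal illustration under stated assumptions, not
Derby's code: RetrySketch, LogBuffer, Shipper, Master and
survivesFullBuffer are hypothetical stand-ins, and the one-slot buffer
is chosen only to force the full-buffer case. It shows why a forceFlush
that returns without a buffer having been freed makes the single retry
fail.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: every name below is a hypothetical stand-in, not a Derby class.
class RetrySketch {
    static class BufferFullException extends Exception {}

    // A bounded pool of log chunks waiting to be shipped.
    static class LogBuffer {
        private final Deque<byte[]> dirty = new ArrayDeque<>();
        private final int capacity;

        LogBuffer(int capacity) { this.capacity = capacity; }

        synchronized void append(byte[] chunk) throws BufferFullException {
            if (dirty.size() >= capacity) throw new BufferFullException();
            dirty.addLast(chunk);
        }

        // Models the shipper returning one sent buffer to the free pool.
        synchronized void releaseOne() { dirty.pollFirst(); }
    }

    interface Shipper { void forceFlush(); }

    static class Master {
        private final LogBuffer buffer;
        private final Shipper shipper;
        private boolean replicating = true;

        Master(LogBuffer buffer, Shipper shipper) {
            this.buffer = buffer;
            this.shipper = shipper;
        }

        boolean isReplicating() { return replicating; }

        void appendLog(byte[] chunk) {
            try {
                buffer.append(chunk);
            } catch (BufferFullException first) {
                shipper.forceFlush();          // ask the shipper to free a buffer
                try {
                    buffer.append(chunk);      // retry exactly once
                } catch (BufferFullException second) {
                    replicating = false;       // second failure: replication stops
                }
            }
        }
    }

    // Fills a one-slot buffer, then appends again so the retry path runs.
    static boolean survivesFullBuffer(boolean flushFreesBuffer) {
        LogBuffer buffer = new LogBuffer(1);
        Shipper shipper = flushFreesBuffer ? buffer::releaseOne : () -> { };
        Master master = new Master(buffer, shipper);
        master.appendLog(new byte[1]);
        master.appendLog(new byte[1]);
        return master.isReplicating();
    }

    public static void main(String[] args) {
        System.out.println("flush returns without freeing a buffer: " + survivesFullBuffer(false));
        System.out.println("flush frees a buffer first:             " + survivesFullBuffer(true));
    }
}
```

In this sketch, survivesFullBuffer(false) models the failure above (the
flush request has no effect, so the retry hits a full buffer again),
while survivesFullBuffer(true) models a flush that only returns once a
buffer is free.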

Dag H. Wanvik made changes - 15/May/09 11:12 PM
Attachment traceLogShipping.diff [ 12408289 ]
Attachment traceLogShipping.stat [ 12408290 ]
Dag H. Wanvik added a comment - 16/May/09 06:36 PM
Uploading patch derby-3719-1. This moves the actual log send inside
the forceFlushSemaphore monitor. In the failure scenario (that is,
when the user thread calls forceFlush after a send has been
initiated), the effect is to hold back the user thread doing
forceFlush until the log shipper thread has finished its send. That
way, forceFlush will not return until at least (*) ONE sending
operation has been initiated and completed, ensuring that at least
one buffer has been returned to the free pool. This in turn allows
the second attempt (after receiving a LogBufferFullException) to
append the log in MasterController.appendLog.

(*) the log shipper thread could possibly race past the user thread
and send more than once, but that would not be harmful because the
sending thread would ultimately call notify again, allowing the user
thread to continue and find free buffers.
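
The idea of doing the send inside the monitor can be sketched in Java
like this. It is a simplified model under stated assumptions rather
than the patch itself: MonitorShipperSketch, sleepLock, the 100 ms
sleep, the 5000 ms wait bound and the Thread.sleep used to simulate a
slow send are all made up for illustration. Only the name
forceFlushSemaphore is taken from the description above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: names, timeouts and structure are assumptions, not Derby's code.
class MonitorShipperSketch {
    private final Object forceFlushSemaphore = new Object();
    private final Object sleepLock = new Object();
    private final AtomicInteger completedSends = new AtomicInteger();
    private final CountDownLatch firstSendStarted = new CountDownLatch(1);
    private final long sendMillis;
    private volatile boolean stopped;

    MonitorShipperSketch(long sendMillis) { this.sendMillis = sendMillis; }

    // Shipper thread: each send runs while holding forceFlushSemaphore.
    void shipLoop() throws InterruptedException {
        while (!stopped) {
            synchronized (forceFlushSemaphore) {
                firstSendStarted.countDown();
                Thread.sleep(sendMillis);            // simulated slow network send
                completedSends.incrementAndGet();    // a buffer is free again
                forceFlushSemaphore.notifyAll();     // wake any waiting forceFlush
            }
            synchronized (sleepLock) {
                if (!stopped) sleepLock.wait(100);   // stands in for wait(shippingInterval)
            }
        }
    }

    // User thread: cannot enter the monitor while a send is in progress.
    void forceFlush() throws InterruptedException {
        synchronized (forceFlushSemaphore) {
            synchronized (sleepLock) { sleepLock.notify(); }  // wake a sleeping shipper
            forceFlushSemaphore.wait(5000);                   // bounded wait for the next send
        }
    }

    void stop() {
        stopped = true;
        synchronized (sleepLock) { sleepLock.notify(); }
    }

    // Calls forceFlush while a send is in flight; reports sends completed when it returns.
    static int sendsCompletedWhenForceFlushReturns(long sendMillis) {
        MonitorShipperSketch shipper = new MonitorShipperSketch(sendMillis);
        Thread thread = new Thread(() -> {
            try { shipper.shipLoop(); } catch (InterruptedException ignored) { }
        });
        thread.setDaemon(true);
        thread.start();
        try {
            shipper.firstSendStarted.await();   // a send is now in progress
            shipper.forceFlush();
            return shipper.completedSends.get();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            shipper.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("sends completed: " + sendsCompletedWhenForceFlushReturns(50));
    }
}
```

Because the user thread has to acquire the same monitor the shipper
holds for the whole send, forceFlush in this sketch cannot return
before at least one send has completed. The cost of this design is
that the user thread is held back for the duration of the send.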

This trace fragment from db_master/derby.log (with the trace patch applied), taken
with the master in a tight spot (when running ReplicationRun_Local_StateTest_part1), shows
how the sequence of events changes with the patch:

@1242436667589 Sending
@1242436667590 Sending done
@1242436667590 ship sleep 100
@1242436667667 >= FI_HIGH
@1242436667668 Sending
@1242436667668 >= FI_HIGH
@1242436667670 >= FI_HIGH
@1242436667673 >= FI_HIGH
@1242436667676 >= FI_HIGH
@1242436667679 log buffer full, try to force flush
@1242436667679 forceflush
@1242436667695 Sending done
@1242436667696 Sending
@1242436667696 Sending done
@1242436667696 Sending

Sending takes somewhat long here (7695 - 7668 = 27 ms), and the
user thread finds few free buffers left, then finally none, and
goes on to force a flush. But with the patch, the call to forceflush
at instant 7679 must wait until the shipper thread's send is done, at
instant 7695. Since the shipper thread has held the monitor since
before the instant it released a free buffer (implying that the user
thread cannot have been able to grab it yet!), by the time the user
thread gets the monitor on forceFlushSemaphore, the sender is done,
and a free buffer is guaranteed to have been returned to the pool.

I have run ReplicationRun_Local_StateTest_part1 now for 24 hours
without seeing a problem with it (ca 350 runs). Running full
regressions now.

Ready for review.

Dag H. Wanvik made changes - 16/May/09 06:36 PM
Attachment derby-3719-1.diff [ 12408320 ]
Dag H. Wanvik made changes - 16/May/09 06:36 PM
Assignee Dag H. Wanvik [ dagw ]
Dag H. Wanvik made changes - 16/May/09 06:38 PM
Fix Version/s 10.6.0.0 [ 12313727 ]
Derby Info [Patch Available]
Repository Revision Date User Message
ASF #782991 Tue Jun 09 13:09:10 UTC 2009 dag DERBY-3719 ...replication.buffer.LogBufferFullException' causes failover to fail w/ 'XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.

Patch DERBY-3719-1 fixes a race condition in the logic that throttles
log production if shipping can't keep up. The bug made replication stop.
Files Changed
MODIFY /db/derby/code/trunk/java/engine/org/apache/derby/impl/store/replication/master/AsynchronousLogShipper.java

Dag H. Wanvik added a comment - 09/Jun/09 01:10 PM
Committed patch derby-3719-1 as svn 782991.

Dag H. Wanvik made changes - 09/Jun/09 01:10 PM
Fix Version/s 10.5.1.2 [ 12313870 ]
Derby Info [Patch Available]
Repository Revision Date User Message
ASF #782997 Tue Jun 09 13:24:13 UTC 2009 dag DERBY-3719 ...replication.buffer.LogBufferFullException' causes failover to fail w/ 'XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.

Backported fix from trunk as

svn merge -c 782991 https://svn.eu.apache.org/repos/asf/db/derby/code/trunk
Files Changed
MODIFY /db/derby/code/branches/10.5/java/engine/org/apache/derby/impl/store/replication/master/AsynchronousLogShipper.java
MODIFY /db/derby/code/branches/10.5

Dag H. Wanvik added a comment - 09/Jun/09 01:25 PM
Backported this fix to 10.5 branch; committed as svn 782997.
Resolving.


Dag H. Wanvik made changes - 09/Jun/09 01:25 PM
Status Open [ 1 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Ole Solberg added a comment - 04/Jul/09 07:42 AM
Not seen since commit, so closing.

Ole Solberg made changes - 04/Jul/09 07:42 AM
Status Resolved [ 5 ] Closed [ 6 ]
Kathey Marsden made changes - 16/Jul/09 09:24 PM
Fix Version/s 10.5.2.0 [ 12314116 ]
Fix Version/s 10.5.1.2 [ 12313870 ]