DERBY-3719

'...replication.buffer.LogBufferFullException' causes failover to fail w/ 'XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.'

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.4.2.0, 10.5.1.1
    • Fix Version/s: 10.5.2.0, 10.6.1.0
    • Component/s: Replication
    • Labels:
      None
    • Environment:

      Description

      With the patch for DERBY-3709 (derby-3709_p1-v2.diff.txt), I was able to provoke this error twice in 30 test runs on this platform. (On another platform I saw none in 100 test runs.)

      I will upload the full test run log dir.

      "Summary":

      1) testReplication_Local_StateTest_part2(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part2)junit.framework.ComparisonFailure: Unexpected SQL state. expected:<XRE[20]> but was:<XRE[07]>

      Master derby.log:
      -----------------------------------------
      ---- BEGIN REPLICATION ERROR MESSAGE (6/10/08 4:08 PM) ----
      Exception occurred during log shipping.
      org.apache.derby.impl.store.replication.buffer.LogBufferFullException
      at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.switchDirtyBuffer(ReplicationLogBuffer.java:357)
      at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.appendLog(ReplicationLogBuffer.java:146)
      at org.apache.derby.impl.store.replication.master.MasterController.appendLog(MasterController.java:428)
      at org.apache.derby.impl.store.raw.log.LogAccessFile.writeToLog(LogAccessFile.java:787)
      at org.apache.derby.impl.store.raw.log.LogAccessFile.flushDirtyBuffers(LogAccessFile.java:534)
      at org.apache.derby.impl.store.raw.log.LogAccessFile.flushLogAccessFile(LogAccessFile.java:574)
      at org.apache.derby.impl.store.raw.log.LogAccessFile.writeLogRecord(LogAccessFile.java:332)
      at org.apache.derby.impl.store.raw.log.LogToFile.appendLogRecord(LogToFile.java:3759)
      at org.apache.derby.impl.store.raw.log.FileLogger.logAndDo(FileLogger.java:370)
      at org.apache.derby.impl.store.raw.xact.Xact.logAndDo(Xact.java:1193)
      at org.apache.derby.impl.store.raw.data.LoggableActions.doAction(LoggableActions.java:221)
      at org.apache.derby.impl.store.raw.data.LoggableActions.actionUpdate(LoggableActions.java:85)
      at org.apache.derby.impl.store.raw.data.StoredPage.doUpdateAtSlot(StoredPage.java:8463)
      at org.apache.derby.impl.store.raw.data.StoredPage.updateOverflowDetails(StoredPage.java:8336)
      at org.apache.derby.impl.store.raw.data.StoredPage.updateOverflowDetails(StoredPage.java:8319)
      at org.apache.derby.impl.store.raw.data.BasePage.insertAllowOverflow(BasePage.java:808)
      at org.apache.derby.impl.store.raw.data.BasePage.insert(BasePage.java:653)
      at org.apache.derby.impl.store.access.heap.HeapController.doInsert(HeapController.java:307)
      at org.apache.derby.impl.store.access.heap.HeapController.insert(HeapController.java:575)
      at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(RowChangerImpl.java:457)
      at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(InsertResultSet.java:1011)
      at org.apache.derby.impl.sql.execute.InsertResultSet.open(InsertResultSet.java:487)
      at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:384)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1235)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1652)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1307)
      at org.apache.derby.impl.drda.DRDAStatement.execute(DRDAStatement.java:672)
      at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTTobjects(DRDAConnThread.java:4197)
      at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTT(DRDAConnThread.java:4001)
      at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:991)
      at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:278)

      -------------------- END REPLICATION ERROR MESSAGE ---------------------

      Slave derby.log:
      -------------------------------------------------------------------------------------------
      2008-06-10 14:05:56.408 GMT Thread[DRDAConnThread_3,5,main] (DATABASE = /export/home/tmp/os136789/testingInMyDerbySandbox/12/db_slave/wombat), (DRDAID = {2}), Replication slave mode started successfully for database '/export/home/tmp/os136789/testingInMyDerbySandbox/12/db_slave/wombat'. Connection refused because the database is in replication slave mode.
      Replication slave role was stopped for database '/export/home/tmp/os136789/testingInMyDerbySandbox/12/db_slave/wombat'.

      ------------ BEGIN SHUTDOWN ERROR STACK -------------

      ERROR XSLA7: Cannot redo operation null in the log.
      at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:296)
      at org.apache.derby.impl.store.raw.log.FileLogger.redo(FileLogger.java:1525)
      at org.apache.derby.impl.store.raw.log.LogToFile.recover(LogToFile.java:920)
      at org.apache.derby.impl.store.raw.RawStore.boot(RawStore.java:334)
      at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1999)
      at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
      at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:553)
      at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
      at org.apache.derby.impl.store.access.RAMAccessManager.boot(RAMAccessManager.java:1019)
      at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1999)
      at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
      at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:553)
      at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
      at org.apache.derby.impl.db.BasicDatabase.bootStore(BasicDatabase.java:780)
      at org.apache.derby.impl.db.BasicDatabase.boot(BasicDatabase.java:196)
      at org.apache.derby.impl.db.SlaveDatabase.bootBasicDatabase(SlaveDatabase.java:424)
      at org.apache.derby.impl.db.SlaveDatabase.access$000(SlaveDatabase.java:70)
      at org.apache.derby.impl.db.SlaveDatabase$SlaveDatabaseBootThread.run(SlaveDatabase.java:311)
      at java.lang.Thread.run(Thread.java:619)
      Caused by: ERROR 08006: Database '{0}' shutdown.
      at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:276)
      at org.apache.derby.impl.store.raw.log.LogToFile.stopReplicationSlaveRole(LogToFile.java:5142)
      at org.apache.derby.impl.store.replication.slave.SlaveController.stopSlave(SlaveController.java:266)
      at org.apache.derby.impl.store.replication.slave.SlaveController.access$500(SlaveController.java:64)
      at org.apache.derby.impl.store.replication.slave.SlaveController$SlaveLogReceiverThread.run(SlaveController.java:531)
      ============= begin nested exception, level (1) ===========
      ERROR 08006: Database '{0}' shutdown.
      at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:276)
      at org.apache.derby.impl.store.raw.log.LogToFile.stopReplicationSlaveRole(LogToFile.java:5142)
      at org.apache.derby.impl.store.replication.slave.SlaveController.stopSlave(SlaveController.java:266)
      at org.apache.derby.impl.store.replication.slave.SlaveController.access$500(SlaveController.java:64)
      at org.apache.derby.impl.store.replication.slave.SlaveController$SlaveLogReceiverThread.run(SlaveController.java:531)
      ============= end nested exception, level (1) ===========

      ------------ END SHUTDOWN ERROR STACK -------------

      1. 12.tar.gz
        646 kB
        Ole Solberg
      2. derby-3719-1.diff
        0.7 kB
        Dag H. Wanvik
      3. traceLogShipping.diff
        8 kB
        Dag H. Wanvik
      4. traceLogShipping.stat
        0.4 kB
        Dag H. Wanvik

          Activity

          Ole Solberg added a comment -

          Full test run log.

          Kathey Marsden added a comment -

          I saw a similar issue in 10.4.2.0 -> 10.5.1.0 hard upgrade testing, which I am assuming is another manifestation of this bug.
          1) testReplication_Local_StateTest_part1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1)junit.framework.ComparisonFailure: Unexpected SQL state. expected:<...20> but was:<...07>
          at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:760)
          at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:809)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1381)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver(ReplicationRun.java:1302)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1.testReplication_Local_StateTest_part1(ReplicationRun_Local_StateTest_part1.java:160)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
          at junit.extensions.TestSetup.run(TestSetup.java:23)
          Caused by: java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.
          at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(Unknown Source)
          at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
          at org.apache.derby.jdbc.ClientDriver.connect(Unknown Source)
          at java.sql.DriverManager.getConnection(DriverManager.java:316)
          at java.sql.DriverManager.getConnection(DriverManager.java:273)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1368)
          ... 28 more
          Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE07, SQLERRMC: Could not perform operation because the database is not in replication master mode.
          at org.apache.derby.client.am.Connection.completeSqlca(Unknown Source)
          at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(Unknown Source)
          at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(Unknown Source)
          at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(Unknown Source)
          at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(Unknown Source)
          at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(Unknown Source)
          at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(Unknown Source)
          at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(Unknown Source)
          at org.apache.derby.client.net.NetConnection.flowConnect(Unknown Source)
          at org.apache.derby.client.net.NetConnection.<init>(Unknown Source)
          at org.apache.derby.client.net.NetConnection40.<init>(Unknown Source)
          at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(Unknown Source)
          ... 32 more

          Dag H. Wanvik added a comment -

          This log from master's log file shows what happens. The output is
          produced by the patch attached (traceLogShipping):

          @1242410514203 Sending done
          @1242410514205 >= FI_HIGH
          @1242410514206 >= FI_HIGH
          @1242410514204 Sending
          @1242410514208 >= FI_HIGH
          @1242410514211 >= FI_HIGH
          @1242410514216 log buffer full, try to force flush
          @1242410514216 forceflush
          @1242410514265 Sending done
          @1242410514266 log buffer full, force failed
          ---- BEGIN REPLICATION ERROR MESSAGE (15.05.09 20:01) ----
          @1242410514267 Sending
          @1242410514286 Sending done
          @1242410514286 Sending
          Exception occurred during log shipping.
          org.apache.derby.impl.store.replication.buffer.LogBufferFullException
          at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.switchDirtyBuffer(ReplicationLogBuffer.java:357)

          The asynchronous log shipper basically runs this loop:

          while (true) {
              ship a log chunk
              if ! <things are busy>
                  wait(shippingInterval)
              fi
          }

          From derby.log, we see that the sending of a chunk starts at instant
          4204, and sending is complete at 4265 (instants are the last four
          digits of the trace timestamps).

          In the meantime, the user thread is busy writing log using
          ReplicationLogBuffer.appendLog. Now, the buffer is getting full as
          seen in the "instant 4208: >= FI_HIGH" (ReplicationLogBuffer calls
          switchDirtyBuffer which will return a free buffer if there is one,
          then it calls MasterController.workToDo to make sure the shipping
          thread knows).

          Notice the send is still waiting to complete. Now, another log write
          happens, and again, we see this indication that we are close to having
          0 free buffers (instant 4211: >= FI_HIGH). This is the second time workToDo is
          called. In both cases, AsynchronousLogShipper.workToDo does a notify
          to try to wake up the thread on the assumption that it may be sleeping in
          the "wait(shippingInterval)" seen above. But in reality, the shipping
          thread is still busy with its current send, so the notify has no
          effect.

          Now, have a look at MasterController.appendLog (which calls
          ReplicationLogBuffer.appendLog). Notice that if it gets a
          LogBufferFullException it will try to force the log shipper to flush,
          and then retry the append on the assumption that at least one free
          buffer has been returned to the pool. Now, have a look at the code of
          AsynchronousLogShipper.forceFlush (called at instant 4216). What this
          code does is to try to wake up the sleeping shipper thread with the
          call to notify(). Sadly, the shipping thread is still not finished
          with its write (it takes 4265 - 4204 = 61 ms), so the forceflush just
          returns and allows MasterController.appendLog to fail for the second
          time. And this time the LogBufferFullException is the kiss of death
          (4266 log buffer full, force failed).

          Notice how the shipping thread still thinks all is hunky dory, it
          starts a new ship at instant 4267, but, alas it is now too late, since
          the master thread has given up, and started to tear down.

          So, in conclusion, the logic to force log shipping before we attempt a
          retry of the log append is flawed.
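
          The race described above can be reproduced in a small model. The
          following is only a sketch under stated assumptions, not Derby
          source code: the class FlawedForceFlushSketch, its method names,
          and the two latches are hypothetical, and the buffer pool is
          reduced to a counter. It shows a force flush that only notifies
          and returns while a send is still in flight, so the single retry
          finds no free buffer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the race; none of this is Derby source code. */
class FlawedForceFlushSketch {
    private final Object shipperMonitor = new Object();
    private final AtomicInteger freeBuffers = new AtomicInteger(0); // pool starts empty
    private final CountDownLatch sendStarted = new CountDownLatch(1);
    private final CountDownLatch letSendFinish = new CountDownLatch(1);

    /** Shipper thread: the slow send runs outside any monitor. */
    void shipOneChunk() {
        sendStarted.countDown();
        awaitQuietly(letSendFinish);   // stands in for a slow network write
        freeBuffers.incrementAndGet(); // the buffer is freed only after the send
    }

    /** Flawed force flush: notifies a shipper that may not be sleeping, then returns at once. */
    void forceFlush() {
        synchronized (shipperMonitor) {
            shipperMonitor.notify();
        }
    }

    /** Takes one free buffer if there is one. */
    boolean tryAppend() {
        while (true) {
            int n = freeBuffers.get();
            if (n == 0) {
                return false;
            }
            if (freeBuffers.compareAndSet(n, n - 1)) {
                return true;
            }
        }
    }

    /** Appends; on a full buffer, forces a flush and retries exactly once. */
    boolean appendLog() {
        if (tryAppend()) {
            return true;
        }
        forceFlush();
        return tryAppend();
    }

    /** Runs the failure scenario and reports whether the append succeeded. */
    static boolean runScenario() {
        FlawedForceFlushSketch m = new FlawedForceFlushSketch();
        Thread shipper = new Thread(m::shipOneChunk);
        shipper.start();
        awaitQuietly(m.sendStarted);      // a send is now in flight
        boolean appended = m.appendLog(); // both attempts find no free buffer
        m.letSendFinish.countDown();      // the send completes, but too late
        try {
            shipper.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return appended;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

          In this model runScenario() returns false: the notify reaches a
          shipper that is busy sending rather than sleeping, which mirrors
          the "log buffer full, force failed" line in the trace.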

          Dag H. Wanvik added a comment -

          Uploading patch derby-3719-1. This moves the actual log send to inside
          the forceFlushSemaphore monitor. The effect of this in the failure
          scenario (that is, if the user thread called forceFlush after a send
          has been initiated) is to hold back the user thread doing forceFlush
          until the log shipper thread has finished its send. That way,
          forceFlush will not return until at least ONE sending operation
          has been initiated and completed, ensuring that at least one buffer
          has been returned to the free pool. This in turn allows the 2nd attempt
          (after receiving a LogBufferFullException) to append the
          log in MasterController.appendLog.

          The log shipper thread could possibly race past the user thread
          and send more than once, but that would not be harmful because the
          sending thread would ultimately call notify again, allowing the user
          thread to continue and find free buffers.
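
          The idea of the patch can be sketched in the same kind of model.
          Again this is a hypothetical illustration, not the contents of
          derby-3719-1.diff: only the monitor name forceFlushSemaphore comes
          from the description above, and the class, the other names, and
          the 50 ms sleep that simulates a slow send are assumptions. The
          send and the release of the buffer both happen while the shipper
          holds the monitor, so a forceFlush issued during the send cannot
          return before a buffer is free.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the fix; none of this is Derby source code. */
class SendInsideMonitorSketch {
    private final Object forceFlushSemaphore = new Object();
    private final AtomicInteger freeBuffers = new AtomicInteger(0); // pool starts empty
    private final CountDownLatch sendStarted = new CountDownLatch(1);

    /** Shipper thread: the send now runs while holding the monitor. */
    void shipOneChunk() {
        synchronized (forceFlushSemaphore) {
            sendStarted.countDown();
            try {
                Thread.sleep(50);          // stands in for a slow network write
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
            freeBuffers.incrementAndGet(); // buffer freed before the monitor is released
            forceFlushSemaphore.notify();
        }
    }

    /** Blocks on the monitor for as long as a send is in flight. */
    void forceFlush() {
        synchronized (forceFlushSemaphore) {
            forceFlushSemaphore.notify();
        }
    }

    /** Takes one free buffer if there is one. */
    boolean tryAppend() {
        while (true) {
            int n = freeBuffers.get();
            if (n == 0) {
                return false;
            }
            if (freeBuffers.compareAndSet(n, n - 1)) {
                return true;
            }
        }
    }

    /** Appends; on a full buffer, forces a flush and retries exactly once. */
    boolean appendLog() {
        if (tryAppend()) {
            return true;
        }
        forceFlush();
        return tryAppend();
    }

    /** Runs the scenario where forceFlush is called after a send has started. */
    static boolean runScenario() {
        SendInsideMonitorSketch m = new SendInsideMonitorSketch();
        Thread shipper = new Thread(m::shipOneChunk);
        shipper.start();
        try {
            m.sendStarted.await();            // the shipper holds the monitor and is sending
            boolean appended = m.appendLog(); // the retry runs only after the send is done
            shipper.join();
            return appended;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

          Here runScenario() returns true. The sketch covers only the case
          where the send is already in flight when forceFlush is called; it
          does not model the shipper's sleep between sends.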

          This trace fragment from the master's db_master/derby.log (with the
          trace patch applied), taken in a tight spot when running
          ReplicationRun_Local_StateTest_part1, shows how the sequence of
          events changes with the patch:

          @1242436667589 Sending
          @1242436667590 Sending done
          @1242436667590 ship sleep 100
          @1242436667667 >= FI_HIGH
          @1242436667668 Sending
          @1242436667668 >= FI_HIGH
          @1242436667670 >= FI_HIGH
          @1242436667673 >= FI_HIGH
          @1242436667676 >= FI_HIGH
          @1242436667679 log buffer full, try to force flush
          @1242436667679 forceflush
          @1242436667695 Sending done
          @1242436667696 Sending
          @1242436667696 Sending done
          @1242436667696 Sending

          Sending takes somewhat long here (7695 - 7668 = 27 ms), and the
          user thread finds few free buffers left, then finally none, and
          goes on to force a flush. But with the patch, the call to forceflush
          at instant 7679 must wait until the shipper thread's send is done, at
          instant 7695. Since the shipper thread has held the monitor since
          before the instant it released a free buffer (implying that the user
          thread cannot have been able to grab it yet!), by the time the user
          thread gets the monitor on forceFlushSemaphore, the sender is done,
          and a free buffer is guaranteed to have been returned to the pool.
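
          The send durations quoted above can be recomputed from the trace
          lines. The helper below is a hypothetical sketch (the class
          TraceSendDurations is not part of the traceLogShipping patch); it
          assumes each line has the form "@<millis> <event>" and pairs every
          "Sending" with the next "Sending done".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for reading the trace output; not part of any Derby patch. */
class TraceSendDurations {

    /** Returns the gap in ms between each "Sending" and the next "Sending done". */
    static List<Long> sendDurations(List<String> traceLines) {
        List<Long> durations = new ArrayList<>();
        long sendStart = -1; // -1 means no send is currently open
        for (String line : traceLines) {
            int space = line.indexOf(' ');
            if (!line.startsWith("@") || space < 0) {
                continue; // not a trace line
            }
            long instant = Long.parseLong(line.substring(1, space));
            String event = line.substring(space + 1).trim();
            if (event.equals("Sending")) {
                sendStart = instant;
            } else if (event.equals("Sending done") && sendStart >= 0) {
                durations.add(instant - sendStart);
                sendStart = -1;
            }
        }
        return durations;
    }

    /** The trace fragment recorded with the patch applied, copied from the comment above. */
    static List<String> patchedTrace() {
        return Arrays.asList(
            "@1242436667589 Sending",
            "@1242436667590 Sending done",
            "@1242436667590 ship sleep 100",
            "@1242436667667 >= FI_HIGH",
            "@1242436667668 Sending",
            "@1242436667668 >= FI_HIGH",
            "@1242436667670 >= FI_HIGH",
            "@1242436667673 >= FI_HIGH",
            "@1242436667676 >= FI_HIGH",
            "@1242436667679 log buffer full, try to force flush",
            "@1242436667679 forceflush",
            "@1242436667695 Sending done",
            "@1242436667696 Sending",
            "@1242436667696 Sending done",
            "@1242436667696 Sending");
    }
}
```

          For this fragment, sendDurations(patchedTrace()) yields 1, 27 and
          0 ms; the 27 ms entry is the send that the forceflush at instant
          7679 had to wait for. The final "Sending" has no matching "done"
          line in the fragment and is ignored.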

          I have run ReplicationRun_Local_StateTest_part1 now for 24 hours
          without seeing a problem with it (ca 350 runs). Running full
          regressions now.

          Ready for review.

          Dag H. Wanvik added a comment -

          Committed patch derby-3719-1 as svn 782991.

          Dag H. Wanvik added a comment -

          Backported this fix to 10.5 branch; committed as svn 782997.
          Resolving.

          Ole Solberg added a comment -

          Not seen since commit, so closing.


            People

            • Assignee:
              Dag H. Wanvik
              Reporter:
              Ole Solberg