Derby
  1. Derby
  2. DERBY-3639

Slave on Windows stops replication when network connection is broken, failover fails

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 10.4.1.3
    • Fix Version/s: None
    • Component/s: Replication
    • Environment:
      Master on Solaris 10 x86, Slave on Windows XP SP2.
      Slave VM: Sun's Java HotSpot Client VM build 1.6.0_03-b05.
    • Urgency:
      Normal
    • Issue & fix info:
      Workaround attached
    • Bug behavior facts:
      Crash

      Description

      Replication: Failover on slave fails after network connection is broken (network cable to slave pulled out); database is shut down due to an "unexpected error".
      Same experiment with a replicated embedded database on Linux (FC5) and Windows resulted in success on Linux and failure on Windows.

      Documentation (admin guide, "Replication failure handling") says:

      "Slave loses connection with master: The slave tries to reestablish the connection with the master by listening on the specified host and port. It will not give up until it is explicitly requested to do so by either the failover=true or stopSlave=true connection URL attribute. If a failover is requested, the slave applies all received log records and boots the database as described in Forcing a failover."

      Slave console:

      java -jar lib\derbyrun.jar ij

      ij version 10.4

      ij> connect 'jdbc:derby:replicDB;startSlave=true;slaveHost=0.0.0.0';

      ERROR XRE08: Replication slave mode started successfully for database 'replicDB'. Connection refused because the database is in replication slave mode

      .

      ij> connect 'jdbc:derby:replicDB';

      ERROR 08004: Connection refused to database 'replicDB' because it is in replication slave mode.

      ij> – network cable unplugged from slave

      ij> connect 'jdbc:derby:replicDB;failover=true';

      ERROR XRE11: Could not perform operation 'failover' because the database 'replicDB' has not been booted.

      ij> connect 'jdbc:derby:replicDB';

      ij>

      The slave's derby.log reported the following after the network cable was pulled and the failover command was issued:

      ---- BEGIN REPLICATION ERROR MESSAGE (4/22/08 2:01 PM) ----

      Replication slave got a fatal error for database 'replicDB'. Replication will be stopped.

      ERROR XRE03: Unexpected replication error. See derby.log for details.

      at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)

      at org.apache.derby.impl.store.replication.slave.SlaveController$SlaveLogReceiverThread.run(Unknown Source)

      Caused by: java.net.SocketException: Connection reset

      at java.net.SocketInputStream.read(Unknown Source)

      at java.net.SocketInputStream.read(Unknown Source)

      at java.io.ObjectInputStream$PeekInputStream.peek(Unknown Source)

      at java.io.ObjectInputStream$BlockDataInputStream.peek(Unknown Source)

      at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)

      at java.io.ObjectInputStream.readObject0(Unknown Source)

      at java.io.ObjectInputStream.readObject(Unknown Source)

      at org.apache.derby.impl.store.replication.net.SocketConnection.readMessage(Unknown Source)

      at org.apache.derby.impl.store.replication.net.ReplicationMessageReceive.readMessage(Unknown Source)

      [snipped further stack traces]

      -------------------- END REPLICATION ERROR MESSAGE ---------------------

      Replication slave role was stopped for database 'replicDB'.

      Replication slave role was stopped for database 'replicDB'.

      2008-04-22 12:01:42.921 GMT:
      Shutting down instance 601a400f-0119-7600-eb23-000000383460

      ----------------------------------------------------------------

      Full derby.log from the slave is attached.

      1. derby-3639-1a-handle_connection_reset.diff
        2 kB
        Kristian Waagan
      2. derby.log
        10 kB
        John H. Embretsen

        Issue Links

          Activity

          Hide
          Øystein Grøvlen added a comment -

          I think this occurs because the SlaveLogReceiverThread expect to get an EOFException on readMessage if connection to master is lost. However, on Windows it seems that a SocketException 'Connection Reset' will occur. A solution could be to handle SocketException the same way as EOFException. I guess this may also have impact on how to behave in SlaveController#handleDisconnect.

          Workaround: Do a normal boot of the (former) slave database if the failover command fails. Given that the slave failure occurred when the network connection was lost, this should give the same database state as a failover.

          Show
          Øystein Grøvlen added a comment - I think this occurs because the SlaveLogReceiverThread expect to get an EOFException on readMessage if connection to master is lost. However, on Windows it seems that a SocketException 'Connection Reset' will occur. A solution could be to handle SocketException the same way as EOFException. I guess this may also have impact on how to behave in SlaveController#handleDisconnect. Workaround: Do a normal boot of the (former) slave database if the failover command fails. Given that the slave failure occurred when the network connection was lost, this should give the same database state as a failover.
          Hide
          John H. Embretsen added a comment -

          Downgraded Priority to Minor since the workaround is pretty straight forward and seems to work fine.

          Show
          John H. Embretsen added a comment - Downgraded Priority to Minor since the workaround is pretty straight forward and seems to work fine.
          Hide
          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2.

          Show
          Knut Anders Hatlen added a comment - Triaged for 10.5.2.
          Hide
          Brett Bergquist added a comment -

          Could the difference be that on Windows, you have the option of having the hardware detect when the cable is unplugged and the network interface is completely torn down whereas on a Unix box, this might leave the network interface configured but the socket now gets the EOF after the TCP stack detects the timeout?

          Just a thought...

          Show
          Brett Bergquist added a comment - Could the difference be that on Windows, you have the option of having the hardware detect when the cable is unplugged and the network interface is completely torn down whereas on a Unix box, this might leave the network interface configured but the socket now gets the EOF after the TCP stack detects the timeout? Just a thought...
          Hide
          Kristian Waagan added a comment -

          Attaching patch 1a, which makes the slave controller interpret a SocketException with a message starting with "Connection reset" as a disconnect. When running the full regression suite I got one error in the replication tests, but running the suite manually afterwards resulted in no errors:
          testReplication_Local_StateTest_part1_2(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2) ERROR:
          java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: org.apache.derby.shared.common.sanity.AssertFailure^TASSERT FAILED connection is closed^TXJ001.U
          at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:99)
          at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:358)
          at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:166)
          at java.sql.DriverManager.getConnection(DriverManager.java:582)
          at java.sql.DriverManager.getConnection(DriverManager.java:207)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2._testPostStartedMasterAndSlave_StopMaster(ReplicationRun_Local_StateTest_part1_2.java:226)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2.testReplication_Local_StateTest_part1_2(ReplicationRun_Local_StateTest_part1_2.java:137)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at junit.framework.TestCase.runTest(TestCase.java:164)
          at junit.framework.TestCase.runBare(TestCase.java:130)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:112)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.runBare(ReplicationRun.java:207)
          at junit.framework.TestResult$1.protect(TestResult.java:106)
          at junit.framework.TestResult.runProtected(TestResult.java:124)
          at junit.framework.TestResult.run(TestResult.java:109)
          at junit.framework.TestCase.run(TestCase.java:120)
          at junit.framework.TestSuite.runTest(TestSuite.java:230)
          at junit.framework.TestSuite.run(TestSuite.java:225)
          at junit.framework.TestSuite.runTest(TestSuite.java:230)
          at junit.framework.TestSuite.run(TestSuite.java:225)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.framework.TestResult.runProtected(TestResult.java:124)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          at junit.framework.TestSuite.runTest(TestSuite.java:230)
          at junit.framework.TestSuite.run(TestSuite.java:225)
          at kah.TestRunner.main(TestRunner.java:77)
          Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: org.apache.derby.shared.common.sanity.AssertFailure^TASSERT FAILED connection is closed^TXJ001.U
          at org.apache.derby.client.am.Connection.completeSqlca(Connection.java:2125)
          at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(NetConnectionReply.java:538)
          at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(NetConnectionReply.java:431)
          at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(NetConnectionReply.java:294)
          at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(NetConnectionReply.java:121)
          at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(NetConnection.java:848)
          at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(NetConnection.java:771)
          at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(NetConnection.java:603)
          at org.apache.derby.client.net.NetConnection.flowConnect(NetConnection.java:410)
          at org.apache.derby.client.net.NetConnection.<init>(NetConnection.java:220)
          at org.apache.derby.client.net.NetConnection40.<init>(NetConnection40.java:74)
          at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(ClientJDBCObjectFactoryImpl40.java:269)
          at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:157)
          ... 27 more

          I'll run the full suite again to make sure this isn't caused by the patch. I have also manually tested the patch on Windows (with Solaris 11 Express as the master).

          Patch ready for review.

          Brett,
          I don't know the exact details here, but cursory searches indicate that this is caused by the network stack behaving differently on Windows. Most of the messages were regarding closing the connection during normal operation, where RST was sent instead of FIN, but there seems to be a similar issue when the cable is unplugged.

          A test on my UNIX laptop suggested that it too was able to detect an unplugged cable instantly, but I don't know what actually happened on the protocol level.

          Show
          Kristian Waagan added a comment - Attaching patch 1a, which makes the slave controller interpret a SocketException with a message starting with "Connection reset" as a disconnect. When running the full regression suite I got one error in the replication tests, but running the suite manually afterwards resulted in no errors: testReplication_Local_StateTest_part1_2(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2) ERROR: java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: org.apache.derby.shared.common.sanity.AssertFailure^TASSERT FAILED connection is closed^TXJ001.U at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:99) at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:358) at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:166) at java.sql.DriverManager.getConnection(DriverManager.java:582) at java.sql.DriverManager.getConnection(DriverManager.java:207) at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2._testPostStartedMasterAndSlave_StopMaster(ReplicationRun_Local_StateTest_part1_2.java:226) at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2.testReplication_Local_StateTest_part1_2(ReplicationRun_Local_StateTest_part1_2.java:137) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:164) at junit.framework.TestCase.runBare(TestCase.java:130) at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:112) at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.runBare(ReplicationRun.java:207) at junit.framework.TestResult$1.protect(TestResult.java:106) at junit.framework.TestResult.runProtected(TestResult.java:124) at junit.framework.TestResult.run(TestResult.java:109) at junit.framework.TestCase.run(TestCase.java:120) at junit.framework.TestSuite.runTest(TestSuite.java:230) at junit.framework.TestSuite.run(TestSuite.java:225) at junit.framework.TestSuite.runTest(TestSuite.java:230) at junit.framework.TestSuite.run(TestSuite.java:225) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.framework.TestResult.runProtected(TestResult.java:124) at junit.extensions.TestSetup.run(TestSetup.java:25) at junit.framework.TestSuite.runTest(TestSuite.java:230) at junit.framework.TestSuite.run(TestSuite.java:225) at kah.TestRunner.main(TestRunner.java:77) Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: org.apache.derby.shared.common.sanity.AssertFailure^TASSERT FAILED connection is closed^TXJ001.U at org.apache.derby.client.am.Connection.completeSqlca(Connection.java:2125) at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(NetConnectionReply.java:538) at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(NetConnectionReply.java:431) at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(NetConnectionReply.java:294) at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(NetConnectionReply.java:121) at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(NetConnection.java:848) at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(NetConnection.java:771) at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(NetConnection.java:603) at org.apache.derby.client.net.NetConnection.flowConnect(NetConnection.java:410) at org.apache.derby.client.net.NetConnection.<init>(NetConnection.java:220) at org.apache.derby.client.net.NetConnection40.<init>(NetConnection40.java:74) at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(ClientJDBCObjectFactoryImpl40.java:269) at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:157) ... 27 more I'll run the full suite again to make sure this isn't caused by the patch. I have also manually tested the patch on Windows (with Solaris 11 Express as the master). Patch ready for review. Brett, I don't know the exact details here, but cursory searches indicate that this is caused by the network stack behaving differently on Windows. Most of the messages were regarding closing the connection during normal operation, where RST was sent instead of FIN, but there seems to be a similar issue when the cable is unplugged. A test on my UNIX laptop suggested that it too was able to detect an unplugged cable instantly, but I don't know what actually happened on the protocol level.
          Hide
          Knut Anders Hatlen added a comment -

          How reliable is it to check the message text of an exception? I imagine the message text could be different in different locales, or in different versions/implementations of the JVM. Could we perhaps handle all SocketExceptions (or possibly all IOExceptions, so that the current special case for EOFException can be removed) as disconnect initially? If the error actually is a fatal one that should make replication stop, I think it will be detected by handleDisconnect() (its call to setupConnection() will fail, and the error handler will call handleFatalException()).

          Show
          Knut Anders Hatlen added a comment - How reliable is it to check the message text of an exception? I imagine the message text could be different in different locales, or in different versions/implementations of the JVM. Could we perhaps handle all SocketExceptions (or possibly all IOExceptions, so that the current special case for EOFException can be removed) as disconnect initially? If the error actually is a fatal one that should make replication stop, I think it will be detected by handleDisconnect() (its call to setupConnection() will fail, and the error handler will call handleFatalException()).
          Hide
          Kristian Waagan added a comment -

          Clearing patch available, a new patch is required.

          Show
          Kristian Waagan added a comment - Clearing patch available, a new patch is required.

            People

            • Assignee:
              Unassigned
              Reporter:
              John H. Embretsen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development