DERBY-4186

After master stop, test fails when it succeeds in connecting (rebooting) shut-down ex-slave

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.6.1.0
    • Fix Version/s: 10.5.3.0, 10.6.1.0
    • Component/s: Replication, Test
    • Labels:
      None
    • Bug behavior facts:
      Regression Test Failure

      Description

      Occasionally I see this error in ReplicationRun_Local_3_p3:

      1) testReplication_Local_3_p3_StateNegativeTests(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3)junit.framework.AssertionFailedError: Expected SQLState'08004', but got connection!
      at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.waitForSQLState(ReplicationRun.java:332)
      at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3.testReplication_Local_3_p3_StateNegativeTests(ReplicationRun_Local_3_p3.java:170)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.extensions.TestSetup.run(TestSetup.java:25)

      In the code, after stopMaster is given to the master (which should lead to fail-over),
      the test expects to see CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE (08004.C.7), which will only succeed if
      the test gets to try to connect before the failover has started. This seems wrong. If the failover has completed, it should expect a successful
      connect (which boots the database, btw, since it is shut down after a successful failover).

      Quote from code:

      waitForSQLState("08004", 100L, 20, // 08004.C.7 - CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE
      slaveDatabasePath + FS + slaveDbSubPath + FS + replicatedDb,
      slaveServerHost, slaveServerPort); // _failOver above fails...

      I think there is a race here between the failover on the slave and the test.
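
      As an illustration only (not the actual ReplicationRun API), here is a minimal sketch of a
      retry helper that treats SQLState 08004 (CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE) as a transient
      state and keeps retrying until the connect succeeds; the class, method, and parameter names
      are hypothetical:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.SQLException;

      // Hypothetical helper, not part of the Derby test framework.
      final class SlaveConnectHelper {
          static Connection connectToleratingSlaveMode(String url, int attempts, long sleepMs)
                  throws SQLException, InterruptedException {
              for (int i = 1; ; i++) {
                  try {
                      // Succeeds once the slave db has shut down and can be (re)booted.
                      return DriverManager.getConnection(url);
                  } catch (SQLException e) {
                      if (!"08004".equals(e.getSQLState()) || i >= attempts) {
                          throw e; // a different failure, or we ran out of attempts
                      }
                      Thread.sleep(sleepMs); // still in slave mode: wait and retry
                  }
              }
          }
      }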

      1. derby-4186-followup.stat
        0.1 kB
        Dag H. Wanvik
      2. derby-4186-followup.diff
        0.7 kB
        Dag H. Wanvik
      3. derby-4186-2.stat
        0.2 kB
        Dag H. Wanvik
      4. derby-4186-2.diff
        8 kB
        Dag H. Wanvik
      5. derby-4186.stat
        0.2 kB
        Dag H. Wanvik
      6. derby-4186.diff
        8 kB
        Dag H. Wanvik
      7. ok-slave.txt
        6 kB
        Dag H. Wanvik
      8. bad-slave.txt
        50 kB
        Dag H. Wanvik


          Activity

          Kathey Marsden added a comment -

          Dag said:
          >Occasionally I see this error in ReplicationRun_Local_3_p3:
          FYI: This is also filed as DERBY-3915.

          Dag H. Wanvik added a comment - - edited

          My initial analysis was not entirely correct. Looking at the log file, I see that the
          setting up of the master never succeeded in the cases where we see 08004.C.7.
          This in turn led the stopMaster to fail (there is no master yet!), but the operation does not throw, because of this
          piece of code in MasterController.tearDownNetwork, called from MasterController.stopMaster:

          try {
              ReplicationMessage mesg =
                  new ReplicationMessage(ReplicationMessage.TYPE_STOP, null);
              transmitter.sendMessage(mesg);
          } catch (IOException ioe) {} // <************ java.net.ConnectException: Connection refused

          try {
              transmitter.tearDown();
          } catch (IOException ioe) {}

          The end result of this is that the slave is still listening when the test comes around to calling waitForSQLState (see the
          issue description), so we naturally get 08004.C.7 CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE.
          But the test is also wrong; it should expect success here.

          Now the next question is, why does the test think starting the master worked? It calls the method ReplicationRun.startMaster to
          achieve this.

          [2009.04.27 comment added: this turned out to be a red herring, see below.]

          Dag H. Wanvik added a comment -

          Thanks for noticing the duplicate, Kathey. I closed DERBY-3915 as dup, since I already had started making comments on this issue.

          Ole Solberg added a comment - - edited

          I had a look at this test again, and see that the comment on the stop master command is missing a NOT!
          This of course creates confusion in understanding how the test should behave.

          assertException(
              _stopMaster(masterServerHost, masterServerPort,
                  masterDatabasePath + FS + masterDbSubPath + FS + replicatedDb),
              null); // Implies failover. // OK to continue.

          The correct behavior is that stop master does NOT imply a failover.
          But the stop master implies a stop slave to the slave,
          which then should allow connections to both the master and the slave db!

          The test should therefore be:
          1. stop master should succeed. - OK
          2. wait for connection on the master. - OK
          3. wait for connection on the slave. - FAILS
          This currently fails with CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE
          which is incorrect!
          This is possibly the same error as DERBY-3617 - Failover on slave hangs
          after stopmaster on master.
          4. failover on master should fail with REPLICATION_NOT_IN_MASTER_MODE - OK
          5. wait for connection on the slave. (Currently fails as in 3!) - FAILS

          From the spec:

          https://issues.apache.org/jira/secure/attachment/12377517/replication_funcspec_v10.html

          --------------------------------------------------------------------------------
          Stop master:
          =========
          Sends a stop slave message to the slave if the network connection is working.
          Then shuts down all replication related functionality without shutting down the database.

          Stop slave:
          ========
          All transaction log that has been received from the master is written to disk.
          The slave replication functionality and the database is then shut down.
          --------------------------------------------------------------------------------

          After stop master it should be possible to connect to the slave db!
          Currently we get CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE!
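
          To make the expected sequence above concrete, here is a minimal, hypothetical JDBC
          sketch (it is not the ReplicationRun test code; masterUrl and slaveUrl are placeholder
          connection URLs, and the stopMaster/failover attributes are used as described in the
          docs and spec):

          import java.sql.DriverManager;
          import java.sql.SQLException;

          final class StopMasterExpectationSketch {
              static void check(String masterUrl, String slaveUrl) throws SQLException {
                  // 1. stop master should succeed (no exception)
                  DriverManager.getConnection(masterUrl + ";stopMaster=true").close();
                  // 2. the master db should be connectable
                  DriverManager.getConnection(masterUrl).close();
                  // 3. the slave db should also be connectable
                  //    (this is what currently fails with CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE)
                  DriverManager.getConnection(slaveUrl).close();
                  // 4. failover should now fail, since replication is no longer running
                  try {
                      DriverManager.getConnection(masterUrl + ";failover=true").close();
                  } catch (SQLException e) {
                      // expected: REPLICATION_NOT_IN_MASTER_MODE
                  }
                  // 5. the slave db should still be connectable
                  DriverManager.getConnection(slaveUrl).close();
              }
          }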

          Dag H. Wanvik added a comment -

          Thanks for clarifying the expected behavior here, Ole. It wasn't clear to me what should happen on the
          slave when the master is stopped.

          I quote from docs: "To stop replication of a database, connect to the master database using the stopMaster=true connection URL attribute. The master sends the remaining log records that await shipment, and then sends a stop replication command to the slave. The slave then writes all logs to disk and shuts down the database."

          This seems to conflict with the spec: "without shutting down the database". But the code seems to try to shut down the db.
          So I think the docs are correct and the spec is out of date on this point..?

          I have investigated a bit, and it seems that what happens when the connection is denied on the slave is this:

          • The slave receives the stop message.
          • It stops reading the log and breaks off recovery.
          • The boot halts, but the db doesn't really shut down completely.

          I haven't been able to determine why it doesn't shut down fully yet.

          Dag H. Wanvik added a comment -

          Further analysis shows that the swallowed IOException I referred to
          when trying to send a stop message to the slave is just part of the
          generic MasterController.tearDownNetwork handling when the initial establishment of the master fails.
          The test correctly loops to wait for successful startup of the master, and there will be
          some failed attempts (waiting for the slave to be ready) which visit that failure code path.
          It is not a problem though, just (another) red herring.

          I enclose a patch proposal, which addresses the real issue: the slave does not shut down when it should.
          The scenario is that the slave receives a stop replication message before it has had time to complete
          the slave boot (a race condition); see attachment bad-slave.txt. If I make the test client
          wait instead of proceeding to stop the master, the slave log looks like the one in ok-slave.txt (attached).

          It would be nice if any of the original authors had a look at this patch as I am not familiar with this code.

          The patch also modifies the test client to loop until success, accepting intermediate state
          CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE.

          Running regressions.

          Dag H. Wanvik added a comment -

          Respin of this patch, #2, with more comments. I also talked to the author of this code off-line, Jørgen Løland, and he agreed with my analysis. The new patch moves the check for the lost exception to inside the method SlaveDataBase.verifySuccessfulBoot.
          Added more explanations in the comments.

          Dag H. Wanvik added a comment -

          Committed derby-4186-2 as svn 769962. Quote from the commit log:

          a) fixing a bug in SlaveDataBase, which loses an exception iff a
          command to stop replication arrives from the master to the slave
          before the initial connection (successful slave started) command
          returns. This is a corner case race condition, and not very likely to
          occur in practice, since it makes little sense to stop replication
          immediately after starting it.

          b) fixing a bug in the test, which switched on its head what is really
          expected behavior, and also adds a waiting loop to allow intermediate
          state on slave after the master is stopped. Also fixed some erroneous
          comments and removed some cruft.

          Dag H. Wanvik added a comment -

          Repost answer from Jørgen (Thu, 30 Apr 2009 08:30:05 +0200) here:

          Dag,

          Thanks for analyzing and fixing this strange issue! Stopping
          replication before the startSlave command had completed was never on
          my mind :-/

          I had a look at your patch though, and I think you can fix this bug
          with even less code.

          From SlaveDatabase.java:86:
          /** Set by the database boot thread if it fails before slave mode
           *  has been started properly (i.e., if inBoot is true). This
           *  exception will then be reported to the client connection. */
          private volatile StandardException bootException;

          bootException is only set in one place -
          SlaveDatabase#handleShutdown. There you'll also see the reason for the
          limbo state that made the tests fail: if an exception makes the slave
          replication code call handleShutdown while booting is in progress, the
          database is supposed to be shutdown by the client thread when it
          receives an exception from SlaveDatabase.boot().

          As you already found out, that didn't happen because the bootException
          was set during the 500 millis waiting in verifySuccessfulBoot. However,
          this should apply to any exception in bootException, not only
          DATABASE_SEVERITY ones (although I think only DB severity exceptions
          will be reported here).

          I would go with the same code that is inside the while. Thus, instead of

          + if (bootException != null &&
          +     SQLState.SHUTDOWN_DATABASE.startsWith(
          +         bootException.getSQLState()) &&
          +     bootException.getSeverity() ==
          +         ExceptionSeverity.DATABASE_SEVERITY) {

          use

          + if (bootException != null)
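
          For context, a self-contained sketch of the idea of surfacing bootException from the
          wait loop; the class and field names here are illustrative, not the real SlaveDatabase
          code (which uses Derby's StandardException and its own state flags):

          // Hypothetical sketch of a verifySuccessfulBoot-style wait loop.
          final class SlaveBootSketch {
              private volatile Exception bootException; // set by the boot thread on failure
              private volatile boolean inSlaveMode;     // set by the boot thread on success

              void verifySuccessfulBoot() throws Exception {
                  while (!inSlaveMode) {
                      if (bootException != null) {
                          // The boot thread failed, e.g. a stop-replication command arrived
                          // before the slave boot completed; surface the failure so the caller
                          // can shut the database down instead of leaving it half booted.
                          throw bootException;
                      }
                      Thread.sleep(500); // matches the 500 ms wait mentioned above
                  }
              }
          }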

          Dag H. Wanvik added a comment -

          Thanks for your review, Jørgen!
          Your suggestion is helpful; I added the extra predicates
          since I didn't know all the code paths through this code. The simplification is way better; I'll
          make that change once I have run the regressions again.

          Dag H. Wanvik added a comment -

          Uploading derby-4186-followup, to simplify as Jørgen suggests.

          Dag H. Wanvik added a comment -

          Committed derby-4186-followup as svn 772337.

          Mamta A. Satoor added a comment -

          I saw the following failure on 10.5.2 (revision 794445) with ibm16 on a Windows machine:
          1) testReplication_Local_3_p3_StateNegativeTests(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3)junit.framework.AssertionFailedError: Expected SQLState'08004', but got connection!
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.waitForSQLState(ReplicationRun.java:333)
          at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3.testReplication_Local_3_p3_StateNegativeTests(ReplicationRun_Local_3_p3.java:170)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:106)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
          at junit.extensions.TestSetup.run(TestSetup.java:23)

          Dag H. Wanvik added a comment -

          Reopening; Mamta's find seems to indicate the fix isn't complete enough.

          Dag H. Wanvik added a comment -

          The fix was not ported to 10.5 (Mamta saw it on 10.5.2), but it is a very unlikely scenario in production,
          so unless somebody objects, I will close it again. Mamta, would you like to see this ported to the 10.5 branch?

          Mamta A. Satoor added a comment -

          Dag, I didn't follow the fix that went in for this jira, so if it is not a high-risk fix, maybe we should backport it to 10.5. Since you are familiar with the code change that went in, I will be fine with your judgement call on backporting.

          Dag H. Wanvik added a comment -

          Backported the two fixes to 10.5 branch as svn 798710:

          svn merge -c 769962 https://svn.eu.apache.org/repos/asf/db/derby/code/trunk
          svn merge -c 772337 https://svn.eu.apache.org/repos/asf/db/derby/code/trunk

          Closing.


            People

            • Assignee:
              Dag H. Wanvik
            • Reporter:
              Dag H. Wanvik