Issue Details (XML | Word | Printable)

Key: DERBY-4175
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Dag H. Wanvik
Reporter: Dag H. Wanvik
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Instability in some replication tests under load, since tests don't wait long enough for final state or anticipate intermediate states

Created: 21/Apr/09 11:57 PM   Updated: 28/Jul/09 08:31 PM
Component/s: Replication, Test
Affects Version/s: 10.6.0.0
Fix Version/s: 10.5.3.0, 10.6.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works derby-4175-2.diff 2009-04-23 03:29 AM Dag H. Wanvik 6 kB
File Licensed for inclusion in ASF works derby-4175-3.diff 2009-04-24 04:02 PM Dag H. Wanvik 12 kB
File Licensed for inclusion in ASF works derby-4175-3.stat 2009-04-24 04:02 PM Dag H. Wanvik 0.4 kB
File Licensed for inclusion in ASF works derby-4175.diff 2009-04-22 12:15 AM Dag H. Wanvik 3 kB
File Licensed for inclusion in ASF works derby-4175.stat 2009-04-22 12:15 AM Dag H. Wanvik 0.1 kB
Environment: Solaris 2008.11. snv_111 (x86) on trunk.
Issue Links:
Reference
 

Bug behavior facts: Regression Test Failure
Resolution Date: 29/Apr/09 12:47 AM
Labels:


 Description  « Hide
The test expects REPLICATION_DB_NOT_BOOTED (XRE11), but sees
REPLICATION_SLAVE_SHUTDOWN_OK (XRE 42):

1) testReplication_Local_StateTest_part1_1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1)junit.framework.AssertionFailedError: jdbc:derby://localhost:4527//export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat;stopSlave=true failed: -1 XRE42 DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE42, SQLERRMC: /export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat^TXRE42
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1._testPostStartedMasterAndSlave_StopSlave(ReplicationRun_Local_StateTest_part1_1.java:226)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1.testReplication_Local_StateTest_part1_1(ReplicationRun_Local_StateTest_part1_1.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
at junit.extensions.TestSetup.run(TestSetup.java:25)


I think this is a race condition: when the slave receives a message to
shut down (this is what happens here when the server is master's
server is shut down) it takes some time for this to happen, and in the
meantime a stopSlave on the slave will get
REPLICATION_SLAVE_SHUTDOWN_OK.

In the code, there is a sleep just ahead of the failing stopSlave to
avoid this scenario:

        // Take down master - slave connection:
        killMaster(masterServerHost, masterServerPort);
        Thread.sleep(5000L); // TEMPORARY to see if slave sees that master is gone!

and I guess on my laptop, the 5 seconds was not enough. I think it
would be better to accept both states here as acceptable, than make
the test brittle. If this is a bug - that we sometimes see
REPLICATION_SLAVE_SHUTDOWN_OK - and it may well be, since ahead of the
master stop, we would see SLAVE_OPERATION_DENIED_WHILE_CONNECTED
(XRE41), I think - then this should be logged as a separate issue.

In contrast, I think that if connection to the master is *lost*, a
stopSlave on slave would see REPLICATION_SLAVE_SHUTDOWN_OK as the
normal response.

(edit 2009.04.24 by dagw): It turns out that a similar problem is present in
ReplicationRun_Local_StateTest_part1_2, but there the symptom is that
a connect succeeds, but is expected to fail. It does should fail
eventually if given enough time . The error expected is 08004.C.7 in
this case (CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE).'


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Dag H. Wanvik added a comment - 22/Apr/09 12:15 AM
Patch which solves the issue. The Thread.sleep is still there, meaning that in most cases, the XRE11 would be seen, which seems the right. I feel this is a bit of a kludge, so please chime in on who we should solve this brittleness..

Dag H. Wanvik made changes - 22/Apr/09 12:15 AM
Field Original Value New Value
Attachment derby-4175.diff [ 12406076 ]
Attachment derby-4175.stat [ 12406077 ]
Dag H. Wanvik added a comment - 23/Apr/09 03:29 AM
Uploading a second version of this patch, we shows that there are actually
two intermediate states we could encounter before we reach the end state in step 3. I don't this this behavior is well documented if at all. The state labelled b) in the code comment is a bit murky..

The new code uses a loop to wait until the final expected end state is reached.
Uncomment the printlns inside the loop to see what happens.

This patch should make this instability go away, running regressions.

Dag H. Wanvik made changes - 23/Apr/09 03:29 AM
Attachment derby-4175-2.diff [ 12406199 ]
Dag H. Wanvik made changes - 23/Apr/09 03:29 AM
Assignee Dag H. Wanvik [ dagw ]
Dag H. Wanvik made changes - 23/Apr/09 03:30 AM
Derby Info [Patch Available]
Knut Anders Hatlen added a comment - 24/Apr/09 02:23 PM
The approach with a loop looks fine to me. One small issue:

If the SQLState is REPLICATION_DB_NOT_BOOTED, gotEx will be null, so this code is unnecessarily complex and somewhat misleading:

+ if (gotEx != null) {
+ // always fails
+ BaseJDBCTestCase.assertSQLState(
+ connectionURL + " failed: ",
+ REPLICATION_DB_NOT_BOOTED,
+ gotEx);
         }

Perhaps it should be replaced with
    if (gotEx != null) throw gotEx;
?

Dag H. Wanvik made changes - 24/Apr/09 03:43 PM
Summary See XRE42 in ReplicationRun_Local_StateTest_part1_1._testPostStartedMasterAndSlave_StopSlave, XRE11 expected Instability in some replication tests under load, since tests don't wait long enough for finals state or anticipate intermediate states
Dag H. Wanvik made changes - 24/Apr/09 03:43 PM
Summary Instability in some replication tests under load, since tests don't wait long enough for finals state or anticipate intermediate states Instability in some replication tests under load, since tests don't wait long enough for final state or anticipate intermediate states
Dag H. Wanvik added a comment - 24/Apr/09 04:02 PM - edited
Thanks for looking at this Knut; I agree, I had done the same
simplification in another version on the waiting loop logic (new in
spin #3 of this patch) when I saw your comment :)

Uploading version # of this patch, which adds another instance of a a
missing wait; which made the test
ReplicationRun_Local_StateTest_part1_2 fail if the machine was loaded.

The patch also increases wait from 0.1s to 1 seconds in two instances,
which did have waiting loops already, but waitined no longer than 2
seconds in all; which was too little under load on my box.

I generally make the waiting loops wait do 20 attempts, waiting 1 second
between each attempt now. That should be sufficient even for a
reasonably loaded machine :)

I also replaced the commented out "System.err.println"s with the debug
util call that's used in these tests to keep things uniform, although
such calls should probably in time be replaced by the normal
BaseTestCase.println (outside scope here).

Dag H. Wanvik made changes - 24/Apr/09 04:02 PM
Attachment derby-4175-3.diff [ 12406372 ]
Attachment derby-4175-3.stat [ 12406373 ]
Dag H. Wanvik made changes - 24/Apr/09 04:09 PM
Description The test expects REPLICATION_DB_NOT_BOOTED (XRE11), but sees
REPLICATION_SLAVE_SHUTDOWN_OK (XRE 42):

1) testReplication_Local_StateTest_part1_1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1)junit.framework.AssertionFailedError: jdbc:derby://localhost:4527//export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat;stopSlave=true failed: -1 XRE42 DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE42, SQLERRMC: /export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat^TXRE42
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1._testPostStartedMasterAndSlave_StopSlave(ReplicationRun_Local_StateTest_part1_1.java:226)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1.testReplication_Local_StateTest_part1_1(ReplicationRun_Local_StateTest_part1_1.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
at junit.extensions.TestSetup.run(TestSetup.java:25)


I think this is a race condition: when the slave receives a message to
shut down (this is what happens here when the server is master's
server is shut down) it takes some time for this to happen, and in the
meantime a stopSlave on the slave will get
REPLICATION_SLAVE_SHUTDOWN_OK.

In the code, there is a sleep just ahead of the failing stopSlave to
avoid this scenario:

        // Take down master - slave connection:
        killMaster(masterServerHost, masterServerPort);
        Thread.sleep(5000L); // TEMPORARY to see if slave sees that master is gone!

and I guess on my laptop, the 5 seconds was not enough. I think it
would be better to accept both states here as acceptable, than make
the test brittle. If this is a bug - that we sometimes see
REPLICATION_SLAVE_SHUTDOWN_OK - and it may well be, since ahead of the
master stop, we would see SLAVE_OPERATION_DENIED_WHILE_CONNECTED
(XRE41), I think - then this should be logged as a separate issue.

In contrast, I think that if connection to the master is *lost*, a
stopSlave on slave would see REPLICATION_SLAVE_SHUTDOWN_OK as the
normal response.
The test expects REPLICATION_DB_NOT_BOOTED (XRE11), but sees
REPLICATION_SLAVE_SHUTDOWN_OK (XRE 42):

1) testReplication_Local_StateTest_part1_1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1)junit.framework.AssertionFailedError: jdbc:derby://localhost:4527//export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat;stopSlave=true failed: -1 XRE42 DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE42, SQLERRMC: /export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat^TXRE42
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1._testPostStartedMasterAndSlave_StopSlave(ReplicationRun_Local_StateTest_part1_1.java:226)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1.testReplication_Local_StateTest_part1_1(ReplicationRun_Local_StateTest_part1_1.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
at junit.extensions.TestSetup.run(TestSetup.java:25)


I think this is a race condition: when the slave receives a message to
shut down (this is what happens here when the server is master's
server is shut down) it takes some time for this to happen, and in the
meantime a stopSlave on the slave will get
REPLICATION_SLAVE_SHUTDOWN_OK.

In the code, there is a sleep just ahead of the failing stopSlave to
avoid this scenario:

        // Take down master - slave connection:
        killMaster(masterServerHost, masterServerPort);
        Thread.sleep(5000L); // TEMPORARY to see if slave sees that master is gone!

and I guess on my laptop, the 5 seconds was not enough. I think it
would be better to accept both states here as acceptable, than make
the test brittle. If this is a bug - that we sometimes see
REPLICATION_SLAVE_SHUTDOWN_OK - and it may well be, since ahead of the
master stop, we would see SLAVE_OPERATION_DENIED_WHILE_CONNECTED
(XRE41), I think - then this should be logged as a separate issue.

In contrast, I think that if connection to the master is *lost*, a
stopSlave on slave would see REPLICATION_SLAVE_SHUTDOWN_OK as the
normal response.

(edit 2009.04.24 by dagw): It turns out that a similar problem is present in
ReplicationRun_Local_StateTest_part1_2, but there the symptom is that
a connect succeeds, but is expected to fail. It does should fail
eventually if given enough time . The error expected is 08004.C.7 in
this case (CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE).'
Dag H. Wanvik made changes - 24/Apr/09 06:47 PM
Derby Categories [Regression Test Failure]
Affects Version/s 10.6.0.0 [ 12313727 ]
Knut Anders Hatlen added a comment - 25/Apr/09 05:11 PM
#3 looks fine to me. +1 to commit.

Dag H. Wanvik made changes - 27/Apr/09 11:18 PM
Component/s Test [ 11413 ]
Component/s Regression Test Failure [ 12310664 ]
Component/s Replication [ 12312050 ]
Dag H. Wanvik made changes - 27/Apr/09 11:30 PM
Component/s Replication [ 12312050 ]
Repository Revision Date User Message
ASF #769602 Wed Apr 29 00:44:47 UTC 2009 dag DERBY-4175 Instability in some replication tests under load, since tests don't wait long enough for final state or anticipate intermediate states

Patch DERBY-4175-3 (+ resolved some conflicts arising from commit of DERBY-3417).

It makes three replication tests less sensitive to load by making
them accept intermediate states without failing or wait for longer
before giving up on seeing the final end state of a replication state
change.
Files Changed
MODIFY /db/derby/code/trunk/java/testing/org/apache/derbyTesting/functionTests/tests/replicationTests/ReplicationRun_Local_StateTest_part1_2.java
MODIFY /db/derby/code/trunk/java/testing/org/apache/derbyTesting/functionTests/tests/replicationTests/ReplicationRun_Local_StateTest_part1_1.java
MODIFY /db/derby/code/trunk/java/testing/org/apache/derbyTesting/functionTests/tests/replicationTests/ReplicationRun_Local_3_p3.java

Dag H. Wanvik added a comment - 29/Apr/09 12:45 AM
Committed #3 as svn 769602, resolving. Not closing yet, may want to backport.

Dag H. Wanvik made changes - 29/Apr/09 12:47 AM
Status Open [ 1 ] Resolved [ 5 ]
Derby Info [Patch Available]
Resolution Fixed [ 1 ]
Dag H. Wanvik made changes - 14/May/09 11:54 PM
Link This issue relates to DERBY-3632 [ DERBY-3632 ]
Dag H. Wanvik made changes - 11/Jun/09 12:35 PM
Link This issue relates to DERBY-4246 [ DERBY-4246 ]
Repository Revision Date User Message
ASF #798700 Tue Jul 28 20:27:54 UTC 2009 dag DERBY-4175 Instability in some replication tests under load, since tests don't wait long enough for final state or anticipate intermediate states

Backported fix to 10.5 branch as

svn merge -c 769602 https://svn.eu.apache.org/repos/asf/db/derby/code/trunk
Files Changed
MODIFY /db/derby/code/branches/10.5/java/testing/org/apache/derbyTesting/functionTests/tests/replicationTests/ReplicationRun_Local_StateTest_part1_2.java
MODIFY /db/derby/code/branches/10.5/java/testing/org/apache/derbyTesting/functionTests/tests/replicationTests/ReplicationRun_Local_StateTest_part1_1.java
MODIFY /db/derby/code/branches/10.5/java/testing/org/apache/derbyTesting/functionTests/tests/replicationTests/ReplicationRun_Local_3_p3.java
MODIFY /db/derby/code/branches/10.5

Dag H. Wanvik added a comment - 28/Jul/09 08:30 PM
Backported to 10.5 branch as svn 798700, closing.

Dag H. Wanvik made changes - 28/Jul/09 08:30 PM
Status Resolved [ 5 ] Closed [ 6 ]
Dag H. Wanvik made changes - 28/Jul/09 08:31 PM
Fix Version/s 10.5.2.1 [ 12314117 ]
Fix Version/s 10.6.0.0 [ 12313727 ]