Issue Details (XML | Word | Printable)

Key: DERBY-4185
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Dag H. Wanvik
Reporter: Dag H. Wanvik
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Make timeout settable or increase default for one replication message layer protocol.

Created: 24/Apr/09 04:52 PM   Updated: 28/Jul/09 09:42 PM
Component/s: Replication, Test
Affects Version/s: 10.6.0.0
Fix Version/s: 10.5.3.0, 10.6.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works derby-4185.diff 2009-04-27 10:55 PM Dag H. Wanvik 0.7 kB
Issue Links:
Reference
 

Bug behavior facts: Regression Test Failure
Resolution Date: 29/Apr/09 12:57 AM
Labels:


 Description  « Hide
When running the replication tests under machines with some load, I
experience intermittent errors (see also DERBY-4175). The symptom is
that when one is trying to fail over from the master (or shutting down
the master which has the same effect), a REPLICATION_CONNECTION_LOST
(XRE04.U.2) is seen.

The constant ReplicationMessageTransmit.DEFAULT_MESSAGE_RESPONSE_TIMEOUT is
hardwired to 5 seconds.
Cf Javadoc for ReplicationMessageTransmit.sendMessageWaitForReply:

    /**
     * Send a replication message to the slave and return the
     * message received as a response. Will only wait
     * DEFAULT_MESSAGE_RESPONSE_TIMEOUT millis for the response
     * message. If not received when the wait times out, no message is
     * returned. ..
     * :
     * @throws StandardException if the response message has not been received
     * after DEFAULT_MESSAGE_RESPONSE_TIMEOUT millis
     */
    public synchronized ReplicationMessage
        sendMessageWaitForReply(ReplicationMessage message)

If this constant is increased to 10 seconds, I do not this see this
issue any longer on my machine (a fairly new Lenovo T61p, Intel Core 2
Duo CPU T9300 @ 2.50GHz).

One concern here is the stability of the tests. Another is for user
applications. If the tests see this issue, user apps may as well. I am
unsure whether we should provide a knob to tweak this (user settable
attribute), or whether it would be sufficient to up the constant to
say, 10 or 20 seconds. I think one should be able to use Derby
replication also on machines with some load :) What detrimental
effects could increasing this constant have?

An example (See another example here:
http://issues.apache.org/jira/browse/DERBY-3417?focusedCommentId=12701731&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12701731 ):

There was 1 failure:
1) testReplication_Local_3_p4_StateNegativeTests(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p4)junit.framework.ComparisonFailure: Unexpected SQL state. expected:<XRE[20]> but was:<XRE[04]>
at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:762)
at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:797)
at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertSQLState(BaseJDBCTestCase.java:811)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1381)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver(ReplicationRun.java:1302)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p4.testReplication_Local_3_p4_StateNegativeTests(ReplicationRun_Local_3_p4.java:155)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
at junit.extensions.TestSetup.run(TestSetup.java:25)
Caused by: java.sql.SQLNonTransientConnectionException: DERBY SQL error: SQLCODE: 0, SQLSTATE: XRE04, SQLERRMC: nullXRE04.U.2
at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70)
at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:358)
at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:367)
at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:149)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.failOver_direct(ReplicationRun.java:1368)
... 26 more
Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: 0, SQLSTATE: XRE04, SQLERRMC: nullXRE04.U.2
at org.apache.derby.client.am.Connection.completeSqlca(Connection.java:2082)
at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(NetConnectionReply.java:540)
at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(NetConnectionReply.java:433)
at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(NetConnectionReply.java:297)
at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(NetConnectionReply.java:121)
at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(NetConnection.java:835)
at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(NetConnection.java:759)
at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(NetConnection.java:592)
at org.apache.derby.client.net.NetConnection.flowConnect(NetConnection.java:399)
at org.apache.derby.client.net.NetConnection.<init>(NetConnection.java:219)
at org.apache.derby.client.net.NetConnection40.<init>(NetConnection40.java:77)
at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(ClientJDBCObjectFactoryImpl40.java:269)
at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:140)
... 29 more

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Dag H. Wanvik added a comment - 24/Apr/09 05:17 PM
It seems sendMessageWaitForReply is only ever used in

 a) ReplicationMessageTransmit.brokerConnection (from javadoc: "Used to send initiator messages to the slave and receive information about the compatibility of the slave with the master)
 b) MsterController.startFailover (do fail-over from master)

So, I think it should be safe to increase the default of this timeout even to 30 seconds without incurring any penalty.
This is better than making it a knob, I think. I will try to run the entire suite with 30 seconds and see if
it has any detrimental effects.




Knut Anders Hatlen added a comment - 25/Apr/09 05:38 PM
I agree that we should try to avoid adding a knob, and I also think it sounds fine to increase the timeout in those two cases (when starting the master and when initiating fail-over, right?). +1

Dag H. Wanvik added a comment - 27/Apr/09 10:57 PM
Thanks, Knut. Uploading a patch to increase timeout to 30 seconds; it had no ill effects on the ReplicationSuite (as expected, now that we know what it's used for).

Dag H. Wanvik added a comment - 29/Apr/09 12:57 AM
Committed patch as svn 769606, resolving.

Dag H. Wanvik added a comment - 28/Jul/09 09:41 PM
Backported to 10.5 branch as svn 798715, closing.