Derby / DERBY-4269

Failover did not succeed in 2 min.: testReplication_Local_3_p6_autocommit_OK

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.4.1.3, 10.4.2.0, 10.5.1.1, 10.5.2.0, 10.5.3.0, 10.6.1.0, 10.6.2.1, 10.7.1.1, 10.8.1.2, 10.8.2.2
    • Fix Version/s: 10.7.1.4
    • Component/s: Replication
    • Urgency: Normal
    • Bug behavior facts: Crash, Regression Test Failure

      Description

      Failover did not succeed.
      2) testReplication_Local_3_p6_autocommit_OK(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p6)junit.framework.AssertionFailedError: Failover did not succeed.
      at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.connectPing(ReplicationRun.java:270)
      at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p6.derby_3896(ReplicationRun_Local_3_p6.java:200)
      at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p6.testReplication_Local_3_p6_autocommit_OK(ReplicationRun_Local_3_p6.java:86)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:106)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.extensions.TestSetup.run(TestSetup.java:25)

      See http://dbtg.thresher.com/derby/test/Daily/jvm1.4/testing/testlog/vista-64/782274-suitesAll_diff.txt

      Attachments

      1. keys-only.diff (1 kB, Knut Anders Hatlen)
      2. fix-ignored-runtime-exceptions-instr.diff (1.0 kB, Dag H. Wanvik)
      3. fix-ignored-runtime-exceptions-b.diff (2 kB, Dag H. Wanvik)
      4. fix-ignored-runtime-exceptions-a.diff (3 kB, Dag H. Wanvik)
      5. derby-4269-typo.diff (0.6 kB, Dag H. Wanvik)
      6. derby-4269-explicit-synch-3.status (0.8 kB, Dag H. Wanvik)
      7. derby-4269-explicit-synch-3.diff (9 kB, Dag H. Wanvik)
      8. derby-4269-explicit-synch-2.diff (1 kB, Dag H. Wanvik)
      9. derby-4269-explicit-synch.diff (2 kB, Dag H. Wanvik)
      10. DERBY-4269b.diff (5 kB, Dag H. Wanvik)
      11. DERBY-4269.stat (0.1 kB, Dag H. Wanvik)
      12. DERBY-4269.diff (1.0 kB, Dag H. Wanvik)
      13. db_slave-derby.log.gz (19 kB, Dag H. Wanvik)
      14. db_master-derby.log (97 kB, Dag H. Wanvik)
      15. 4269-slave-jstack-before-failover.txt (8 kB, Dag H. Wanvik)
      16. 4269-slave.txt (6 kB, Dag H. Wanvik)
      17. 4269-master.txt (5 kB, Dag H. Wanvik)
      18. 4269-client-jstack.txt (7 kB, Dag H. Wanvik)

          Activity

          Dag H. Wanvik added a comment -

          Triaged for 10.5.2, setting "normal" urgency.

          Myrna van Lunteren added a comment -

          I saw this in nightly testing, with the trunk tree (10.9.0.0 alpha) synced to revision 1167430, on Windows XP with ibm 1.6 (sr9fp1).

          Dag H. Wanvik added a comment -

          Saw this again: http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.5/testing/testlog/lin/1343305-suitesAll_diff.txt

          A bit more of the stack shows the error cause is
          DERBY SQL error: SQLCODE: -1, SQLSTATE: 08004, SQLERRMC: Connection refused to database '/export/home/tmp/jagtmp/autoderbyN_derbyall/suitesAll_1/log/db_slave/wombat' because it is in replication slave mode.

          Looking in the master's derby.log I see these two lines:

          Tue May 29 04:40:11 CEST 2012 Thread[DRDAConnThread_3,5,main] Cleanup action starting
          ERROR XRE20: Failover performed successfully for database '/export/home/tmp/jagtmp/autoderbyN_derbyall/suitesAll_1/log/db_master/wombat', the database has been shutdown.

          Note the time: 04:40:11. In the client's derby.log I see:

          Tue May 29 04:40:12 CEST 2012 Thread[DRDAConnThread_3,5,main] (DATABASE = /export/home/tmp/jagtmp/autoderbyN_derbyall/suitesAll_1/log/db_slave/wombat), (DRDAID = {5}), Connection refused to database '/export/home/tmp/jagtmp/autoderbyN_derbyall/suitesAll_1/log/db_slave/wombat' because it is in replication slave mode.
          Tue May 29 04:40:12 CEST 2012 Thread[DRDAConnThread_3,5,main] Cleanup action starting
          java.sql.SQLException: Connection refused to database '/export/home/tmp/jagtmp/autoderbyN_derbyall/suitesAll_1/log/db_slave/wombat' because it is in replication slave mode.

          So I wonder if this can be a simple race. In the test, the call to connectPing happens directly after the call to failOver, and I cannot see any provision for waiting and/or retrying until it succeeds.

          Dag H. Wanvik added a comment -

          Attaching a simple patch which inserts a call to waitForConnect between the failOver and the connectPing.

          This should make sure the slave has enough time to fail over before the call to connectPing is made, if that is indeed the issue here. I see this pattern has been used elsewhere, e.g. in ReplicationRun_Local_showStateChange#testReplication_Local_showStateChange_showReplState.

          The test ReplicationRun_Local_3_p6 runs OK with the patch.
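
          For illustration, the retry idiom behind waitForConnect is roughly the following (a minimal self-contained sketch; the real helper in ReplicationRun takes host/port arguments, and the URL, retry count, and names below are illustrative):

              import java.sql.Connection;
              import java.sql.DriverManager;
              import java.sql.SQLException;

              public class WaitForConnectSketch {
                  // Retry until the server accepts a connection or we give up.
                  static void waitForConnect(String url, int retries) throws Exception {
                      for (int i = 0; i < retries; i++) {
                          try {
                              Connection c = DriverManager.getConnection(url);
                              c.close();
                              return; // the failed-over database now accepts connections
                          } catch (SQLException e) {
                              Thread.sleep(1000); // not ready yet; back off and retry
                          }
                      }
                      throw new SQLException("no connection after " + retries + " attempts");
                  }
              }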

          Kristian Waagan added a comment -

          Hi Dag,

          Is the patch itself missing?

          Dag H. Wanvik added a comment -

          Thanks, Kristian. Fixed.

          Kristian Waagan added a comment -

          Thanks, Dag.

          The change looks good to me, +1.
          I see that two different values are used for the number of retries: 10 (as in your patch) and 200. I haven't checked if the test is waiting for something it knows takes a long time when the higher number of retries is used.

          Knut Anders Hatlen added a comment -

          It looks good to me too. Do we need similar changes for the other occurrences of this pattern too? Maybe we could make pingConnect() as tolerant as waitForConnect(), so that no changes are needed in the test cases themselves? (The former only tolerates 08004 errors while waiting for the server to come up, whereas the latter accepts all errors.)
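
          The difference in tolerance comes down to which SQLStates the retry loop swallows. A hedged sketch of the 08004-only policy (loop shape, names, and retry count are illustrative, not the actual test code):

              import java.sql.Connection;
              import java.sql.DriverManager;
              import java.sql.SQLException;

              public class RetryPolicySketch {
                  // connectPing-style: tolerate only 08004 ("connection refused")
                  // while waiting; any other SQLState is treated as fatal.
                  static Connection pingStyle(String url, int retries) throws Exception {
                      for (int i = 0; i < retries; i++) {
                          try {
                              return DriverManager.getConnection(url);
                          } catch (SQLException e) {
                              if (!"08004".equals(e.getSQLState())) {
                                  throw e; // a waitForConnect-style loop would retry here too
                              }
                              Thread.sleep(1000); // still in slave mode; retry
                          }
                      }
                      throw new SQLException("gave up after " + retries + " attempts");
                  }
              }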

          Dag H. Wanvik added a comment -

          Thanks, Kristian & Knut! I did notice other places which do not use this pattern, too. Making pingConnect more resilient might be good, unless there are use cases where the changed behavior is not desired. I'll have a look.

          Dag H. Wanvik added a comment - edited

          Looking again, I think my first analysis was wrong; perhaps I looked at the wrong derby.log files.

          Attaching here

          fail/Embedded_30/ReplicationRun_Local_3_p6/testReplication_Local_3_p6_autocommit_OK/db_slave-derby.log and
          fail/Embedded_30/ReplicationRun_Local_3_p6/testReplication_Local_3_p6_autocommit_OK/db_master-derby.log

          which show:

          • Master sees successful failover at time Tue May 29 04:37:03 (line 1175 in master log).
          • Slave reports "Failover perfomed successfully" at time Tue May 29 04:37:03 (line 28 in slave log).

          Then we see two minutes of failed connection attempts in the slave log, of this type:

          "Connection refused to database '.... db_slave/wombat' because it is in replication slave mode"

          Finally at time Tue May 29 04:39:08, i.e. two minutes later, the slave
          and master are shut down:

          "Shutting down Derby engine\nReplication slave role was stopped for database '... /db_slave/wombat'
          "Shutting down Derby engine" (master)

          because connectPing's two minutes of attempts to connect to the
          supposedly failed-over slave bail out (ReplicationRun, line 379).

          So the question becomes: why can't the client connect to the
          failed-over slave after two minutes? Both the master and the slave
          have reported in their logs that failover happened, as far as I
          understand them.

          Dag H. Wanvik added a comment -

          The message "Failover perfomed successfully" (sic) is issued after a call in the slave to LogToFile.failoverSlave, cf line 301 in SlaveController#doFailover,
          so that's how far we know the slave got.

          Dag H. Wanvik added a comment - edited

          Uploading a patch for the typo in the slave's failover message: derby-4269-typo.diff.

          Dag H. Wanvik added a comment -

          Committed typo patch as svn 1345272.

          Dag H. Wanvik added a comment -

          The call to LogToFile.failoverSlave does the following: it flushes received log records to disk, and then does a notify on slaveRecoveryMonitor, i.e.
          it doesn't wait for the recovery to finish. So it seems the slave has gotten stuck in the recovery somehow, thus making connection attempts fail for two minutes.
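
          A self-contained sketch of that hand-off, with names modeled on the comment above (not Derby's actual code):

              public class RecoveryHandoffSketch {
                  private final Object slaveRecoveryMonitor = new Object();
                  private boolean failoverRequested = false;

                  // failoverSlave(): flush, signal, and return immediately.
                  void failoverSlave() {
                      // ... flush received log records to disk (elided) ...
                      synchronized (slaveRecoveryMonitor) {
                          failoverRequested = true;
                          slaveRecoveryMonitor.notify(); // wake recovery; don't wait for it
                      }
                  }

                  // The recovery thread blocks here until failover is signalled.
                  void awaitFailover() throws InterruptedException {
                      synchronized (slaveRecoveryMonitor) {
                          while (!failoverRequested) {
                              slaveRecoveryMonitor.wait();
                          }
                      }
                      // ... replay remaining log and finish recovery (elided) ...
                  }
              }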

          Dag H. Wanvik added a comment -

          FWIW, I was able to boot the slave database without incident.

          Dag H. Wanvik added a comment -

          Did tight runs of this test over the weekend. I was able to reproduce it in 2 of 13000 attempts.

          Dag H. Wanvik added a comment -

          After a couple more days, I made it hang by not timing out the ping when it doesn't succeed. Uploading jstacks of the three involved processes: test client, master, and slave.
          Dag H. Wanvik added a comment -

          The only thing that strikes me at first blush is that the slave still has a "ping" thread running:

          "derby.slave.ping-/export/home/tmp/dag/zb/derby-4269-keepwaiting-replicationTests.ReplicationRun_Local_3_p6-sb1.jars.sane-1.7.0_04-25804/db_slave/wombat" daemon prio=3 tid=0x0873c400 nid=0x13 in Object.wait() [0xd409e000]
          java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)

          • waiting on <0xf0ff2508> (a java.lang.Object)
            at java.lang.Object.wait(Object.java:503)
            at org.apache.derby.impl.store.replication.net.ReplicationMessageReceive$SlavePingThread.run(ReplicationMessageReceive.java:535)
          • locked <0xf0ff2508> (a java.lang.Object)

          but I am not sure that is relevant.

          Dag H. Wanvik added a comment -

          Trying to connect to the slave manually via ij, I get confirmation that it hasn't recovered:

          ij> connect 'jdbc:derby://localhost:1531/wombat';
          ERROR 08004: DERBY SQL error: SQLCODE: -1, SQLSTATE: 08004, SQLERRMC: Connection refused to database '/export/home/tmp/dag/zb/derby-4269-keepwaiting-replicationTests.ReplicationRun_Local_3_p6-sb1.jars.sane-1.7.0_04-25804/db_slave/wombat' because it is in replication slave mode.

          Dag H. Wanvik added a comment -

          Attaching a jstack of the slave threads as they look immediately before we initiate failover in line 194 of ReplicationRun_Local_3_p6.

          Note that the threads for the SlaveController and SlaveDatabaseBootThread are (still) seen here; they have vanished in the stacks of the hung slave process!

          Dag H. Wanvik added a comment -

          In a successful failover, I see that the SlavePingThread lives on even after a successful connect, so that was a red herring. More interesting is the fact that the thread doing recovery has gone, but apparently the database still thinks it's in a slave state. I'll have a look at that phase transition.

          Dag H. Wanvik added a comment -

          In SlaveDatabase, the variable "inReplicationSlaveMode" holds the state. The error message 08004.C.7 ("Connection refused to database '{0}' because it is in replication slave mode") is only ever issued in two locations, both inside SlaveDatabase, in the methods "setupConnection" and "getAuthenticationService".

          In both cases, they test the variable inReplicationSlaveMode. Looking at the booting (recovery) thread, the variable is reset to false at ca. line 317, inside the thread SlaveDatabaseBootThread:

          try {
              :
              bootBasicDatabase(create, params); // will be blocked

              // if we get here, failover has been called and the
              // database can now be connected to
              inReplicationSlaveMode = false;
              :
          } catch (StandardException se) {
              // We get here when SlaveController#stopSlave has been
              // called, or if a fatal exception has been thrown.
              handleShutdown(se);
          }

          Now, the thread is gone, but in our case the variable is still true, so I believe the boot must have thrown a (runtime?) exception for this to occur. Notice we only catch StandardException, so any runtime exception would go unnoticed and the thread would terminate.
          I'll try to instrument this code to log any exceptions to derby.log before the thread dies.
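
          A self-contained illustration of the suspected failure mode (IllegalStateException stands in for Derby's checked StandardException; everything here is illustrative):

              public class SilentBootFailure {
                  static volatile boolean inReplicationSlaveMode = true;

                  public static void main(String[] args) throws InterruptedException {
                      Thread boot = new Thread(new Runnable() {
                          public void run() {
                              try {
                                  bootBasicDatabase(); // throws an unchecked exception
                                  inReplicationSlaveMode = false; // never reached
                              } catch (IllegalStateException se) {
                                  // plays the role of "catch (StandardException se)";
                                  // the NPE is not caught here, so the thread just dies
                              }
                          }
                      });
                      boot.start();
                      boot.join();
                      // The flag is still true, so connections keep being refused.
                      System.out.println("inReplicationSlaveMode = " + inReplicationSlaveMode);
                  }

                  static void bootBasicDatabase() {
                      throw new NullPointerException("simulated boot failure");
                  }
              }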

          Knut Anders Hatlen added a comment -

          If you set derby.tests.repltrace to true, the output from the sub-processes will be printed (see ReplicationRun.processDEBUGOutput(), called from runUserCommandLocally()). I assume an uncaught RuntimeException would have been printed to stderr in the sub-process, so it should be possible to see it with the tracing property set.

          Dag H. Wanvik added a comment -

          Yes, I considered that, but was afraid any earlier printing might change the timing, so I opted for a minimal change which only kicks in after the error has happened. We'll see if I catch any fish...

          Dag H. Wanvik added a comment -

          I caught the exception, and this is what I see; it was an NPE all right:

          ---- BEGIN REPLICATION ERROR MESSAGE (6/8/12 9:07 PM) ----
          Slave recovery: Unexpected exception, thread dies : [0] /export/home/tmp/dag/zb/derby-4269-keepwaiting-replicationTe\
          sts.ReplicationRun_Local_3_p6-sb1.jars.sane-1.7.0_04-11053/db_slave/wombat
          java.lang.NullPointerException
          at java.util.Hashtable.put(Hashtable.java:432)
          at org.apache.derby.iapi.util.DoubleProperties.propertyNames(DoubleProperties.java:70)
          at org.apache.derby.impl.services.monitor.BaseMonitor.getImplementations(BaseMonitor.java:1065)
          at org.apache.derby.impl.services.monitor.BaseMonitor.loadInstance(BaseMonitor.java:718)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:323)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:542)
          at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:427)
          at org.apache.derby.impl.db.BasicDatabase.bootClassFactory(BasicDatabase.java:731)
          at org.apache.derby.impl.db.BasicDatabase.boot(BasicDatabase.java:193)
          at org.apache.derby.impl.db.SlaveDatabase.bootBasicDatabase(SlaveDatabase.java:446)
          at org.apache.derby.impl.db.SlaveDatabase.access$000(SlaveDatabase.java:72)
          at org.apache.derby.impl.db.SlaveDatabase$SlaveDatabaseBootThread.run(SlaveDatabase.java:314)
          at java.lang.Thread.run(Thread.java:722)

          -------------------- END REPLICATION ERROR MESSAGE ---------------------

          Knut Anders Hatlen added a comment -

          The failing code in DoubleProperties.propertyNames():

          66     if (write != null) {
          67
          68         for (Enumeration e = write.propertyNames(); e.hasMoreElements(); ) {
          69             String key = (String) e.nextElement();
          70             p.put(key, write.getProperty(key));
          71         }
          72     }

          One of the arguments to Hashtable.put() is null. I think this can only happen if some other thread is modifying the "write" instance at the same time as this code runs.
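
          A single-threaded simulation of that interleaving reproduces the NPE deterministically (illustrative only; in the real bug a second thread clears the underlying Properties between the key enumeration and the value read):

              import java.util.Enumeration;
              import java.util.Properties;

              public class NullPutDemo {
                  public static void main(String[] args) {
                      Properties write = new Properties();
                      write.setProperty("derby.foo", "bar");
                      Properties p = new Properties();
                      for (Enumeration e = write.propertyNames(); e.hasMoreElements(); ) {
                          String key = (String) e.nextElement();
                          write.remove(key); // stands in for the concurrent info.clear()
                          p.put(key, write.getProperty(key)); // NPE: the value is now null
                      }
                  }
              }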

          Dag H. Wanvik added a comment - edited

          Yes, that was my conclusion too. The usage of property lists is extremely convoluted in the Derby boot process, and it was very hard to understand what's going on. And since this goes on in a forked VM, using a debugger is tricky without affecting the timing... I replaced all usages of java.util.Properties with a specialized version (DerbyProperties), so I could track all allocations and modifications of properties. Adding to the problem is the fact that Properties can have default Properties lists recursively, and this is being used. In addition Derby adds two layers of specializations: UpdateServiceProperties and DoubleProperties.

          Mostly, property lists are passed into constructors without any cloning, so the number of live aliases is high.

          It turns out the layering at the time of the NPE in the boot is like this:

          DoubleProperties#write instance of
          UpdateServiceProperties#defaults instance of
          Properties#defaults instance of
          Properties (no default) <----- race on this one

          The bottom one is the one that's being changed under our feet in these lines in EmbedConnection#bootDatabase:

          :
          // clear these values as some modules hang onto
          // the properties set corresponding to service.properties
          // and they shouldn't be interested in these JDBC attributes.
          info.clear();

          It is initially allocated in EmbedConnection#filterProperties:

          private Properties filterProperties(Properties inputSet) {
          Properties limited = new org.apache.derby.iapi.util.DerbyProperties();
          :

          assigning it to "info".

          The thread that is doing this is the thread that initially started the slave operation (cf. URL "startSlave=true"). This thread hangs until the master has been started and replication is underway[1]. When the master comes online, the connect in this thread returns with XRE08 ("REPLICATION_SLAVE_STARTED_OK").

          The failover happens in this test almost immediately, and the SlaveDatabaseBootThread then starts moving again (it is waiting for more log records to replay, or for failover).

          Unfortunately, in the current scenario, the thread doing "startSlave=true" is still active; normally it would have finished long before SlaveDatabaseBootThread starts moving again.

          The Properties default at the 4th level above would normally be empty, since it's being cleared by the "startSlave=true" thread in the "info.clear()" call. But since that thread is lagging behind, it only does this "clear" after the SlaveDatabaseBootThread has started its boot, leading to the contention.

          [1] In EmbedConnection.bootDatabase, this corresponds to the line:

          // try to start the service if it doesn't already exist
          if (!Monitor.startPersistentService(dbname, info)) {
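
          To illustrate the defaults chaining described above, plain java.util.Properties can stand in for Derby's wrappers (a hedged sketch, not Derby code):

              import java.util.Properties;

              public class NestingSketch {
                  public static void main(String[] args) {
                      Properties bottom = new Properties();        // the contested object
                      bottom.setProperty("startSlave", "true");
                      Properties middle = new Properties(bottom);  // Properties#defaults
                      Properties top = new Properties(middle);     // outermost wrapper layer
                      System.out.println(top.getProperty("startSlave")); // "true", via bottom
                      bottom.clear();                              // like info.clear() elsewhere
                      System.out.println(top.getProperty("startSlave")); // now null
                  }
              }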

          Dag H. Wanvik added a comment -

          This may not be a problem in production, but it may make several of our replication tests fail, so I am changing priority to Major.

          Dag H. Wanvik added a comment - edited

          Attaching a patch which skips the cleaning action iff we are booting a slave. It makes the test work. I don't believe it will cause any problems, but we'll see when the regression results come in.

          An alternative way to fix this would be some kind of extra synchronization to hold back the recovery thread till the boot thread has returned control to the client.

          Dag H. Wanvik added a comment -

          Regressions passed. Please review.

          Knut Anders Hatlen added a comment -

          Looks like a reasonable workaround for the bug, as the call to clear() seems not to be strictly required in any of the cases.

          The alternative approach of adding extra synchronization may be difficult, unless the Properties objects are replaced with some other data structure. At least the straightforward fix of enclosing the loop in DoubleProperties in a synchronized block on "write" would probably not suffice, because the multi-level nesting of Properties objects would still make it possible for other threads to access one of the underlying Properties objects directly and bypass the top-level synchronization.

          Dag H. Wanvik added a comment -

          Thanks, Knut. Yes, you are right, we'd need some new mechanism for synchronizing this since the access is to the bottom-most property list.

          Dag H. Wanvik added a comment -

          Uploading an explicit synchronization fix for this issue. This will have the clearing of the properties wait till the boot thread is fully done by augmenting the predicate in the method verifySuccessfulBoot. This also made my instrumented Derby (which fails every time otherwise), work correctly for the test.

          I think I would prefer this patch to the previous one since it retains the original intent of the "info.clear" call.

          Dag H. Wanvik added a comment -

          By way of explanation: the analysis in a comment above is slightly wrong, cf. the call to Monitor.startPersistentService (footnote [1] above). This call will actually hang till the failover is almost done, but not quite, as evidenced in the bug. The reason the new patch works is that it makes sure the boot thread is really finished before allowing the call [1] in the initiating thread to complete.

          Knut Anders Hatlen added a comment -

          I agree that a fix that's local to SlaveDatabase is cleaner.

          To reduce the amount of state we need to maintain in the class, we could perhaps pass sdbThread as an argument to verifySuccessfulBoot(), and call isAlive() on it instead of checking the new volatile flag.

          Dag H. Wanvik added a comment -

          Yup, that's even better. Uploading rev #2 of this patch, "derby-4269-explicit-synch-2" using that idea. Rerunning all regressions.

          Knut Anders Hatlen added a comment -

          Did you forget to upload the patch? I don't see it.

          Dag H. Wanvik added a comment -

          Yes, apparently I did. Here it is for completeness.

          Unfortunately, explicit synchronization breaks other replication tests. The new extra waiting in "verifySuccessfulBoot" leaves the boot of the slave database (minus the actual fail-over part that makes it a new master) incomplete until fail-over time; cf. the call to startPersistentService mentioned above, which will now wait longer.

          This makes other connection attempts to the slave hang (the service isn't started, so the new thread tries to start it again), whereas previously a new thread returned an error indicating that the slave was still in slave state. So, I am back to the first patch, which skips the clearing of the properties for slave databases.

          Dag H. Wanvik added a comment -

          After this point in the SlaveController it is unsafe to access the properties if they are concurrently being cleared:

          SlaveController#startSlave, line 254: "startupSuccessful = true;" because past that point, the connection thread is free to proceed and clear the properties, since the last criterion of verifySuccessfulBoot (boot waits for slave operation to be initiated here) is fulfilled.

          Dag H. Wanvik added a comment -

          Attaching derby-4269-explicit-synch-3. This uses a binary semaphore to avoid concurrent access to the "info" Properties object by the two contending threads.

          EmbedConnection#bootDatabase grabs the semaphore before doing the call on "info.clear" and releases it after.

          The SlaveDatabaseBootThread grabs it immediately after recovery has been successfully completed after a fail-over (in LogToFile#recover) and releases it when the SlaveDatabaseBootThread is done.

          This patch passes regressions and also works with my instrumented version of Derby which reliably reproduces the NPE seen in this issue.

          Unfortunately, it adds a bit of clutter since we need to add the P&V operations through three levels:

          SlaveDatabase
          SlaveFactory/SlaveController
          LogToFile

          but I didn't find a better way.
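
          A hedged sketch of the semaphore idea (java.util.concurrent.Semaphore is used here for brevity, and all names are illustrative; the actual patch threads the P and V operations through the three layers listed above):

              import java.util.Properties;
              import java.util.concurrent.Semaphore;

              public class InfoGuardSketch {
                  // Binary semaphore guarding the shared "info" Properties object.
                  private final Semaphore infoGuard = new Semaphore(1);

                  // Connection thread (cf. EmbedConnection#bootDatabase):
                  void clearJdbcAttributes(Properties info) throws InterruptedException {
                      infoGuard.acquire();     // P: wait until recovery isn't reading info
                      try {
                          info.clear();
                      } finally {
                          infoGuard.release(); // V
                      }
                  }

                  // Recovery/boot thread: hold the semaphore while the boot may read info.
                  void recoveryWindow() throws InterruptedException {
                      infoGuard.acquire();     // grabbed after fail-over recovery completes
                      try {
                          // ... finish the SlaveDatabaseBootThread's work (elided) ...
                      } finally {
                          infoGuard.release(); // released when the boot thread is done
                      }
                  }
              }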

          Knut Anders Hatlen added a comment -

          Since the underlying NullPointerException happens because we try to put a null value into a Properties object in DoubleProperties.propertyNames(), but we never actually use the property value (the method returns p.keys()), I'm wondering if the attached patch would fix the problem. It makes propertyNames() use an intermediate Vector instance instead of a Properties instance, so that it never needs to access the property value.
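
          The assumed shape of the keys-only idea (a sketch, not the attached patch itself):

              import java.util.Enumeration;
              import java.util.Properties;
              import java.util.Vector;

              public class KeysOnlySketch {
                  // Collect key names from both layers without ever reading a value,
                  // so a concurrently removed property can no longer trigger an NPE.
                  static Enumeration propertyNames(Properties read, Properties write) {
                      Vector names = new Vector();
                      for (Enumeration e = read.propertyNames(); e.hasMoreElements(); ) {
                          names.addElement(e.nextElement());
                      }
                      if (write != null) {
                          for (Enumeration e = write.propertyNames(); e.hasMoreElements(); ) {
                              names.addElement(e.nextElement()); // note: duplicates not eliminated
                          }
                      }
                      return names.elements();
                  }
              }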

          Dag H. Wanvik added a comment -

          Yes, it would remove the NPE, I agree. But do we have any guarantee that the returned keys aren't used to access values elsewhere (and if so, that those values are assumed to be non-null)? It may well be that that is not the case, and that this solution is OK. Still, I think the shared use of the properties is unsafe...

          Knut Anders Hatlen added a comment -

          FWIW, all the regression tests ran cleanly with the keys-only.diff patch. Even if it doesn't fix the bug, I think it's an improvement as it slightly simplifies the code.

          > But do we have any guarantee that the returned keys aren't used to access values elsewhere (and if so, that those values are assumed to be non-null)?

          I don't know. However, what caused this particular bug, was that the override of Properties.propertyNames() itself wasn't safe to use, so that it didn't give the thread-safety guarantees that one would normally expect from Properties' methods. Callers of Properties' methods should already be aware of the limitations of the synchronization guarantees provided by Properties (for example that there is a possibility that getProperty() returns null, as there is no synchronization preventing the property from getting removed after a check for the existence of a key). If they don't take that into consideration, it's a bug that needs to be fixed. But in cases where the actual bug lies in the override of a Properties method, like in this case, one cannot reasonably expect the callers to handle the problem gracefully.

          So although I cannot guarantee that all users of Properties instances in the engine are free from bugs, I believe that the Properties class does provide the functionality to be used in a thread-safe way, even without adding a semaphore that guards the call to info.clear().

          Knut Anders Hatlen added a comment -

          By the way, I don't think the keys-only patch is quite correct. Now that it's using a Vector, there's no elimination of duplicates, and it might return duplicate keys if the read and write fields contain overlapping properties.

          Dag H. Wanvik added a comment -

          Knut: <good explanation of the thread-safety guarantees of Properties>
          Part of the problem here is that the old (i.e. pre-replication) code might not be prepared for thread concurrency on the Properties object, viz. the fact that the buggy overload wasn't detected before.
          Still, I am tempted to go for your solution (modulo the weakness you pointed out) to get rid of the ugly explicit synchronization, and to fix wrong-use issues if/when they surface. My fear, though, is that those could be hard to track down... so the explicit synch is safer if we can't be sure how this is used. I'll try to analyze the usages of the overloaded propertyNames.

          Knut Anders Hatlen added a comment -

          I logged DERBY-5830 for fixing DoubleProperties.propertyNames(), so we can separate that issue from any changes that need to be done to the slave boot code.

          So far, I think we have discussed three areas that potentially need fixing:

          1) Lack of thread-safety in DoubleProperties - now handled in DERBY-5830.

          2) Boot code not prepared for being multi-threaded. The fix for (1) may be sufficient, but there may be other hidden bugs caused by the slave boot process breaking assumptions made in existing code.

          3) SlaveDatabaseBootThread silently ignores runtime exceptions and leaves the process hanging. The fix for (1) may remove the cause for the hang we've seen, but SlaveDatabaseBootThread still doesn't handle unexpected failures gracefully.

          I'm not sure what's the best way to approach (2). It's difficult to see from the code exactly which places the Properties instance in question is used. Maybe we could make EmbedConnection.filterProperties() return an instrumented sub-class of Properties that logs the stack trace of all calls to getProperty(), run some subset of the regression tests, and then manually inspect the call sites to see if they would handle the disappearance of a property?
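
          A sketch of such an instrumented subclass (hypothetical; not part of any attached patch):

              import java.util.Properties;

              public class TracingProperties extends Properties {
                  public TracingProperties(Properties defaults) {
                      super(defaults);
                  }

                  public String getProperty(String key) {
                      // Log the call site of every read so the callers can be inspected.
                      new Throwable("getProperty(\"" + key + "\")").printStackTrace();
                      return super.getProperty(key);
                  }
              }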

          Dag H. Wanvik added a comment -

          Thanks for filing DERBY-5830. I added some code in my patch to log errors to derby.log if SlaveDatabaseBootThread fails, as in 3). How graceful that is is another matter, but we should at least learn what the error is. Under normal operation the boot should not fail, one would hope. As for checking usages, I have already made a Derby-specific version of Properties to find the original race, so I can use that for tracking down usages of getProperty.

          Dag H. Wanvik added a comment - edited

          I instrumented all calls to propertyNames while in the slave boot code running ReplicationRun_Local_3_p6-sb1. It turns out the only place where DoubleProperties#propertyNames
          is called is here in BaseMonitor#getImplementations (cf also the original backtrace above):

          nextModule:
          for (Enumeration e = moduleList.propertyNames(); e.hasMoreElements(); ) {
              String key = (String) e.nextElement();

              // module tagged name in the modules.properties file.
              // used as the tag for dependent properties.
              String tag;

              // Dynamically loaded code is defined by a property of
              // the form:
              // derby.module.<modulename>=<class name>
              // or
              // derby.subSubProtocol.<modulename>=<classname>

              if (key.startsWith(Property.MODULE_PREFIX)) {
                  tag = key.substring(Property.MODULE_PREFIX.length());
              } else if (key.startsWith(Property.SUB_SUB_PROTOCOL_PREFIX)) {
                  tag = key.substring(Property.MODULE_PREFIX.length());
              } else {
                  continue nextModule;
              }

          This means that all properties that do not start with MODULE_PREFIX ("derby.module.") or SUB_SUB_PROTOCOL_PREFIX ("derby.subSubProtocol.") are skipped. None of those are legal URL attributes, so they could not stem from the contested info properties object allocated in EmbedConnection.

          I therefore believe we can avoid doing the explicit synchronization when DERBY-5830 goes in.

          I'll run the instrumented versions on all the replication tests for good measure before I conclude, though.

          [Comment edited 2012-07-03: completed tests, the above theory holds.]

          Dag H. Wanvik added a comment -

          Uploading a patch for item 3) above, fix-ignored-runtime-exceptions-a, as well as a patch building on it which adds instrumentation for testing it manually.

          Using the instrumentation I see this in derby.log:

          ---- BEGIN REPLICATION ERROR MESSAGE (02.07.12 21:58) ----
          Replication slave got a fatal error for database '/export/home/dag/NetBeansProjects/JUnitTest/test/db_slave/wombat'. Replication will be stopped.
          java.lang.NullPointerException
          at org.apache.derby.impl.db.SlaveDatabase$SlaveDatabaseBootThread.run(SlaveDatabase.java:324)
          at java.lang.Thread.run(Thread.java:722)

          -------------------- END REPLICATION ERROR MESSAGE ---------------------

          Knut Anders Hatlen added a comment -

          The patch looks fine to me.

          Minor nit: Since SlaveDatabaseBootThread is a non-static inner class of SlaveDatabase, the fields and methods of the parent SlaveDatabase instance can be accessed directly from the SlaveDatabaseBootThread instance. So it should be possible to read the value of the dbname field in SlaveDatabaseBootThread.run() without introducing an accessor method and a new field.
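
          A minimal illustration of the nit (names illustrative):

              public class Outer {
                  private final String dbname = "wombat";

                  // Non-static inner class: it can read the outer instance's fields
                  // directly, so no accessor method or duplicated field is needed.
                  class BootThread extends Thread {
                      public void run() {
                          System.out.println("booting " + dbname);
                      }
                  }

                  public static void main(String[] args) {
                      new Outer().new BootThread().start();
                  }
              }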

          Dag H. Wanvik added a comment -

          Thanks, Knut. Fixed your nit in version fix-ignored-runtime-exceptions-b.

          Dag H. Wanvik added a comment -

          Committed fix-ignored-runtime-exceptions-b as svn 1356766, resolving.

          Dag H. Wanvik added a comment -

          Backported fix-ignored-runtime-exceptions-b to 10.9 at svn 1357310.

          Dag H. Wanvik added a comment -

          Backported fix-ignored-runtime-exceptions-b to 10.8 at svn 1357343.

          Will close issue when DERBY-5830 is backported, too.


            People

    • Assignee: Dag H. Wanvik
    • Reporter: Ole Solberg
    • Votes: 0
    • Watchers: 3
