Derby
  1. Derby
  2. DERBY-5937

File handle is leaked when a Slave replication is shutdown with failover=true

    Details

    • Urgency:
      Normal
    • Bug behavior facts:
      Seen in production

      Description

      As part of our use of derby replication, we shut the slave down if we detect reachability issues with the master. Normally we shut it down, and bring it back up as a regular database in read-only mode. Then when the master can, it tries to push a fresh copy back to the slave system. However, during the failover=true shutdown on the slave, the log file ".../name/log/log1.dat" is still open.

      Because of this open file, at least on Windows, it's impossible to move the database file out of the way, and to install a fresh copy from the master.

      1. backport-10.8.diff
        6 kB
        Knut Anders Hatlen
      2. backport-10.9.diff
        10 kB
        Knut Anders Hatlen
      3. d5937-1a-test.diff
        12 kB
        Knut Anders Hatlen
      4. d5937-2a-close.diff
        2 kB
        Knut Anders Hatlen
      5. d5937-3a-test-datasource.diff
        8 kB
        Knut Anders Hatlen

        Activity

        Hide
        Knut Anders Hatlen added a comment -

        Hi Glenn,

        If I understand correctly, this is what you're doing when you see the problem:

        1) You have a master database that is being replicated to the slave, and you detect some kind of trouble with the master database

        2) You invoke failover=true on the slave database (which completes as expected)

        3) You then invoke shutdown=true on the slave database (which also completes as expected)

        After these steps, the slave process is still holding log1.dat open.

        Does that sound about right?

        Thanks.

        Show
        Knut Anders Hatlen added a comment - Hi Glenn, If I understand correctly, this is what you're doing when you see the problem: 1) You have a master database that is being replicated to the slave, and you detect some kind of trouble with the master database 2) You invoke failover=true on the slave database (which completes as expected) 3) You then invoke shutdown=true on the slave database (which also completes as expected) After these steps, the slave process is still holding log1.dat open. Does that sound about right? Thanks.
        Hide
        Knut Anders Hatlen added a comment -

        I just tried the above steps with OpenJDK on FreeBSD, and verified with lsof that log1.dat indeed is still open.

        Show
        Knut Anders Hatlen added a comment - I just tried the above steps with OpenJDK on FreeBSD, and verified with lsof that log1.dat indeed is still open.
        Hide
        Glenn McGregor added a comment -

        1. Close enough. The master database is embedded in a JVM that is pinged using an RMI call and some timers. When that fails, the switchover is attempted.
        2. Yes.
        2.5 I change the database property defaultConnectionMode to readOnlyAccess. The database is left up (in readonly mode) for use until the master is reachable again.

        ... time passes

        3. The master detects it can talk to the slave JVM, and does a shutdown=true on the slave database. It succeeds.
        4. Many additional steps to get it back into master/slave replication.

        Show
        Glenn McGregor added a comment - 1. Close enough. The master database is embedded in a JVM that is pinged using an RMI call and some timers. When that fails, the switchover is attempted. 2. Yes. 2.5 I change the database property defaultConnectionMode to readOnlyAccess. The database is left up (in readonly mode) for use until the master is reachable again. ... time passes 3. The master detects it can talk to the slave JVM, and does a shutdown=true on the slave database. It succeeds. 4. Many additional steps to get it back into master/slave replication.
        Hide
        Knut Anders Hatlen added a comment -

        The patch d5937-1a-test.diff adds a regression test case that exposes the bug when run on Windows. The test case fails because it cannot delete all files in the slave database directory after shutdown:

        <assertDirectoryDeleted> attempt 1 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db
        <assertDirectoryDeleted> attempt 2 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db
        <assertDirectoryDeleted> attempt 3 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db
        <assertDirectoryDeleted> attempt 4 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db
        F
        Time: 18,891
        There was 1 failure:
        1) testSlaveFailoverLeak(org.apache.derbyTesting.functionTests.tests.replicationTests.Derby5937SlaveShutdownTest)junit.framework.AssertionFailedError: Failed to delete 3 files (root=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db): C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat (isDir=false, canRead=true, canWrite=true, size=1048576), C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log (isDir=true, canRead=true, canWrite=true, size=0), C:\cygwin\home\lroot\derbytst\system\d5937-slave-db (isDir=true, canRead=true, canWrite=true, size=4096)
        at org.apache.derbyTesting.junit.BaseTestCase.assertDirectoryDeleted(BaseTestCase.java:1027)
        at org.apache.derbyTesting.functionTests.tests.replicationTests.Derby5937SlaveShutdownTest.testSlaveFailoverLeak(Derby5937SlaveShutdownTest.java:169)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:117)
        at org.apache.derbyTesting.junit.BaseJDBCTestCase.runBareOverridable(BaseJDBCTestCase.java:424)
        at org.apache.derbyTesting.junit.BaseJDBCTestCase.runBare(BaseJDBCTestCase.java:441)
        at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
        at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
        at junit.extensions.TestSetup.run(TestSetup.java:25)
        at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
        at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
        at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
        at junit.extensions.TestSetup.run(TestSetup.java:25)
        at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
        at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
        at junit.extensions.TestSetup.run(TestSetup.java:25)

        FAILURES!!!
        Tests run: 1, Failures: 1, Errors: 0

        Show
        Knut Anders Hatlen added a comment - The patch d5937-1a-test.diff adds a regression test case that exposes the bug when run on Windows. The test case fails because it cannot delete all files in the slave database directory after shutdown: <assertDirectoryDeleted> attempt 1 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db <assertDirectoryDeleted> attempt 2 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db <assertDirectoryDeleted> attempt 3 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db <assertDirectoryDeleted> attempt 4 left 3 files/dirs behind: 0=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat 1=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log 2=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db F Time: 18,891 There was 1 failure: 1) testSlaveFailoverLeak(org.apache.derbyTesting.functionTests.tests.replicationTests.Derby5937SlaveShutdownTest)junit.framework.AssertionFailedError: Failed to delete 3 files (root=C:\cygwin\home\lroot\derbytst\system\d5937-slave-db): C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log\log1.dat (isDir=false, canRead=true, canWrite=true, size=1048576), C:\cygwin\home\lroot\derbytst\system\d5937-slave-db\log (isDir=true, canRead=true, canWrite=true, size=0), C:\cygwin\home\lroot\derbytst\system\d5937-slave-db (isDir=true, canRead=true, canWrite=true, size=4096) at org.apache.derbyTesting.junit.BaseTestCase.assertDirectoryDeleted(BaseTestCase.java:1027) at org.apache.derbyTesting.functionTests.tests.replicationTests.Derby5937SlaveShutdownTest.testSlaveFailoverLeak(Derby5937SlaveShutdownTest.java:169) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:117) at org.apache.derbyTesting.junit.BaseJDBCTestCase.runBareOverridable(BaseJDBCTestCase.java:424) at org.apache.derbyTesting.junit.BaseJDBCTestCase.runBare(BaseJDBCTestCase.java:441) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.extensions.TestSetup.run(TestSetup.java:25) at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.extensions.TestSetup.run(TestSetup.java:25) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.extensions.TestSetup.run(TestSetup.java:25) FAILURES!!! Tests run: 1, Failures: 1, Errors: 0
        Hide
        Knut Anders Hatlen added a comment -

        I've committed the test case to trunk with revision 1392847. It's not enabled yet, though.

        Show
        Knut Anders Hatlen added a comment - I've committed the test case to trunk with revision 1392847. It's not enabled yet, though.
        Hide
        Kristian Waagan added a comment -

        Verified that the test failed, as expected, on Windows 7 (Cygwin + cmd).

        Show
        Kristian Waagan added a comment - Verified that the test failed, as expected, on Windows 7 (Cygwin + cmd).
        Hide
        Knut Anders Hatlen added a comment -

        Thanks for verifying the test, Kristian.

        I'm attaching a new patch (d5937-2a-close.diff) which attempts to fix the bug.

        While a slave is running, LogToFile blocks in recover() waiting for fail-over to happen. When it is notified that fail-over has happened, it continues with ordinary recovery as a first step in booting the database, and in this process it reads the log files. In a "normal" boot process, the log file isn't already open when recovery runs, so recover() doesn't care about closing the currently active log file first. This causes a problem when promoting a slave database, which holds a log file open, as the re-opening of the log file will make LogToFile forget about the old file handle without ever closing it.

        The fix makes recover() check if a log file is open, and close that file, before re-opening the log. It also enables the regression test case for the bug in ReplicationSuite.

        I've verified that the test case passes on Windows when the fix is applied. I'm also running the full regression test suite on Solaris and Windows. Will report back when all the tests have completed.

        Show
        Knut Anders Hatlen added a comment - Thanks for verifying the test, Kristian. I'm attaching a new patch (d5937-2a-close.diff) which attempts to fix the bug. While a slave is running, LogToFile blocks in recover() waiting for fail-over to happen. When it is notified that fail-over has happened, it continues with ordinary recovery as a first step in booting the database, and in this process it reads the log files. In a "normal" boot process, the log file isn't already open when recovery runs, so recover() doesn't care about closing the currently active log file first. This causes a problem when promoting a slave database, which holds a log file open, as the re-opening of the log file will make LogToFile forget about the old file handle without ever closing it. The fix makes recover() check if a log file is open, and close that file, before re-opening the log. It also enables the regression test case for the bug in ReplicationSuite. I've verified that the test case passes on Windows when the fix is applied. I'm also running the full regression test suite on Solaris and Windows. Will report back when all the tests have completed.
        Hide
        Knut Anders Hatlen added a comment -

        All the regression tests ran cleanly on Solaris.

        There were two failures on Windows. One of them looked identical to the already known instability logged as DERBY-5866. The other failure was in lang/triggerGeneral.sql, apparently caused by a race between the B-tree post-commit worker thread and SYSCS_DIAG.LOCK_TABLE, resulting in more locks than expected in the lock table. I couldn't find any JIRA issue for the latter failure, so I'll file a new one.

        None of the test failures seem to be related to the patch, so I'm setting the Patch Available flag.

        Show
        Knut Anders Hatlen added a comment - All the regression tests ran cleanly on Solaris. There were two failures on Windows. One of them looked identical to the already known instability logged as DERBY-5866 . The other failure was in lang/triggerGeneral.sql, apparently caused by a race between the B-tree post-commit worker thread and SYSCS_DIAG.LOCK_TABLE, resulting in more locks than expected in the lock table. I couldn't find any JIRA issue for the latter failure, so I'll file a new one. None of the test failures seem to be related to the patch, so I'm setting the Patch Available flag.
        Hide
        Knut Anders Hatlen added a comment -

        Committed revision 1394407.

        I'll keep the issue open for backporting to 10.9 and 10.8.

        Show
        Knut Anders Hatlen added a comment - Committed revision 1394407. I'll keep the issue open for backporting to 10.9 and 10.8.
        Hide
        Knut Anders Hatlen added a comment -

        The new test case hangs forever when run on Java 5. The root cause is the same as DERBY-5607. DriverManager.getConnection() used to be synchronized in Java 5/JDBC 3, and since the test case runs the slave and the master in the same JVM, it runs into a deadlock when setting up replication. That is, the startSlave command blocks in DriverManager.getConnection() until the master has started, but the master cannot start because it can't get into DriverManager.getConnection() as long as the slave hasn't completed its startup. This is not a problem in Java 6 or later, as DriverManager.getConnection() is no longer synchronized.

        I've attached a patch (3a) that makes the test use DataSource instead of DriverManager. This avoids the problem on Java 5. It also makes it possible to run the test on JSR-169 platforms, which don't have the DriverManager class.

        Committed revision 1395482.

        Show
        Knut Anders Hatlen added a comment - The new test case hangs forever when run on Java 5. The root cause is the same as DERBY-5607 . DriverManager.getConnection() used to be synchronized in Java 5/JDBC 3, and since the test case runs the slave and the master in the same JVM, it runs into a deadlock when setting up replication. That is, the startSlave command blocks in DriverManager.getConnection() until the master has started, but the master cannot start because it can't get into DriverManager.getConnection() as long as the slave hasn't completed its startup. This is not a problem in Java 6 or later, as DriverManager.getConnection() is no longer synchronized. I've attached a patch (3a) that makes the test use DataSource instead of DriverManager. This avoids the problem on Java 5. It also makes it possible to run the test on JSR-169 platforms, which don't have the DriverManager class. Committed revision 1395482.
        Hide
        Knut Anders Hatlen added a comment -

        The regression test case uses a helper method, TestConfiguration.getDatabasePath(), which is not available on the 10.9 branch. The attached backport-10.9.diff patch is a backport of the three commits that went into trunk, with the missing method added manually. All the regression tests ran cleanly on the 10.9 branch with this patch.

        Show
        Knut Anders Hatlen added a comment - The regression test case uses a helper method, TestConfiguration.getDatabasePath(), which is not available on the 10.9 branch. The attached backport-10.9.diff patch is a backport of the three commits that went into trunk, with the missing method added manually. All the regression tests ran cleanly on the 10.9 branch with this patch.
        Hide
        Knut Anders Hatlen added a comment -

        Committed the 10.9 backport with revision 1396606.

        Show
        Knut Anders Hatlen added a comment - Committed the 10.9 backport with revision 1396606.
        Hide
        Knut Anders Hatlen added a comment -

        In addition to the issues seen when backporting to 10.9, the test framework on the 10.8 branch is missing functionality for adding permissions incrementally, so I had to disable the security manager in the test case in order to get it to compile on 10.8. See the attached backport-10.8.diff patch.

        All the regression tests ran cleanly on the 10.8 branch.

        Show
        Knut Anders Hatlen added a comment - In addition to the issues seen when backporting to 10.9, the test framework on the 10.8 branch is missing functionality for adding permissions incrementally, so I had to disable the security manager in the test case in order to get it to compile on 10.8. See the attached backport-10.8.diff patch. All the regression tests ran cleanly on the 10.8 branch.
        Hide
        Knut Anders Hatlen added a comment -

        Committed the 10.8 backport with revision 1398947.

        I'm not planning to backport it further for now. Marking the issue as resolved.

        Show
        Knut Anders Hatlen added a comment - Committed the 10.8 backport with revision 1398947. I'm not planning to backport it further for now. Marking the issue as resolved.

          People

          • Assignee:
            Knut Anders Hatlen
            Reporter:
            Glenn McGregor
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development