Derby
  1. Derby
  2. DERBY-5222

Compatibility tests fail to delete database directory

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.8.1.2, 10.9.1.0
    • Fix Version/s: 10.8.2.2, 10.9.1.0
    • Component/s: Test
    • Labels:
      None
    • Bug behavior facts:
      Regression Test Failure

      Description

      The compatibility tests fail frequently on one platform (Solaris 10, x86 (32-bit), Java 7 ea b131). For example here: http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/testing/testlog/sol32/1100759-compatibility_diff.txt

      From what's printed to the console, it looks like it fails to delete the database directory:

      > Failed deleting database dir '/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat'

      And then, when running the actual test, it fails to create the log directory on that location (presumably because one already exists):

      Exception in thread "main" java.sql.SQLException: Failed to start database 'wombat' with class loader sun.misc.Launcher$AppClassLoader@53c015, see the next exception for details.
      (...)
      Caused by: java.sql.SQLException: cannot create log file at directory /export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/log.
      (...)
      Caused by: ERROR XSLAQ: cannot create log file at directory /export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/log.
      at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
      at org.apache.derby.impl.store.raw.log.LogToFile.getLogDirectory(Unknown Source)
      at org.apache.derby.impl.store.raw.log.LogToFile.getControlFileName(Unknown Source)
      at org.apache.derby.impl.store.raw.log.LogToFile.boot(Unknown Source)
      at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
      at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
      at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
      at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.bootLogFactory(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.setRawStoreFactory(Unknown Source)
      at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
      (...)

      1. d5222-3a.diff
        3 kB
        Knut Anders Hatlen
      2. d5222-2a-waitForProc.diff
        0.9 kB
        Knut Anders Hatlen
      3. d5222-1a-logging.diff
        2 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          Merged to 10.8. Committed revision 1138409.

          Show
          Knut Anders Hatlen added a comment - Merged to 10.8. Committed revision 1138409.
          Hide
          Knut Anders Hatlen added a comment -

          The failure hasn't been seen again after the last commit. Marking the issue as resolved.

          Show
          Knut Anders Hatlen added a comment - The failure hasn't been seen again after the last commit. Marking the issue as resolved.
          Hide
          Knut Anders Hatlen added a comment -

          The test hasn't failed on that platform in the first two test cycles after the commit. It failed five times in a row just before the commit. So it looks promising, but let's give it a few more days before we declare victory.

          Show
          Knut Anders Hatlen added a comment - The test hasn't failed on that platform in the first two test cycles after the commit. It failed five times in a row just before the commit. So it looks promising, but let's give it a few more days before we declare victory.
          Hide
          Knut Anders Hatlen added a comment -

          Committed revision 1103742.

          Show
          Knut Anders Hatlen added a comment - Committed revision 1103742.
          Hide
          Knut Anders Hatlen added a comment -

          There are more processes that get started by the compatibility test and that we don't wait for to complete (for example the process in which the network server runs). Attaching yet another patch (3a) that waits for all processes started by the compatibility test before moving on to the next combination. Maybe this isn't what's causing the failures, but it's probably worth fixing in any case. The compatibility test ran cleanly in my environment with the patch.

          Show
          Knut Anders Hatlen added a comment - There are more processes that get started by the compatibility test and that we don't wait for to complete (for example the process in which the network server runs). Attaching yet another patch (3a) that waits for all processes started by the compatibility test before moving on to the next combination. Maybe this isn't what's causing the failures, but it's probably worth fixing in any case. The compatibility test ran cleanly in my environment with the patch.
          Hide
          Knut Anders Hatlen added a comment -

          Nope, that didn't do the trick. It's still failing...

          Show
          Knut Anders Hatlen added a comment - Nope, that didn't do the trick. It's still failing...
          Hide
          Knut Anders Hatlen added a comment -

          I think that's how the test works, yes. Attaching a patch (2a) that calls Process.waitFor() after having read the output from the process. The compatibility tests ran cleanly in my environment with the patch, so I committed it to trunk with revision 1102620. Let's see if this silences the test.

          Show
          Knut Anders Hatlen added a comment - I think that's how the test works, yes. Attaching a patch (2a) that calls Process.waitFor() after having read the output from the process. The compatibility tests ran cleanly in my environment with the patch, so I committed it to trunk with revision 1102620. Let's see if this silences the test.
          Hide
          Kristian Waagan added a comment -

          If that's how the test works, the theory sounds very reasonable.

          Regarding improvements, I'd opt for fixing the test logic and keeping BaseTestCase.removeDirectory as it is, exactly for the reason you mention, even though the method has actually succeeded in removing the directory.

          Show
          Kristian Waagan added a comment - If that's how the test works, the theory sounds very reasonable. Regarding improvements, I'd opt for fixing the test logic and keeping BaseTestCase.removeDirectory as it is, exactly for the reason you mention, even though the method has actually succeeded in removing the directory.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for those observations, Kristian!

          How's this theory:

          The embedded variant of the compat test doesn't wait for the forked process to shut down before it moves on to the next version (it does wait until is sees that the process prints "OK", but that doesn't mean every database thread has stopped). If a checkpoint happens to be running, the forked process could be deleting the stub conglomerates when we invoke removeDirectory().

          If such a stub is deleted after removeDirectory() calls File.list() and before it actually calls delete(), it'll fail to delete it (since it's already deleted) and add it to the list of failed deletes. However, it will still try to delete the parent directory, and that's successful since there aren't any files there.

          The code that deleted the directory before we switched to using BaseTestCase.removeDirectory() would give up as soon as one of the files couldn't be deleted, so the next test would fail because it found a half-deleted database. BaseTestCase.removeDirectory() is on the other extreme: it goes on trying to delete files even after it has failed to delete one, and gets surprised when it sees that it actually had succeeded.

          I can see these possible solutions (not mutually exclusive):

          1) Make the compatibility test wait for the forked process to complete.

          2) Stop BaseTestCase.removeDirectory() from failing if the reason why it couldn't delete a file was that the file no longer existed.

          We should probably at least do 1. Not so sure about 2, since it usually means that there is a problem if we have a delete race, and it would be good to get it reported.

          Show
          Knut Anders Hatlen added a comment - Thanks for those observations, Kristian! How's this theory: The embedded variant of the compat test doesn't wait for the forked process to shut down before it moves on to the next version (it does wait until is sees that the process prints "OK", but that doesn't mean every database thread has stopped). If a checkpoint happens to be running, the forked process could be deleting the stub conglomerates when we invoke removeDirectory(). If such a stub is deleted after removeDirectory() calls File.list() and before it actually calls delete(), it'll fail to delete it (since it's already deleted) and add it to the list of failed deletes. However, it will still try to delete the parent directory, and that's successful since there aren't any files there. The code that deleted the directory before we switched to using BaseTestCase.removeDirectory() would give up as soon as one of the files couldn't be deleted, so the next test would fail because it found a half-deleted database. BaseTestCase.removeDirectory() is on the other extreme: it goes on trying to delete files even after it has failed to delete one, and gets surprised when it sees that it actually had succeeded. I can see these possible solutions (not mutually exclusive): 1) Make the compatibility test wait for the forked process to complete. 2) Stop BaseTestCase.removeDirectory() from failing if the reason why it couldn't delete a file was that the file no longer existed. We should probably at least do 1. Not so sure about 2, since it usually means that there is a problem if we have a delete race, and it would be good to get it reported.
          Hide
          Kristian Waagan added a comment -

          I'm puzzled by the output from the delete method and the test log file, but here is what I see:
          a) The first round of attempted deletes fails to delete 37 files.
          b) The next round fails with an assert because the database root directory isn't found:
          junit.framework.AssertionFailedError: directory doesn't exist: /export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat

          At this point, these facts supports at least these two theories (assuming the Derby delete method is correct):
          1) A "delete race" taking place (directory/files deleted from two or more threads/processes concurrently).
          2) A bug in the VM, either misreporting the outcome of the deletions, or allowing non-empty directories to be deleted.

          However, looking at the files that couldn't be deleted, one sees that they are all stub files (starting with 'd' instead of 'c'). I don't know the specifics, but I believe stub files are deleted by Derby itself at certain times (for instance during/after a checkpoint?). That makes this bug a timing and/or order sensitive bug with respect to when the files are attempted deleted. One explanation is that Derby is still shutting down when the delete method obtains the list of files to delete, and when it tries to delete the stub files they have already been deleted by Derby.

          This is merely a hypothesis, and it depends on whether the database server is shut down completely when the deletion begins.
          Anyone with knowledge about the runtime characteristics of the upgrade test around?

          Show
          Kristian Waagan added a comment - I'm puzzled by the output from the delete method and the test log file, but here is what I see: a) The first round of attempted deletes fails to delete 37 files. b) The next round fails with an assert because the database root directory isn't found: junit.framework.AssertionFailedError: directory doesn't exist: /export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat At this point, these facts supports at least these two theories (assuming the Derby delete method is correct): 1) A "delete race" taking place (directory/files deleted from two or more threads/processes concurrently). 2) A bug in the VM, either misreporting the outcome of the deletions, or allowing non-empty directories to be deleted. However, looking at the files that couldn't be deleted, one sees that they are all stub files (starting with 'd' instead of 'c'). I don't know the specifics, but I believe stub files are deleted by Derby itself at certain times (for instance during/after a checkpoint?). That makes this bug a timing and/or order sensitive bug with respect to when the files are attempted deleted. One explanation is that Derby is still shutting down when the delete method obtains the list of files to delete, and when it tries to delete the stub files they have already been deleted by Derby. This is merely a hypothesis, and it depends on whether the database server is shut down completely when the deletion begins. Anyone with knowledge about the runtime characteristics of the upgrade test around?
          Hide
          Knut Anders Hatlen added a comment -

          Failed again today: http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/testing/testlog/sol32/1101932-compatibility_diff.txt

          It looks like there's a problem with deleting some of the files in the seg0 directory, but it's hard to say exactly what the problem is.

          Deleting database dir '/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat'
          <assertDirectoryDeleted> attempt 1 left 37 files/dirs behind: 0=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d480.dat 1=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d490.dat 2=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d4b0.dat 3=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d4f0.dat 4=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d510.dat 5=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d4d0.dat 6=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d520.dat 7=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d550.dat 8=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d580.dat 9=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d5a0.dat 10=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d560.dat 11=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d5b0.dat 12=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d5d0.dat 13=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d630.dat 14=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d670.dat 15=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d690.dat 16=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d650.dat 17=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d6c0.dat 18=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d6f0.dat 19=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d720.dat 20=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d760.dat 21=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d770.dat 22=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d7e0.dat 23=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d810.dat 24=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d850.dat 25=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d880.dat 26=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8b0.dat 27=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8d0.dat 28=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d890.dat 29=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8e0.dat 30=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8c0.dat 31=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d930.dat 32=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d960.dat 33=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d9f0.dat 34=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d9b0.dat 35=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/da10.dat 36=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/da60.dat

          Show
          Knut Anders Hatlen added a comment - Failed again today: http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/testing/testlog/sol32/1101932-compatibility_diff.txt It looks like there's a problem with deleting some of the files in the seg0 directory, but it's hard to say exactly what the problem is. Deleting database dir '/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat' <assertDirectoryDeleted> attempt 1 left 37 files/dirs behind: 0=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d480.dat 1=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d490.dat 2=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d4b0.dat 3=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d4f0.dat 4=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d510.dat 5=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d4d0.dat 6=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d520.dat 7=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d550.dat 8=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d580.dat 9=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d5a0.dat 10=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d560.dat 11=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d5b0.dat 12=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d5d0.dat 13=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d630.dat 14=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d670.dat 15=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d690.dat 16=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d650.dat 17=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d6c0.dat 18=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d6f0.dat 19=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d720.dat 20=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d760.dat 21=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d770.dat 22=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d7e0.dat 23=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d810.dat 24=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d850.dat 25=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d880.dat 26=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8b0.dat 27=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8d0.dat 28=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d890.dat 29=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8e0.dat 30=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d8c0.dat 31=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d930.dat 32=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d960.dat 33=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d9f0.dat 34=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/d9b0.dat 35=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/da10.dat 36=/export/home/tmp/jagtmp/autoderbyN_regression/compatibility_3/log/wombat/seg0/da60.dat
          Hide
          Knut Anders Hatlen added a comment -

          This test has failed every day for a week now. But, of course, once we added extra logging to see what was going on, it didn't fail...

          Show
          Knut Anders Hatlen added a comment - This test has failed every day for a week now. But, of course, once we added extra logging to see what was going on, it didn't fail...
          Hide
          Knut Anders Hatlen added a comment -

          The patch d5222-1a-logging.diff makes CompatibilityCombinations use a helper method from BaseTestCase to delete the database directory. That helper method has better logging when something goes wrong. At least we should be able to see which files it cannot delete. Committed to trunk with revision 1101059.

          Show
          Knut Anders Hatlen added a comment - The patch d5222-1a-logging.diff makes CompatibilityCombinations use a helper method from BaseTestCase to delete the database directory. That helper method has better logging when something goes wrong. At least we should be able to see which files it cannot delete. Committed to trunk with revision 1101059.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development