Derby
DERBY-5108

Intermittent failure in AutomaticIndexStatisticsTest.testShutdownWhileScanningThenDelete on Windows

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 10.8.1.2
    • Fix Version/s: 10.8.2.2, 10.9.1.0
    • Component/s: Services
    • Labels:
      None
    • Environment:
      Windows platforms.
    • Urgency:
      Normal
    • Bug behavior facts:
      Regression, Regression Test Failure

      Description

      The test AutomaticIndexStatisticsTest.testShutdownWhileScanningThenDelete fails intermittently on Windows platforms because the test is unable to delete a database directory.
      Even after several retries and sleeps (the sleep before each attempt is (attempt - 1) * 2000 ms, for a total sleep time of 12 seconds), the conglomerate system\singleUse\copyShutdown\seg0\c481.dat cannot be deleted.
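
      For illustration, a retry loop of that shape might look like the following minimal sketch (hypothetical names; not the actual BaseJDBCTestCase.assertDirectoryDeleted code):

      // Retry a recursive delete with the backoff described above: sleep
      // (attempt - 1) * 2000 ms before each attempt, i.e. 0 + 2 + 4 + 6 seconds
      // = 12 seconds of sleeping in total over four attempts.
      import java.io.File;

      public class DirectoryDeleteRetry {
          private static final int MAX_ATTEMPTS = 4;

          /** Returns true if the directory tree was deleted within the retry budget. */
          public static boolean deleteWithRetries(File dir) throws InterruptedException {
              for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                  Thread.sleep((attempt - 1) * 2000L);   // back off before each retry
                  if (deleteRecursively(dir)) {
                      return true;
                  }
              }
              return false;
          }

          private static boolean deleteRecursively(File f) {
              File[] children = f.listFiles();           // null for plain files
              if (children != null) {
                  for (File child : children) {
                      deleteRecursively(child);
                  }
              }
              return f.delete();                         // fails if a handle is still open, as seen on Windows here
          }
      }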

      For instance from http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.6/testing/testlog/w2003/1078855-suitesAll_diff.txt :
      (truncated paths)
      testShutdownWhileScanningThenDelete <assertDirectoryDeleted> attempt 1 left 3 files/dirs behind: 0=system\singleUse\copyShutdown\seg0\c481.dat 1=system\singleUse\copyShutdown\seg0 2=system\singleUse\copyShutdown
      <assertDirectoryDeleted> attempt 2 left 3 files/dirs behind: 0=system\singleUse\copyShutdown\seg0\c481.dat 1=system\singleUse\copyShutdown\seg0 2=system\singleUse\copyShutdown
      <assertDirectoryDeleted> attempt 3 left 3 files/dirs behind: 0=system\singleUse\copyShutdown\seg0\c481.dat 1=system\singleUse\copyShutdown\seg0 2=system\singleUse\copyShutdown
      <assertDirectoryDeleted> attempt 4 left 3 files/dirs behind: 0=system\singleUse\copyShutdown\seg0\c481.dat 1=system\singleUse\copyShutdown\seg0 2=system\singleUse\copyShutdown
      used 205814 ms F.

      Maybe the database isn't shut down, or some specific timing of events causes a file to be reopened when it shouldn't have been (i.e. after the database shutdown has been initiated).

      Attachments

      1. derby-5108-1a-early_istats_shutdown.diff (2 kB, Kristian Waagan)
      2. derby-5108-2.diff (1.0 kB, Dag H. Wanvik)
      3. derby-5108-2a-early_istats_shutdown_broad.diff (3 kB, Kristian Waagan)
      4. derby-5108-3a-istat_log_abort.diff (1 kB, Kristian Waagan)
      5. javacore.20110309.125807.4048.0001.txt (323 kB, Mike Matrigali)
      6. waitOnShutdown.diff (3 kB, Knut Anders Hatlen)

        Issue Links

          Activity

          Kristian Waagan created issue -
          Kathey Marsden added a comment - - edited

          I saw this on weme 6.2 10.8.0.1 alpha - (1079089) 3/7/2011

          Myrna van Lunteren added a comment -

          I saw this with ibm 1.6 - 10.8.0.1 alpha (1078654), on windows XP.

          Kristian Waagan made changes -
          Link This issue is related to DERBY-4915 [ DERBY-4915 ]
          Mike Matrigali added a comment -

          more detail on the 3/7 weme occurrence:
          http://people.apache.org/~myrnavl/derby_test_results/main/windows/testlog/weme6.2/1079089-suites.All_diff.txt
          There was 1 failure:
          1) testShutdownWhileScanningThenDelete(org.apache.derbyTesting.functionTests.tests.store.AutomaticIndexStatisticsTest)junit.framework.AssertionFailedError: Failed to delete 3 files (root=C:\jartest\JarResults.2011-03-07\weme6.2_suites.All\system\singleUse\copyShutdown: C:\jartest\JarResults.2011-03-07\weme6.2_suites.All\system\singleUse\copyShutdown\seg0\c481.dat (isDir=false, canRead=true, canWrite=true, size=22351872), C:\jartest\JarResults.2011-03-07\weme6.2_suites.All\system\singleUse\copyShutdown\seg0 (isDir=true, canRead=true, canWrite=true, size=0), C:\jartest\JarResults.2011-03-07\weme6.2_suites.All\system\singleUse\copyShutdown (isDir=true, canRead=true, canWrite=true, size=0)
          at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertDirectoryDeleted(BaseJDBCTestCase.java:1526)
          at org.apache.derbyTesting.functionTests.tests.store.AutomaticIndexStatisticsTest.testShutdownWhileScanningThenDelete(AutomaticIndexStatisticsTest.java:188)
          at java.lang.reflect.AccessibleObject.invokeV(AccessibleObject.java:195)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:112)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
          at junit.extensions.TestSetup.run(TestSetup.java:23)
          at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
          at junit.extensions.TestSetup.run(TestSetup.java:23)

          Kristian Waagan made changes -
          Link This issue is related to DERBY-5026 [ DERBY-5026 ]
          Mike Matrigali added a comment -

          This issue seems to be reproducible in my environment. 3 out of 3 times so far just running the AutomaticIndexStatisticsTest alone - I will try more later. As a test I bumped up the retry to 100 and went to lunch and came back to the test still stuck trying to delete the same file. So it seems likely a bug in the code rather than not enough wait time in the test. I am going to attach a java core I took while it was looping. I see a derby.rawStoreDaemon but have not debugged enough to know if this is expected or not in this case. I don't know if the harness when running just this test case might still create the default db and leave it up while then creating and starting a separate one for the single use database case that this test uses.

          Mike Matrigali added a comment -

          snapshot of system looping while trying to delete last file in the seg0 directory.

          Mike Matrigali made changes -
          Attachment javacore.20110309.125807.4048.0001.txt [ 12473202 ]
          Myrna van Lunteren made changes -
          Link This issue is related to DERBY-5123 [ DERBY-5123 ]
          Mike Matrigali made changes -
          Assignee Mike Matrigali [ mikem ]
          Mike Matrigali added a comment -

           I am actively working on this one. Right now I am concentrating on the SANE, classes, Windows failure of just this one fixture. For me on my laptop this fails every time.
           Unless anyone protests I am going to check in a change that disables just this one fixture while I am working on it. I have verified that the problem file is the
           data file that istat is scanning when the shutdown happens. It stuck around for over an hour, so I am going to assume the resource is stuck open at least until the thread
           that started the server goes away, and in the worst case until the JVM comes down.

           I am marking this a regression, because a user prior to 10.8 who shuts down while his threads are doing nothing won't see this. But in 10.8 istat might be running
           and he will run into it. I could see this leading to problems with subsequent DDL which has to do deletes or renames of the associated table.

           My current guess is that DERBY-5037 fixed the symptom but shutdown is still failing and leaving a real resource problem. I still need to do some more debugging to
           verify what is going on. In my case I am consistently getting the ASSERT. But I believe we see the issue in nightly runs against SANE=false, so it is not specific to stuff
           we do in ASSERT logic. I am hoping that fixing the SANE path will apply to the SANE=false path also.

           Logically it seems like what should happen on a "clean" shutdown is that first user threads should be shut down, then istat, and finally store. I wonder if in 10.8 we can take advantage of the interrupt work to quickly and safely shut down the user and istat threads?

          Mike Matrigali made changes -
          Bug behavior facts [Regression Test Failure] [Regression, Regression Test Failure]
          Priority Major [ 3 ] Blocker [ 1 ]
          Kathey Marsden added a comment -

          +1 to making this a blocker. I have seen cases in the past where I have instructed users to wait for a bit after shutdown before deleting the database to avoid deletion errors. If that won't work reliably no matter how long they wait it would be extremely problematic.

          Kristian Waagan added a comment -

          Thanks for looking into this, Mike.
          If you enable istat logging and tracing, you should be able to see which exception/error is raised when the istat daemon is shut down (i.e. is it always being raised in the same place?)

          Kathey, even though I agree this could be extremely problematic, there are some points to consider:
          o the istat daemon must be doing work as the database is shut down
          o the bug is timing dependent (it doesn't fail every time on the regression test machines, nor on my Windows Vista quad core)
          o the user must try to delete the database files
          o affects Windows platforms only

          With the points above, this issue is still not a blocker for me, especially since the feature can be disabled. I still think this issue is the most important istat bug to fix, so it would be great if it made it into 10.8.

           It is unclear to me at this point what happens resource-wise if you just skip the delete step, reboot the database and repeat the sequence (without killing the JVM).
           One thing that will definitely make it a blocker is if the database can't be booted and/or written into after the bug occurs (again, without killing the JVM, as would typically be the case in appserver environments).

          Rick Hillegas added a comment -

          Changing the urgency to Blocker. That makes this issue a showstopper for 10.8.1. If I understand Mike's analysis, we could see problems with DDL on the table in the following scenario:

          o Running on Windows
          o The application shuts down the database but the application continues to do other work.
          o The application then reboots the database for more work.

          Rick Hillegas made changes -
          Urgency Blocker
          Lily Wei added a comment -

           +1 to making this a blocker, since an application that shuts down the database and then deletes it will run into this problem on Windows.

          Mike Matrigali added a comment -

           It looks like one of the last things that happens in the istat thread, prior to store performing its shutdown and closing all of its open files, is that
           it recovers an open from an interrupt. I'll include the stacks below that show this. What should we do if the istat thread gets an interrupt? I am leaning
           toward having it check for interrupts at critical places and then give up on the work it is doing. Maybe there is a place where it could check whether it should be
           shutting down that needs to be checked more often?

           The store shutdown code only closes files that are not currently "kept" (store's term for a file that is open by a store client), so it is not surprising that the file is
           not closed if the shutdown executes while the istat thread is busy reading from the file. Here is the stack of the store closing its open files:
           at org.apache.derby.impl.store.raw.data.RAFContainer.closeContainer(RAFContainer.java:216)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.closeContainer(RAFContainer4.java:227)
           at org.apache.derby.impl.store.raw.data.FileContainer.clearIdentity(FileContainer.java:476)
           at org.apache.derby.impl.services.cache.ConcurrentCache.removeEntry(ConcurrentCache.java:167)
           at org.apache.derby.impl.services.cache.ConcurrentCache.ageOut(ConcurrentCache.java:583)
           at org.apache.derby.impl.services.cache.ConcurrentCache.shutdown(ConcurrentCache.java:598)
           at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.stop(BaseDataFileFactory.java:519)
           at org.apache.derby.impl.services.monitor.TopService.stop(TopService.java:443)
           at org.apache.derby.impl.services.monitor.TopService.shutdown(TopService.java:394)
           at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:229)

          Here is some debug showing the stacks where the istat daemon is recovering from an interrupt: on the file of interest c481.dat - conglom/container id 1153
           Thu Mar 10 16:22:59 PST 2011 Thread[index-stat-thread,5,main] {istat,trace@637806084} worker thread started (xid=24016) [q/p/s=1/0/1,err:k/u/c=0/0/0,rej:f/d//0/0]
           Thu Mar 10 16:22:59 PST 2011 Thread[index-stat-thread,5,main] {istat,trace@637806084} processing "APP"."BIG_TABLE"
           DEBUG IndexStatisticsDaemonImpl OUTPUT: opened conglomerateNumber[1] = 1153
           DEBUG RAFContainer OUTPUT: about to close file C:\derby\s1\systest\out\junit\system\singleUse\copyShutdown\seg0\c481.dat
           Exception trace:
           java.lang.Throwable
           at org.apache.derby.impl.store.raw.data.RAFContainer.closeContainer(RAFContainer.java:216)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.closeContainer(RAFContainer4.java:227)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.recoverContainerAfterInterrupt(RAFContainer4.java:860)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(RAFContainer4.java:368)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(RAFContainer4.java:246)
           at org.apache.derby.impl.store.raw.data.CachedPage.readPage(CachedPage.java:673)
           at org.apache.derby.impl.store.raw.data.CachedPage.setIdentity(CachedPage.java:193)
           at org.apache.derby.impl.services.cache.ConcurrentCache.find(ConcurrentCache.java:295)
           at org.apache.derby.impl.store.raw.data.FileContainer.getUserPage(FileContainer.java:2530)
           at org.apache.derby.impl.store.raw.data.FileContainer.getPage(FileContainer.java:2580)
           at org.apache.derby.impl.store.raw.data.BaseContainerHandle.getPage(BaseContainerHandle.java:319)
           at org.apache.derby.impl.store.access.btree.ControlRow.get(ControlRow.java:833)
           at org.apache.derby.impl.store.access.btree.ControlRow.get(ControlRow.java:820)
           at org.apache.derby.impl.store.access.btree.ControlRow.getRightSibling(ControlRow.java:531)
           at org.apache.derby.impl.store.access.btree.BTreeScan.positionAtNextPage(BTreeScan.java:493)
           at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:464)
           at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNextGroup(BTreeScan.java:1681)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl$KeyComparator.fetchRows(IndexStatisticsDaemonImpl.java:1174)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.updateIndexStatsMinion(IndexStatisticsDaemonImpl.java:470)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.generateStatistics(IndexStatisticsDaemonImpl.java:327)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.processingLoop(IndexStatisticsDaemonImpl.java:777)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.run(IndexStatisticsDaemonImpl.java:687)
           at java.lang.Thread.run(Thread.java:736)
           DEBUG RAFContainer OUTPUT: closed file C:\derby\s1\systest\out\junit\system\singleUse\copyShutdown\seg0\c481.dat
           DEBUG RAFContainer OUTPUT: opening file: C:\derby\s1\systest\out\junit\system\singleUse\copyShutdown\seg0\c481.dat
           Exception trace:
           java.lang.Throwable
           at org.apache.derby.impl.store.raw.data.RAFContainer.run(RAFContainer.java:1587)
           at java.security.AccessController.doPrivileged(AccessController.java:251)
           at org.apache.derby.impl.store.raw.data.RAFContainer.openContainerMinion(RAFContainer.java:939)
           at org.apache.derby.impl.store.raw.data.RAFContainer.reopenContainer(RAFContainer.java:914)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.recoverContainerAfterInterrupt(RAFContainer4.java:861)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(RAFContainer4.java:368)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(RAFContainer4.java:246)
           at org.apache.derby.impl.store.raw.data.CachedPage.readPage(CachedPage.java:673)
           at org.apache.derby.impl.store.raw.data.CachedPage.setIdentity(CachedPage.java:193)
           at org.apache.derby.impl.services.cache.ConcurrentCache.find(ConcurrentCache.java:295)
           at org.apache.derby.impl.store.raw.data.FileContainer.getUserPage(FileContainer.java:2530)
           at org.apache.derby.impl.store.raw.data.FileContainer.getPage(FileContainer.java:2580)
           at org.apache.derby.impl.store.raw.data.BaseContainerHandle.getPage(BaseContainerHandle.java:319)
           at org.apache.derby.impl.store.access.btree.ControlRow.get(ControlRow.java:833)
           at org.apache.derby.impl.store.access.btree.ControlRow.get(ControlRow.java:820)
           at org.apache.derby.impl.store.access.btree.ControlRow.getRightSibling(ControlRow.java:531)
           at org.apache.derby.impl.store.access.btree.BTreeScan.positionAtNextPage(BTreeScan.java:493)
           at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:464)
           at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNextGroup(BTreeScan.java:1681)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl$KeyComparator.fetchRows(IndexStatisticsDaemonImpl.java:1174)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.updateIndexStatsMinion(IndexStatisticsDaemonImpl.java:470)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.generateStatistics(IndexStatisticsDaemonImpl.java:327)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.processingLoop(IndexStatisticsDaemonImpl.java:777)
           at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.run(IndexStatisticsDaemonImpl.java:687)
           at java.lang.Thread.run(Thread.java:736)
           ----------------------------------------------------------------
           ----------------------------------------------------------------
           Thu Mar 10 16:22:59 PST 2011:
           Shutting down instance 35924132-012e-a249-3f92-ffffe527380d on database directory C:\derby\s1\systest\out\junit\system\singleUse\copyShutdown with class loader sun.misc.Launcher$AppClassLoader@49be49be
           shutting down:
           isCorrupt = false
           pageCache = org.apache.derby.impl.services.cache.ConcurrentCache@1c2a1c2a
           containerCache = org.apache.derby.impl.services.cache.ConcurrentCache@20fa20fa

          Mike Matrigali added a comment - - edited

           The more I look at this issue, the more I think the problem is that the istat daemon should shut down, and not return until it has completed this
           shutdown, when indexRefresher.stop() is called from the DataDictionary's stop() call. For a clean shutdown of the system the store
           needs all its clients shut down first; then it can cleanly shut down and force the database files and transaction logs, ensuring
           a clean shutdown with no recovery work necessary on the next boot.

           By leaving the istat daemon running we can run into a number of errors that I don't think can be solved. We might fix the specific one shown
           up by this test, but the system is just not designed to handle clean shutdown while stuff is still running without first waiting for the running
           stuff to stop somehow.

           Kristian noted in DERBY-5037:
           > I think Mike's comments/observations above agree pretty much with my thinking when writing the code. Seems there are several error-handling issues to iron out though...
           > A few specific comments:
           > o I decided to not make Derby wait for the background thread to finish on shutdown, as it might potentially be scanning a very large table.
           > o Logging is rather verbose now during testing, but I agree it should be less verbose (or maybe turned off completely) when released.
           > o I'm logging a lot of exceptions to aid testing/debugging. These should also go away, or be enabled by a property if the user wishes to do so.

           I now think that it was wrong to not wait for the background thread. Waiting would match the behavior of the rawStoreDaemon thread, which is "owned"
           by the raw store module: the module stops the daemon, the daemon waits for its work to stop/complete before returning from the stop, and
           then the raw store continues with its data and transaction file cleanup prior to stopping. I agree it would be a nice optimization to somehow stop the background thread in
           the middle of a big scan, and it seems like with the better interrupt support this should be much easier than was the case before 10.8. I would like
           some feedback before proceeding from those more knowledgeable about the istat work.

           I do think that the work Rick did for DERBY-5037 is still valuable, as it will handle much better the non-clean shutdowns that Derby can experience. For
           a non-clean shutdown we might have to just live with a file left open until the thread or JVM exits, but for a requested orderly shutdown of the system I
           think we should go with the top-down shutdown supported by the architecture, rather than try to fix errors encountered when top-level modules are still
           running while lower-level modules are trying to shut down.

          Knut Anders Hatlen added a comment -

          Engine shutdown interrupts all threads that are active inside Derby code. I'm wondering if that's the interrupt the istat thread is experiencing. DERBY-4920 was similar, only then it was a checkpoint that was interrupted by the engine shutdown and reopened the channel. I don't know if we want to handle interrupts differently for the istat daemon than we do for other threads. I'd rather see that the reopening logic in store knew that it shouldn't attempt to reopen channels if the engine is about to shut down. I thought it already had some logic to take care of that, but I'm not sure.

          Waiting for the istat thread before completing shutdown sounds OK. I think the reason why it doesn't do that now, was a concern that it may take a very long time if it's processing a large index. But shutdown also performs a checkpoint, which may take a long time if the page cache is big. And if the istat thread is also interrupted by the engine shutdown, it should stop rather quickly (if we only can make it stop reopening the channel).

          I'm wondering, though, if this problem is istat-specific, or if we would see the same problem if we called SYSCS_UPDATE_STATISTICS on a large index just before shutting down the database?

          Knut Anders Hatlen added a comment -

          I invoked an engine shutdown while a call to SYSCS_UTIL.SYSCS_UPDATE_STATISTICS was in progress, and then I observed that the process still had an open file handle to the index file after the shutdown had completed. I had disabled istat when I ran this experiment.

          Rick Hillegas made changes -
          Link This issue is related to DERBY-5131 [ DERBY-5131 ]
          Knut Anders Hatlen added a comment -

          I don't think we need to use interrupts to shut down the istat thread more quickly. IndexStatisticsDaemonImpl.stop() sets a flag daemonDisabled that we can check every now and then while scanning the indexes. Currently, the flag is checked right before the scan starts, but I was thinking we could check that flag for example each time we've fetched a new chunk of rows from store. The attached patch does just that, and also makes stop() wait for the istat thread to complete if there is an active thread.

           Does that help on the machine where you're able to reproduce the problem, Mike?
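
           For reference, the pattern described in waitOnShutdown.diff is roughly the following (a simplified sketch with assumed names, not the actual IndexStatisticsDaemonImpl code):

           // Sketch of a stoppable background scanner: stop() raises a flag that the
           // worker checks between chunks of rows, then joins the worker thread so
           // shutdown does not proceed while the scan is still touching files.
           // All names and types here are illustrative, not Derby's real API.
           public class IstatDaemonSketch {
               private volatile boolean daemonDisabled;
               private volatile Thread runningThread;

               /** Worker loop: process one chunk at a time, bail out once stop() is called. */
               void scanIndex(ChunkSource source) {
                   runningThread = Thread.currentThread();
                   try {
                       while (!daemonDisabled) {
                           Chunk rows = source.fetchNextChunk();
                           if (rows == null) {
                               break;              // end of the index reached
                           }
                           updateStatistics(rows);
                       }
                   } finally {
                       runningThread = null;
                   }
               }

               /** Stop accepting work and wait for the active worker thread, if any. */
               public void stop() {
                   daemonDisabled = true;          // seen by the worker at the next chunk boundary
                   Thread t = runningThread;
                   if (t != null) {
                       try {
                           t.join();               // don't return until the worker is done
                       } catch (InterruptedException ie) {
                           Thread.currentThread().interrupt();
                       }
                   }
               }

               // Placeholders so the sketch is self-contained; not part of Derby.
               interface ChunkSource { Chunk fetchNextChunk(); }
               interface Chunk {}
               private void updateStatistics(Chunk rows) { /* compute and store statistics */ }
           }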

          Knut Anders Hatlen made changes -
          Attachment waitOnShutdown.diff [ 12473569 ]
          Lily Wei added a comment -

           Thanks Knut for the patch. I applied it on my Windows 7 machine. I used to be able to reproduce the delete problem quickly; after running AutomaticIndexStatisticsTest 5 times with the patch, the delete problem did not show up. With the waitOnShutdown.diff patch, checking the daemonDisabled flag each time we fetch a new chunk of rows from store and making stop() wait for the istat thread to complete if there is an active thread definitely helps. Does this approach fulfill the top-down shutdown supported by the architecture?

          Knut Anders Hatlen added a comment - - edited

          Thanks for testing the patch, Lily!

          > Does this approach fulfill the top down shutdown supported by the architecture?

          Yes, I think so. The istat service is shut down from the data dictionary, which is top-down. That was done before as well, but the actual worker thread would be left running until it finished by itself. With the patch, stop() won't complete until the worker thread is done, so when the higher levels proceed in the shutdown code, we know that the istat thread won't be causing any bad interactions.

          Mike Matrigali added a comment -

           Thanks for the work, Knut. Your patch looks good to me so far; I am running my single test now.

           Even though update statistics in a user thread exhibits the same symptom, I would argue that it is much more serious in a release where istat is on by default.
           A user shutting down while actively running can work around the issue, but the background process is not in their control (other than disabling it entirely). So doing more
           work in istat to behave nicely during shutdown is important in my opinion. We should probably log a separate JIRA for your update statistics test case. I
           assume other work by user threads probably has a similar problem. It would be good to know if this is a regression caused by the improved interrupt
           handling.

           You bring up some interesting thoughts on interrupts and shutdown. I think we should start a thread and discuss that separately from this JIRA, deciding first
           what is the right thing to do and then how to do it.

          Mike Matrigali added a comment - - edited

           In my environment I just got 4 for 4 errors without the patch and 4 for 4 successes with the patch. I will submit a fix once I can get a clean full test run, probably late night Pacific time.

           4 out of 4 worked and the 5th failed. So the fix is better, but something is still going on; will continue working ...

          Rick Hillegas added a comment -

          Reducing the urgency of this issue. Now that istat is not turned on by default for 10.8.1, this issue is not a blocker for the release.

          Rick Hillegas made changes -
          Urgency Blocker Normal
          Knut Anders Hatlen added a comment -

          There may be other issues than the istat daemon coming into play here. I see that RawStore.stop() doesn't wait for the checkpoint to complete if there already is a checkpoint in progress, so there may be similar problems in that code. A related problem with checkpoints reopening containers during shutdown was fixed in DERBY-4920.

          Knut Anders Hatlen added a comment -

          I added some println statements to TopService.shutdown() to see in which order the modules were taken down. I found that BaseDataFileFactoryJ4 was stopped before DataDictionaryImpl (which also stops the istat daemon). BaseDataFileFactoryJ4 shuts down the page cache and the container cache, so ideally no I/O should happen after that point. Since we don't stop the istat thread until after the caches have been shut down, that may explain why we still see that files are being reopened. I think it's the container cache that closes the files.

          BaseMonitor sends an interrupt to all active threads before invoking TopService.shutdown(), so maybe what's happening is something like this:

          1) BaseMonitor interrupts the threads and goes on shutting down the modules.

          2) The istat thread notices the interrupt inside the store code and tries reopening the container that was closed by the interrupt. But before it has completed that task, ...

          3) The data file factory is shut down, and all data files are closed when the container cache shuts down.

          4) The istat thread finishes the reopening of the index file.

          5) DataDictionaryImpl stops the istat thread, but too late.

          Mike Matrigali added a comment -

           I am not actively working on this any more. I think it should be left open until after istat is re-enabled or the istat tests are changed to run even if istat is disabled by default. Then it should be closed if we go a reasonable amount of time without seeing it in the nightlies or having it reported by anyone.

          Mike Matrigali made changes -
          Assignee Mike Matrigali [ mikem ]
          Knut Anders Hatlen added a comment -

          > if we go a reasonable amount of time without seeing it in nightly's or reported by anyone

          Didn't you say you still saw the problem in one out of five runs with the fix?

          Mike Matrigali added a comment -

           Yesterday I tried exactly 5 runs using the exact submitted patch and it failed once. I then modified the patch, mostly adding comments and a slightly different break structure. I then ran it exactly 40 times and it did not reproduce. I then ran the full set of tests in a codeline with istat still enabled as default and they all passed, and checked it in - but had to update my codeline to do the submit. I have not run it since. It is a little harder to run since istat does not run by default - which was why I was looking for a checked-in solution to running the tests no matter what the default happened to be. I just have not gotten around to hacking a codeline or test run to do it since. I wish the disabling had waited and happened solely in the branch if necessary, so that we could be getting testing on istat in the trunk now.

          Knut Anders Hatlen added a comment -

          Thanks, Mike. Sounds like it fixed most cases then. I still suspect the shutdown order is problematic, so I'd like to keep the bug open at least until we've investigated and found out if it actually is a problem.

          As to running AutomaticIndexStatisticsTest on head of trunk when istat is disabled, I found that I could do that by adding -Dderby.storage.indexStats.auto=true to the java command when invoking the test runner. I would think that it should also work on suites.All.
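
          For illustration, the kind of command line described above (the JUnit 3 text runner and the classpath setup are assumptions about how the test is normally launched, so adjust to your environment):

              java -Dderby.storage.indexStats.auto=true junit.textui.TestRunner org.apache.derbyTesting.functionTests.tests.store.AutomaticIndexStatisticsTest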

          Dag H. Wanvik added a comment -

          Attaching a small fix to handle the case where the thread stopping the daemon is interrupted while doing the IndexStatisticsDaemonImpl#join introduced in this issue: note the interrupt and retry the join. I think this is more in keeping with what we do elsewhere now.

          Running regressions.
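
          A minimal sketch of the note-and-retry join pattern described above (illustrative only, not the attached patch; the method and field names here are made up):

              // Wait for the daemon thread to terminate. If we are interrupted while
              // waiting, remember that fact, keep waiting, and restore the caller's
              // interrupt flag once the join has completed.
              private static void joinDaemon(Thread daemonThread) {
                  boolean interrupted = false;
                  while (true) {
                      try {
                          daemonThread.join();
                          break;
                      } catch (InterruptedException ie) {
                          // Note the interrupt and retry the join.
                          interrupted = true;
                      }
                  }
                  if (interrupted) {
                      // Restore the interrupt status for code further up the stack.
                      Thread.currentThread().interrupt();
                  }
              }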

          Dag H. Wanvik made changes -
          Attachment derby-5108-2.diff [ 12473929 ]
          Dag H. Wanvik added a comment -

          Regression ran ok with my patch.

          Dag H. Wanvik made changes -
          Issue & fix info [Patch Available]
          Knut Anders Hatlen added a comment -

          Thanks for the followup patch, Dag. I think it makes sense to change the interrupt handling to match what we do in the rest of the engine.

          I applied the patch and added a call to Thread.interrupt() to make join() always throw an exception. I verified that it saved the interrupt status and continued waiting until the thread had completed. However, when I checked the interrupt flag after DriverManager.getConnection("...;shutdown=true") had completed, the interrupt flag was not set in the thread. Perhaps the shutdown code doesn't have the mechanism in place to restore the interrupt flag?
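
          For reference, a sketch of the kind of check described above (illustrative only; the database name is hypothetical, and it requires java.sql.DriverManager and java.sql.SQLException on the classpath):

              Thread.currentThread().interrupt();   // force the join() inside the engine to throw
              try {
                  DriverManager.getConnection("jdbc:derby:testdb;shutdown=true");
              } catch (SQLException se) {
                  // Expected: embedded Derby reports a completed database shutdown
                  // as an SQLException with SQLState 08006.
              }
              // Was the interrupt status preserved across the shutdown?
              System.out.println("interrupted: " + Thread.currentThread().isInterrupted());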

          Dag H. Wanvik added a comment -

          Committed patch as svn 1085027.

          Thanks for looking at this, Knut. I thought it should restore the flag; I'll investigate it.

          Dag H. Wanvik added a comment -

          During shutdown there is a corner case where the lcc (which contains the interrupt status) is deleted before we reach the place where the flag is attempted restored. I'll file a separate issue for this.

          Mike Matrigali added a comment -

          I tried out the latest in trunk as of 3/26/2011 and ran the AutomaticIndexStatisticsTest suite 64 times and got 61 good runs and 3 failures all in the shutdown fixture as follows:
          There was 1 failure:
          1) testShutdownWhileScanningThenDelete(org.apache.derbyTesting.functionTests.tests.store.AutomaticIndexStatisticsTest)
          junit.framework.AssertionFailedError: Failed to delete 3 files (root=C:\derby\s2\newout\system\singleUse\copyShutdown: C:\derby\s2\newout\system\singleUse\copyShutdown\seg0\c481.dat (isDir=false, canRead=true, canWrite=true, size=22351872), C:\derby\s2\newout\system\singleUse\copyShutdown\seg0 (isDir=true, canRead=true, canWrite=true, size=0), C:\derby\s2\newout\system\singleUse\copyShutdown (isDir=true, canRead=true, canWrite=true, size=0)
            at org.apache.derbyTesting.junit.BaseJDBCTestCase.assertDirectoryDeleted(BaseJDBCTestCase.java:1526)
            at org.apache.derbyTesting.functionTests.tests.store.AutomaticIndexStatisticsTest.testShutdownWhileScanningThenDelete(AutomaticIndexStatisticsTest.java:188)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
            at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:112)
            at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
            at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
            at junit.extensions.TestSetup.run(TestSetup.java:23)
            at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
            at junit.extensions.TestDecorator.basicRun(TestDecorator.java:22)
            at junit.extensions.TestSetup$1.protect(TestSetup.java:19)
            at junit.extensions.TestSetup.run(TestSetup.java:23)

          There didn't seem to be anything interesting in derby.log. I think we should enable the "extra" logging by default in
          this test while there is still an issue. I would be happy to run in my environment and reproduce if there is any
          additional logging that we think will track down the issue (even if it is something we don't want checked into the
          actual codeline). Tonight I will at least run the test with the extra logging enabled from the command line.
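
          If the "extra" logging here means the istat daemon's own debug output, something along these lines in derby.properties should turn it on (the property names and the 'both' value are given from memory as an assumption and should be checked against the documentation):

              # Log index statistics daemon activity to derby.log
              derby.storage.indexStats.log=true
              # Trace daemon internals; 'both' is assumed to write to derby.log and stdout
              derby.storage.indexStats.trace=both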

          Rick Hillegas made changes -
          Affects Version/s 10.8.1.1 [ 12316356 ]
          Affects Version/s 10.8.1.0 [ 12315561 ]
          Rick Hillegas made changes -
          Affects Version/s 10.8.1.1 [ 12316356 ]
          Affects Version/s 10.8.1.2 [ 12316362 ]
          Kristian Waagan added a comment - Another occurrence seen on May 8th 2011: http://people.apache.org/~myrnavl/derby_test_results/main/windows/testSummary-1100426.html
          Kristian Waagan made changes -
          Assignee Kristian Waagan [ kristwaa ]
          Kristian Waagan added a comment -

          Attaching patch 1a, which will make Derby shut down the istat daemon before throwing the shutdown exception.
          I also tweaked the log entry to tell if the scan was aborted.

          I have tested the patch on a Windows machine and I have run the regression tests, but I had problems reproducing the error on my machine. Nonetheless, I never saw the error with the patch applied.
          If someone wants to test the patch on Windows before I commit, please let me know and I'll hold back.

          Patch ready for review.

          Kristian Waagan made changes -
          Kristian Waagan made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Dag H. Wanvik added a comment -

          Thanks, Kristian.

          I see that you put the call to disableIndexStatsRefresher at the point in EmbedConnection's constructor corresponding to a requested shutdown. Database shutdown is also initiated from other locations, e.g. from TransactionResourceImpl#handleException and others (many to do with replication handling). Is there no need to similarly stop the stats daemon when the db is shut down from these other locations?

          Kristian Waagan added a comment -

          Hi Dag,

          There is a need to stop the istat daemon in other cases too to guarantee that this issue will never happen. Several conditions must coincide for the issue to occur, and a few more for it to be observable. Nevertheless, I think the takeaway is that Derby will leak file handles. How, or whether, this affects normal Derby operation is unclear to me, and a JVM restart will clean it up.

          Attaching patch 2a.
          To make the fix broader, I moved the logic to stop the daemon to DatabaseContextImpl.cleanupOnError.
          Do you reckon this is sufficient, or are you aware of other locations that will bypass that code?

          The regression tests ran without failures with patch 2a, but I have not yet tested this on Windows (I'll try to start a test loop later today when I can access a Windows machine).

          Patch ready for review.
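
          Roughly, the shape of the change as described above (a sketch only, not the actual diff; the data dictionary accessor is invented, and the severity constant is assumed to be ExceptionSeverity.DATABASE_SEVERITY):

              // In DatabaseContextImpl.cleanupOnError (sketch): stop the istat daemon
              // before active threads are notified/interrupted, and only for errors
              // severe enough to bring down the database or the whole system.
              public void cleanupOnError(Throwable t) {
                  if (t instanceof StandardException
                          && ((StandardException) t).getSeverity()
                                  >= ExceptionSeverity.DATABASE_SEVERITY) {
                      // Keep the daemon from reopening files after the caches shut down.
                      dataDictionary.disableIndexStatsRefresher();
                  }
                  // ... existing cleanup, including notifyAllActiveThreads(), follows
              }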

          Kristian Waagan made changes -
          Dag H. Wanvik added a comment -

          I think the move should cover it. +1

          Dag H. Wanvik added a comment -

          I was just wondering about the code around lines 339-342 in TransactionResourceImpl: could that bypass it? Note that this is part of the old interrupt machinery to close down threads when shutting down the engine. But presumably by then you have already shut down the thread. It depends on what happens first, the shutdown of the statistics thread or the call to notifyAllActiveThreads, but I see that in DatabaseContextImpl#cleanupOnError you have inserted the logic before the call to notifyAllActiveThreads, so you should be golden.

          Kristian Waagan added a comment -

          Thanks for looking at the patch, Dag.

          I think your reasoning is sound.
          Do you happen to know whether exceptions other than StandardException can be passed in to cleanupOnError? What happens with InterruptedExceptions?
          I assume the normal case is that all exceptions are either StandardException or other exceptions wrapped in a StandardException, such that we can ask for the severity. In any case, if another kind of exception is passed into DatabaseContextImpl.cleanupOnError, the code to shut down the istat daemon won't be run.

          The test case testShutdownWhileScanningThenDelete ran 236 times without a failure on a Windows machine with the patch applied (I then aborted the loop).

          Committed patch 1b to trunk with revision 1133304.
          I'll monitor the automated tests (Windows platforms) to see if the issue shows itself again with this fix.

          Kristian Waagan made changes -
          Fix Version/s 10.9.0.0 [ 12316344 ]
          Issue & fix info [Patch Available]
          Kristian Waagan added a comment -

          Attaching patch 3a, which adjusts a log message if a scan is aborted.

          Committed to trunk with revision 1133317.

          Kristian Waagan made changes -
          Attachment derby-5108-3a-istat_log_abort.diff [ 12481795 ]
          Dag H. Wanvik added a comment -

          Re cleanupOnError: I believe the Throwable can be an SQLException, but essentially anything. The InterruptedExceptions are handled, I think, before we get this far, although I think I saw a couple of unhandled waits the other day, which is on my list of things to check.

          But there could be other unhandled Exceptions and Errors as well, cf. the call to DCI#cleanupOnError from ContextManager#cleanupOnError, called from TRI#handleException, called from EmbedConnection#handleException, which is called from many API methods in catch-all (Throwable) constructs.

          EmbedConnection#handleException is also called from ConnectionChild#handleException, which is used from ResultSet API methods etc.

          Note the call to wrap "other" exceptions in TRI#wrapInSQLException towards the end of TRI#handleException.

          Kristian Waagan added a comment -

          Thanks, Dag.

          Although there seem to be situations where the code shutting down the istat daemon as part of the cleanup is bypassed, I plan to leave the fix as it is for now.
          As far as I understand, non-StandardException exceptions are supposed to be wrapped (although I haven't verified whether this happens before or during/after the cleanup), and unless the exception brings down the database/system, the daemon isn't supposed to be shut down. I think it should be fairly easy to change this at a later time if we find the current fix inadequate.

          I haven't seen any more occurrences of the bug so far, but I'll continue to monitor the test results for a while longer before closing. I'd also like to see if the error shows up in the 10.8 branch testing.

          Will backport to 10.8 later.

          Kristian Waagan made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Kristian Waagan added a comment -

          Merged missing fixes (2a and 3a) to 10.8 with revision 1140583.

          Kristian Waagan made changes -
          Fix Version/s 10.8.1.5 [ 12316676 ]
          Kristian Waagan added a comment -

          Closing issue.

          Kristian Waagan made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Myrna van Lunteren added a comment -

          changing issue to change component for 10.8.2 release notes.

          Myrna van Lunteren made changes -
          Resolution Fixed [ 1 ]
          Status Closed [ 6 ] Reopened [ 4 ]
          Myrna van Lunteren made changes -
          Component/s Services [ 11415 ]
          Component/s Test [ 11413 ]
          Myrna van Lunteren made changes -
          Status Reopened [ 4 ] Closed [ 6 ]
          Resolution Fixed [ 1 ]
          Myrna van Lunteren made changes -
          Fix Version/s 10.8.2.0 [ 12317955 ]
          Fix Version/s 10.8.1.6 [ 12316676 ]
          Myrna van Lunteren made changes -
          Fix Version/s 10.8.2.2 [ 12317968 ]
          Fix Version/s 10.8.2.0 [ 12317955 ]
          Gavin made changes -
          Workflow jira [ 12601062 ] Default workflow, editable Closed status [ 12801079 ]

            People

            • Assignee:
              Kristian Waagan
              • Reporter:
              Kristian Waagan