Derby
DERBY-5624

System can run out of stack space while processing DropOnCommit requests.

    Details

    • Urgency:
      Urgent
    • Issue & fix info:
      High Value Fix, Patch Available, Repro attached

      Description

      The system currently calls xact.notifyObservers() recursively from DropOnCommit.update(). It does this
      because in some cases new observers can be added while the observer list is being processed; before the
      change those late additions were missed, causing assertion failures in the tests and possibly files not
      being properly dropped on commit.

      Multiple users on the Derby user list have reported failures running SYSCS_UTIL.SYSCS_COMPRESS_TABLE(),
      running out of stack space with a heavily recursive stack trace of the form (see more detail from these
      reports in subsequent comments):
      Caused by: java.lang.StackOverflowError
      at java.lang.ThreadLocal.get(ThreadLocal.java:125)
      at java.lang.StringCoding.deref(StringCoding.java:46)
      at java.lang.StringCoding.encode(StringCoding.java:258)
      at java.lang.String.getBytes(String.java:946)
      at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
      at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
      at java.io.File.exists(File.java:733)
      at org.apache.derby.impl.store.raw.data.StreamFileContainer.run(Unknown Source)
      at java.security.AccessController.doPrivileged(Native Method)
      at org.apache.derby.impl.store.raw.data.StreamFileContainer.privExists(Unknown Source)
      at org.apache.derby.impl.store.raw.data.StreamFileContainer.open(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.openStreamContainer(Unknown Source)
      at org.apache.derby.impl.store.raw.xact.Xact.openStreamContainer(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.dropStreamContainer(Unknown Source)
      at org.apache.derby.impl.store.raw.xact.Xact.dropStreamContainer(Unknown Source)
      at org.apache.derby.impl.store.raw.data.DropOnCommit.update(Unknown Source)
      at java.util.Observable.notifyObservers(Observable.java:142)
      at org.apache.derby.iapi.store.raw.xact.RawTransaction.notifyObservers(Unknown Source)
      at org.apache.derby.impl.store.raw.data.DropOnCommit.update(Unknown Source)
      at java.util.Observable.notifyObservers(Observable.java:142)
      at org.apache.derby.iapi.store.raw.xact.RawTransaction.notifyObservers(Unknown Source)
      at org.apache.derby.impl.store.raw.data.DropOnCommit.update(Unknown Source)
      at java.util.Observable.notifyObservers(Observable.java:142)
      at org.apache.derby.iapi.store.raw.xact.RawTransaction.notifyObservers(Unknown Source)
      at org.apache.derby.impl.store.raw.data.DropOnCommit.update(Unknown Source)
      at java.util.Observable.notifyObservers(Observable.java:142)

      1. DERBY-5624.patch
        2 kB
        Mike Matrigali
      2. largedatasuite_mod.out
        24 kB
        Myrna van Lunteren
      3. error-stacktrace_mod.out
        10 kB
        Myrna van Lunteren

        Issue Links

          Activity

          Knut Anders Hatlen made changes -
          Link This issue is duplicated by DERBY-6556 [ DERBY-6556 ]
          Gavin made changes -
          Workflow jira [ 12654092 ] Default workflow, editable Closed status [ 12796769 ]
          Mike Matrigali made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Kathey Marsden made changes -
          Fix Version/s 10.8.3.0 [ 12323456 ]
          Fix Version/s 10.8.2.3 [ 12318540 ]
          Mike Matrigali made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 10.5.3.2 [ 12315436 ]
          Fix Version/s 10.6.2.3 [ 12315434 ]
          Fix Version/s 10.7.1.4 [ 12315902 ]
          Fix Version/s 10.8.2.3 [ 12318540 ]
          Resolution Fixed [ 1 ]
          Mike Matrigali added a comment -

          fixed in trunk and backported to 10.8, 10.7, 10.6, and 10.5.

          DERBY-5624 System can run out of stack space while processing DropOnCommit requests.

           Taking care of cleanup after a commit is handled by notifying all "Observers"
           that an event has taken place that they might want to act on and clean up
           after. In the added test case this is triggered by an offline compress, which
           effectively drops and recreates the base table and all of its indexes after
           loading the data into them.

           Sometimes these Observers may execute work which adds to the Observer queue,
           and the first pass through the queue can "miss" those late additions.

           A previous fix for this problem added a recursive call to notifyObservers at
           the place that could cause this addition of observers. The recursive call
           caused stack problems when the number of Observers became large: for the
           checked-in test case, 1000 indexes on the 1000 columns of the table. For
           other users I believe the cause was a by-product of large disk-based sorts
           for multi-gigabyte tables and indexes. Two users reported similar failed
           stacks from compressing large tables, and one was able to take this fix
           into their environment and then run the compress successfully.

          The fix was to remove the recursion and instead loop at the outermost point
          until there were no Observers.

          Adding the test to the largedata suite as it takes over 10 minutes to run
          on my machine.
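           The loop-instead-of-recursion idea can be sketched as follows. This is a minimal illustration
           with hypothetical names (SketchTransaction, SketchDrop) — not Derby's actual classes — and it
           assumes only the java.util.Observable API visible in the stack traces above:

```java
import java.util.Observable;
import java.util.Observer;

// Sketch (hypothetical names) of the DERBY-5624 fix: rather than each
// observer recursively re-invoking notifyObservers() when its cleanup
// registers more observers, the transaction drives notification from a
// single outer loop until the observer list is empty.
public class DropOnCommitSketch {

    static class SketchTransaction extends Observable {
        // After the fix: a flat loop replaces the recursion. Observers
        // delete themselves once handled (as DropOnCommit does), so the
        // loop terminates when no observer re-registers.
        void notifyUntilQuiesced(Object event) {
            while (countObservers() > 0) {
                setChanged();
                notifyObservers(event); // one pass; stack depth stays constant
            }
        }
    }

    // Stand-in for DropOnCommit: handles the event, may register a
    // follow-up observer (mimicking work that adds to the queue during
    // notification), then removes itself from the list.
    static class SketchDrop implements Observer {
        final int remaining;
        SketchDrop(int remaining) { this.remaining = remaining; }
        public void update(Observable o, Object event) {
            if (remaining > 0) {
                o.addObserver(new SketchDrop(remaining - 1)); // late addition
            }
            o.deleteObserver(this); // done; drop from the list
        }
    }

    public static void main(String[] args) {
        SketchTransaction xact = new SketchTransaction();
        // A chain of 1000 drops would grow the stack by one frame per drop
        // under the old recursive scheme; the loop handles them one pass
        // at a time in constant stack depth.
        xact.addObserver(new SketchDrop(1000));
        xact.notifyUntilQuiesced("COMMIT");
        System.out.println("observers left: " + xact.countObservers());
    }
}
```

           Each pass of notifyObservers() sees only a snapshot of the list, which is why the outer loop
           (rather than a single call) is needed to catch observers added mid-notification.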

          Mike Matrigali made changes -
          Affects Version/s 10.5.3.2 [ 12315436 ]
          Dag H. Wanvik added a comment -

          I tested Derby5624Test on Solaris 11 (standalone) and it ran OK there, so it could be included in the suite for that platform, I think. It took 317s on my box.

          Myrna van Lunteren added a comment -

          I reran the largedata._Suite in windows with ibm 1.6 (sr9 fp1), build sync-ed to revision 1294957, and this time, I did not see the 'table/view already exist' failure - all pass. So, that really was unrelated. Also, my run on linux with ibm 1.6 sr9fp1(skipping Derby5624Test) passed cleanly again. So the only remaining concern is the running out of file descriptors on a default configured linux system (with max no. of open files set to 1024).

          Myrna van Lunteren added a comment -

          I changed the maximum open file setting for one linux machine from 1024 to 2048, and then the Derby5624Test passed.
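           For reference, the limit Myrna changed can be inspected and raised per shell with the
           ulimit builtin (values taken from her comment; making the change persistent is
           distribution-specific and not shown):

```shell
# Show the current soft limit on open file descriptors (often 1024 by default)
ulimit -n
# Raise it to 2048 for this shell and any test processes it launches
ulimit -n 2048
```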

          Myrna van Lunteren added a comment -

          I think the fact that I saw the File already exists failure with both a run on windows and one on linux, while we have the test skipped on linux, proves that your test does not cause any harm. Apologies for not making this clear.

          Mike Matrigali added a comment -

           I'd appreciate it if someone good with JUnit could review my new test. Most critical is whether it
           has the appropriate decorators. I could move the new test to the end if we think it is affecting the
           other tests, or alter the test to drop the table sooner (I was assuming the table would get dropped
           automatically). The new test really should not be much of a drain on disk space compared to the rest
           of the tests, as it mostly adds 1000 small files.

           I can also back the test out on Windows if we think it is causing problems there.

          Myrna van Lunteren added a comment -

           Hi Kristian, thanks for your response.
           I didn't mean to imply that the disk usage is related to the changes for DERBY-5614 or any other
           recent change; apologies for the confusion.

           In comments on this bug I was referring to the full largedata._Suite (suites.All only runs
           largedata.LobLimitsLiteTest).

          Kristian Waagan added a comment -

           A change introduced by DERBY-5614 may have caused the slowdown, but I don't see how it's related to the extra disk usage. The bug in SpawnedProcess would cause the VM to stay alive idling until the TimerTask got purged (even when cancelled), but the order of the tests matters. So it would only be relevant if a test using SpawnedProcess ran less than 45 minutes before the last test in the run finished.

           When I ran suites.All in parallel just now (4 runners), it took 1 hour and 6 minutes (including org.apache.derby.PackagePrivateTestSuite and org.apache.derbyTesting.functionTests.tests.memory._Suite). This was on Solaris 11 with JDK 7. Assuming all 4 runners are busy at all times, that's a worst-case time of around 4.5 hours; 11 hours definitely sounds like way too much (unless that machine is running a lot of other stuff at the same time).

          Myrna van Lunteren added a comment -

          I've seen this failure again, this time on windows.
          It happened in the very last test:
          (emb)largedata.LobLimitsTest.test_05_ClobNegative used 707031 ms E
          The Derby5624Test passed on this windows run.

          It appears to me that even without the new test the largeDataTest suite runs slower than before on Linux.
          Perhaps this is related to changes made to the test harness recently?
          Some changes went in just today that perhaps improve this - I'll rerun and report back.

          Unfortunately, when the suite passes on my Linux machines, it takes about 11 hours or longer, and fills up the disk to up to 99% at times.
          I use a new test directory each night. So if there's a failure, things get progressively worse in subsequent tests because of left-over files filling up the disk. It's a balancing act between getting time to look at a failure and filling up the disk.

          Matthew McCawley added a comment -

          I ran the patch on my databases in Ubuntu and did not have the same problem.

          Myrna van Lunteren added a comment -

          Oddly enough, after the latest change, I saw another failure with the largeDataTest suite:

           1) LobLimitsTest java.sql.SQLException: Table/View 'BLOBTBL' already exists in Schema 'APP'.
          at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(Unknown Source)
          at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
          at org.apache.derby.client.am.Statement.execute(Unknown Source)
          at org.apache.derbyTesting.functionTests.tests.largedata.LobLimitsTest.setupTables(LobLimitsTest.java:107)
          at org.apache.derbyTesting.functionTests.tests.largedata.LobLimitsTest$1.decorateSQL(LobLimitsTest.java:141)
          at org.apache.derbyTesting.junit.CleanDatabaseTestSetup.setUp(CleanDatabaseTestSetup.java:112)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:20)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          Caused by: org.apache.derby.client.am.SqlException: Table/View 'BLOBTBL' already exists in Schema 'APP'.
          at org.apache.derby.client.am.Statement.completeSqlca(Unknown Source)
          at org.apache.derby.client.am.Statement.completeExecuteImmediate(Unknown Source)
          at org.apache.derby.client.net.NetStatementReply.parseEXCSQLIMMreply(Unknown Source)
          at org.apache.derby.client.net.NetStatementReply.readExecuteImmediate(Unknown Source)
          at org.apache.derby.client.net.StatementReply.readExecuteImmediate(Unknown Source)
          at org.apache.derby.client.net.NetStatement.readExecuteImmediate_(Unknown Source)
          at org.apache.derby.client.am.Statement.readExecuteImmediate(Unknown Source)
          at org.apache.derby.client.am.Statement.flowExecute(Unknown Source)
          at org.apache.derby.client.am.Statement.executeX(Unknown Source)
          ... 26 more

          Mike Matrigali added a comment -

           I will look at the Linux error. I had run the test on Windows and it was fine there. The first thing
           I will look at is the compress call. It currently uses the default, which builds all indexes in
           parallel and may require all the files to be open at the same time. I will see if the sequential
           option still triggers the bug before the fix. If I can't get a fix soon, I will disable the test on
           non-Windows platforms.
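           The sequential option is the third argument of SYSCS_COMPRESS_TABLE: non-zero asks Derby to rebuild
           the indexes one at a time instead of in parallel. A small sketch (schema and table names are
           illustrative; the helper only builds the SQL text, which would be executed via JDBC's
           Connection.prepareCall):

```java
// Sketch of the sequential-compress call Mike mentions. Hypothetical
// schema/table names; only the CALL text is built here.
public class SequentialCompress {

    // A non-zero third argument requests a sequential index rebuild,
    // which keeps fewer files open at once than the parallel default.
    static String compressCall(String schema, String table, boolean sequential) {
        return String.format(
            "CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('%s', '%s', %d)",
            schema, table, sequential ? 1 : 0);
    }

    public static void main(String[] args) {
        // With a live connection this would be run as:
        //   conn.prepareCall(compressCall(...)).execute();
        System.out.println(compressCall("APP", "BIGTABLE", true));
    }
}
```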

          Myrna van Lunteren made changes -
          Attachment largedatasuite_mod.out [ 12515645 ]
          Attachment error-stacktrace_mod.out [ 12515646 ]
          Myrna van Lunteren added a comment -

          Uploading the output of the largedata test run which failed, and the error-stacktrace from the fail directory.
          I've modified the location of the test dir a bit.

          Myrna van Lunteren made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Myrna van Lunteren added a comment -

           I'm reopening because I saw a failure in my nightly test of the largeDataTests suite; it looks like this test is now pushing past the number of open files supported for my account:

          (emb)largedata.Derby5624Test.testDERBY_5624 java.io.FileNotFoundException: /local1/cloudtst/dev/src/NightlyBuildResults.2012-02-21/largeDataTests/fail/Embedded_40/Derby5624Test/testDERBY_5624/derby.log (Too many open files)
          at java.io.FileOutputStream.open(Native Method)
          at java.io.FileOutputStream.<init>(FileOutputStream.java:190)
          at java.io.FileOutputStream.<init>(FileOutputStream.java:142)
          at org.apache.derbyTesting.functionTests.util.PrivilegedFileOpsForTests.copySingleFile(PrivilegedFileOpsForTests.java:305)
          at org.apache.derbyTesting.functionTests.util.PrivilegedFileOpsForTests.recursiveCopy(PrivilegedFileOpsForTests.java:261)
          at org.apache.derbyTesting.functionTests.util.PrivilegedFileOpsForTests.access$000(PrivilegedFileOpsForTests.java:49)
          at org.apache.derbyTesting.functionTests.util.PrivilegedFileOpsForTests$8.run(PrivilegedFileOpsForTests.java:235)
          at java.security.AccessController.doPrivileged(AccessController.java:251)
          at org.apache.derbyTesting.functionTests.util.PrivilegedFileOpsForTests.copy(PrivilegedFileOpsForTests.java:233)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:132)

          I'll attach the full error stack trace, and the console output of the suite run.

          Mike Matrigali added a comment -

          checked in fix to trunk. Will let it sit there and make sure it passes tests across the nightly platforms and look at backporting it next week.

           DERBY-5624 System can run out of stack space while processing DropOnCommit requests.

           Taking care of cleanup after a commit is handled by notifying all "Observers"
           that an event has taken place that they might want to act on and clean up
           after. In the added test case this is triggered by an offline compress, which
           effectively drops and recreates the base table and all of its indexes after
           loading the data into them.

           Sometimes these Observers may execute work which adds to the Observer queue,
           and the first pass through the queue can "miss" those late additions.

           A previous fix for this problem added a recursive call to notifyObservers at
           the place that could cause this addition of observers. The recursive call
           caused stack problems when the number of Observers became large: for the
           checked-in test case, 1000 indexes on the 1000 columns of the table. For
           other users I believe the cause was a by-product of large disk-based sorts
           for multi-gigabyte tables and indexes. Two users reported similar failed
           stacks from compressing large tables, and one was able to take this fix
           into their environment and then run the compress successfully.

          The fix was to remove the recursion and instead loop at the outermost point
          until there were no Observers.

          Adding the test to the largedata suite as it takes over 10 minutes to run
          on my machine.

          Mike Matrigali made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 10.9.0.0 [ 12316344 ]
          Resolution Fixed [ 1 ]
          Matthew McCawley added a comment -

          Built on 10.8 and run with a large table (over 45 million rows). I'll update if something changes, but it looks like this patch fixed the issue.

          Matthew McCawley added a comment -

          This seems to have fixed the problem for me. It also decreases the runtime as it doesn't have to create such a large stack. However, I was running 10.8, and I built the patch on main (10.9 beta), so I need to rebuild on 10.8 to make sure that nothing else is different as well as test on a larger database.

          Mike Matrigali made changes -
          Labels derby_triage10_9
          Issue & fix info High Value Fix [ 10422 ] High Value Fix,Patch Available,Repro attached [ 10422,10102,10424 ]
          Mike Matrigali added a comment -

           I have repro'd what I think is the problem using a very unrealistic test case (a table with 1000
           columns and an index on every column). It would be nice to get a more real-world test case that just
           uses a big table and whatever DDL on that table is causing the issue.

           The 2 users running into this both seem to be using relatively big tables, but I have not seen their
           DDL yet. I am wondering if the problem they are seeing is related to either on-disk sort files or
           blobs.
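           A repro of the shape described above (1000 columns, an index on every column) can be generated
           mechanically. This sketch uses hypothetical names (WIDE_TABLE, IX*) and only builds the DDL strings:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the repro DDL: one wide table plus one index per column.
// Table and index names are illustrative, not from the checked-in test.
public class ReproDdl {

    static List<String> buildDdl(int columns) {
        List<String> stmts = new ArrayList<>();
        StringBuilder create = new StringBuilder("CREATE TABLE WIDE_TABLE (");
        for (int i = 1; i <= columns; i++) {
            if (i > 1) create.append(", ");
            create.append("C").append(i).append(" INT");
        }
        create.append(")");
        stmts.add(create.toString());
        // One index per column: each drop-and-recreate during compress
        // adds another DropOnCommit observer.
        for (int i = 1; i <= columns; i++) {
            stmts.add("CREATE INDEX IX" + i + " ON WIDE_TABLE (C" + i + ")");
        }
        return stmts;
    }

    public static void main(String[] args) {
        List<String> ddl = buildDdl(1000);
        System.out.println(ddl.size()); // 1 CREATE TABLE + 1000 CREATE INDEX
        System.out.println(ddl.get(1));
    }
}
```

           Running SYSCS_UTIL.SYSCS_COMPRESS_TABLE against such a table is what pushed the old recursive
           notification past the stack limit.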

          Mike Matrigali made changes -
          Field Original Value New Value
          Attachment DERBY-5624.patch [ 12515284 ]
          Mike Matrigali added a comment -

           Initial proof-of-concept patch; no extensive tests run yet. Passes the store test for the original
           3993 error. The included test failed before the fix and passes after it. I will probably alter the
           test to be less of a resource hog, or look at putting it into a different suite. It currently creates
           a table with 1000 columns and an index on each of those columns, and then runs compress table.

          Mike Matrigali created issue -

            People

            • Assignee:
              Mike Matrigali
              Reporter:
              Mike Matrigali
            • Votes:
              3 Vote for this issue
              Watchers:
              3 Start watching this issue
