Derby / DERBY-2949

AssertionFailedError in testStalePlansOnLargeTable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.3.1.4
    • Fix Version/s: 10.7.1.4, 10.8.1.2
    • Component/s: Test
    • Labels:
      None
    • Environment:
    • Bug behavior facts:
      Regression Test Failure

      Description

      testStalePlansOnLargeTable(org.apache.derbyTesting.functionTests.tests.lang.StalePlansTest)
      junit.framework.AssertionFailedError
      at org.apache.derbyTesting.functionTests.tests.lang.StalePlansTest.testStalePlansOnLargeTable(StalePlansTest.java:264)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:88)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.extensions.TestSetup.run(TestSetup.java:25)
      at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.extensions.TestSetup.run(TestSetup.java:25)
      at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)

      1. d2949.diff
        4 kB
        Knut Anders Hatlen
      2. expected-plan.txt
        2 kB
        Knut Anders Hatlen
      3. plan.txt
        3 kB
        Knut Anders Hatlen

          Activity

          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.

          Knut Anders Hatlen added a comment -

          Committed to trunk with revision 1043389.
          Committed to 10.7 with revision 1043392.

          Knut Anders Hatlen added a comment -

          Attached is a patch that changes StalePlansTest in the following ways:

          suite():

          • Remove the explicit setting of checkpoint interval since it doesn't appear to serve any purpose.
          • Remove creation of the flusher table from the decorator since it's not used by the test anymore.

          flushRowCount():

          • Invoke a checkpoint instead of scanning the flusher table to update the row count estimates.

          testStalePlansOnLargeTable():

          • Increase the number of rows added after the first execution of the select statement, so that the row count change is big enough to trigger a recompilation at the next stale plan check interval.
          • Update assertions to match other values returned by the queries because of the added rows.

          I've run StalePlansTest successfully 2000 times with the patch. When I ran the test 2000 times without the patch, it failed 50 times.

          Knut Anders Hatlen added a comment -

          It also looks like it will be more reliable to invoke a checkpoint explicitly. Even if we make the flusher table bigger or the page cache smaller, filling up the page cache with the flusher table does not mean every dirty page will be written to disk immediately. The page cache will just schedule work with the background worker thread, so we'd have to wait for some time in order to be sure that the row counts have actually been updated. SYSCS_CHECKPOINT_DATABASE, on the other hand, won't return until the dirty pages have been written to disk.

          If I only change flushRowCount() to call SYSCS_CHECKPOINT_DATABASE(), testStalePlansOnLargeTable() starts failing consistently. The problem appears to be that the row count change is not significant enough to trigger a recompile at the time when the test expects the plan to change. A significant change is 10% when the table has more than 400 rows (see BaseActivation.informOfRowCount()). The actual change in the row count at that point was less than 1%, but it used to work before because the initial row count estimate was inaccurate, so the change appeared to be larger. I think the solution is to add more rows to the table to make sure the row count changes enough to always trigger a recompile.
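The 10% rule mentioned above can be sketched as a small standalone check. This is an illustrative model of the threshold described in the comment, not Derby's actual BaseActivation code; the method name and the handling of small tables are assumptions.

```java
// Illustrative sketch of the stale-plan significance check described above:
// on a table with more than 400 rows, a row count change only counts as
// significant (and can trigger recompilation) when the estimate changed
// by at least 10%. See BaseActivation.informOfRowCount() in the Derby
// sources for the real logic; this class is a simplified model.
public class RowCountCheck {
    static boolean isSignificantChange(long oldCount, long newCount) {
        if (oldCount <= 400) {
            return false; // small-table rule not modeled in this sketch
        }
        long delta = Math.abs(newCount - oldCount);
        return delta * 10 >= oldCount; // at least a 10% change
    }

    public static void main(String[] args) {
        // A change of under 1% on a large table: no recompile.
        System.out.println(isSignificantChange(10000, 10080)); // false
        // Adding enough rows to change the count by well over 10%.
        System.out.println(isSignificantChange(10000, 11375)); // true
    }
}
```

This matches the failure mode in the comment: a sub-1% change never crosses the threshold, so the plan is never recompiled.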

          Lily Wei added a comment -

          +1 for making the test behave predictably. Tests are meant to surface potential issues, not to add variations of the issue. I can see that leaning toward replacing the selects from the FLUSHER table with explicit checkpoints is a good balance in this case. However, the goal of the original test was a good one. I am not aware of whether we have a test that fills the page cache with new data, evicting all pages currently in the cache so that the row count changes are written at a known point in time. It would be a good test to have from a code coverage point of view.

          Knut Anders Hatlen added a comment -

          It may be possible to improve the accuracy/predictability of the row count estimate, but for this issue I think we should just try to make the test behave predictably.

          The test frequently executes the query "select count(c1) from flusher". The comments say this about its purpose:

          // Create and populate a table to be used for flushing the
          // cache. Flushing the cache causes all row count changes to be
          // written, which is necessary for the results of this test to
          // be stable (because otherwise the row count changes would be
          // written asynchronously)

          It sounds like the query is meant to fill the page cache with new data, evicting all pages currently in the cache, so that the row count changes are written at a known point in time. So the net effect, at least with respect to the row count estimate, should be the same as what we get from a checkpoint. And it sounds like it was added in order to avoid exactly this kind of intermittent failure.

          However, I don't think the code works as intended because the FLUSHER table is not big enough to flush all pages from the page cache. The table must be at least as big as the page cache in order to achieve that, so the test needs to increase the size of the table or reduce the size of the page cache (or both). Alternatively, we could replace the selects from the FLUSHER table with explicit checkpoints. I'm leaning towards the latter option (using checkpoints to flush the estimates), since that sounds like the simpler one (no need to know exact size of the page cache or the exact number of pages in the FLUSHER table).
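The reason a too-small FLUSHER table fails to flush everything can be shown with a toy LRU page cache. All sizes and page names below are made up for illustration; this is not Derby's cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU page cache illustrating the point above: scanning a "flusher"
// table with fewer pages than the cache holds cannot evict (and thus
// force out) every dirty page, whereas a scan at least as large as the
// cache evicts everything. Sizes and names are hypothetical.
public class PageCacheDemo {
    public static void main(String[] args) {
        final int cacheSize = 1000;
        Map<String, Boolean> cache =
                new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > cacheSize; // evict in LRU order when full
            }
        };

        // A dirty page belonging to the table under test sits in the cache.
        cache.put("t1-page-0", true);

        // Scan a flusher table that is smaller than the cache.
        for (int i = 0; i < 400; i++) cache.put("flusher-page-" + i, false);
        // The dirty page was never evicted, so it was never written out.
        System.out.println(cache.containsKey("t1-page-0")); // true

        // A scan as large as the whole cache does evict it.
        for (int i = 0; i < cacheSize; i++) cache.put("big-page-" + i, false);
        System.out.println(cache.containsKey("t1-page-0")); // false
    }
}
```

This is why the comment concludes that the table must be at least as big as the page cache, and why an explicit checkpoint is the simpler fix.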

          Knut Anders Hatlen added a comment -

          Kristian gave me some hints as to where to look... Thanks!

          It looks like we have two mechanisms for updating the row count estimate:

          1) Full table scans will set the correct row count when they complete.

          2) When a dirty page is written to disk (for example in a checkpoint) the row count estimate is updated with the delta between the current row count and the row count at the time the page was last read from or written to disk.

          But those two mechanisms don't know about each other, so when we're adding a delta to the estimate while writing the page, that delta may include rows that have already been accounted for by a full table scan, and those changes will end up being applied twice to the estimate.

          For example I see this when accessing a table that contains 10000 rows:

          1. First do SELECT COUNT(*) FROM T, and the row count will be correctly set to 10000

          2. Delete every other row in the table (DELETE FROM T WHERE MOD(ID, 2) = 0), the row count estimate won't be changed

          3. Do SELECT COUNT(*) FROM T again, and the estimate will be changed to 5000 (which is correct)

          4. Invoke a checkpoint, which will write all the pages in the table. It will detect that each page has only half the number of rows they had when they were read into the cache in (1), and it will therefore reduce the estimate by another 5000 rows, so that the estimate now is 0 rows, even though the table still contains 5000 rows
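The four steps above can be simulated in a few lines. The class below is a model of the double-counting behaviour described in the comment, not Derby's actual estimator; the field names are invented for the illustration.

```java
// Simulation of the double-counting described above: a full scan sets the
// estimate directly, while a checkpoint applies a delta computed against
// the row count from the pages' last on-disk version. Because the two
// mechanisms don't know about each other, the same deletions get
// subtracted twice. (Hypothetical model, not Derby code.)
public class EstimateDemo {
    public static void main(String[] args) {
        long estimate;
        long onDiskRowCount;

        // 1. Full scan of a 10000-row table sets the estimate correctly.
        estimate = 10000;
        onDiskRowCount = 10000;  // pages were read into the cache here

        // 2. Delete every other row; the estimate is not updated.
        long currentRows = 5000;

        // 3. A second full scan corrects the estimate to 5000.
        estimate = currentRows;

        // 4. A checkpoint writes the dirty pages and applies the delta
        //    between the current rows and the stale on-disk baseline.
        estimate += currentRows - onDiskRowCount;  // 5000 + (5000 - 10000)

        System.out.println(estimate); // 0, although the table has 5000 rows
    }
}
```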

          Knut Anders Hatlen added a comment -

          I see that if I call SYSCS_UTIL.SYSCS_CHECKPOINT_DATABASE right before the loop that performs the last 11 executions of the query, the assert fails consistently. I'm not sure exactly how the checkpoint interacts with the optimizer, though. Perhaps it's doing something with the row count estimates?

          The reason why I tried adding a checkpoint at that point in the code, was that the test sets derby.storage.checkpointInterval=100000. There aren't any comments saying why it does that, neither in StalePlansTest nor in the original staleplans.sql, but I suppose it is to make the checkpoints happen at the spots where it's known to be safe and/or needed.

          I also see that the assert fails consistently if I effectively disable checkpointing for the test by setting derby.storage.checkpointInterval to the maximum allowed value (128*1024*1024). If I run the test with the default settings, I see that the background worker thread invokes a checkpoint when testStalePlansOnLargeTable has loaded the rows into the table. I suppose, depending on timing, this checkpoint could end up running at a time where it interferes with the optimizer's decisions.

          Would it make sense to disable automatic checkpointing for the test and add a call to SYSCS_CHECKPOINT_DATABASE after the table has been populated? Then at least we know that the checkpoint has been performed at the expected spot, and that it has completed before we start executing the statements whose plans we want to check.

          And if someone could shed some light on why the checkpoint affects the choices of the optimizer, that would be appreciated too...

          Knut Anders Hatlen added a comment -

          Thanks Lily, that's a very useful observation! The query has been executed 12 times at the point of the failing assert. So it seems like in the good case the query is recompiled after the tenth execution (the test runs with derby.language.stalePlanCheckInterval=10, so that sounds reasonable). In the failing case, however, no recompilation has happened, and all 12 executions have used the same plan.
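The cadence described above can be sketched as simple arithmetic. Whether Derby performs the check on exactly every Nth execution is an assumption here; the comment only says the recompile happened after the tenth execution with derby.language.stalePlanCheckInterval=10.

```java
// Sketch of the stale-plan check cadence: with a check interval of 10,
// the plan is only re-examined on every tenth execution, so 12 executions
// get at most one staleness check. (The exact trigger point is assumed.)
public class StalePlanCadence {
    static int checksDuring(int executions, int interval) {
        int checks = 0;
        for (int n = 1; n <= executions; n++) {
            if (n % interval == 0) checks++;  // check fires on the nth run
        }
        return checks;
    }

    public static void main(String[] args) {
        // 12 executions, interval 10: one check, after the 10th execution.
        System.out.println(checksDuring(12, 10)); // 1
        // Fewer than 10 executions: the plan is never re-examined.
        System.out.println(checksDuring(9, 10));  // 0
    }
}
```

If that single check finds no significant row count change, all 12 executions keep the original plan, which is exactly the failing case.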

          Lily Wei added a comment -

          Hi Knut:
          I noticed differences in "Number of opens" and "Rows input". Please see below. Could these two numbers be causing the instability? Should the numbers be exactly the same, since the data is the same? What do you think?
          =============
          Expected-plan.txt
          ....
          Source result set:
          ...
          Number of opens = 2
          Rows input = 2068
          ...
          Index Key Optimization = false
          ..
          Number of opens = 2
          Rows seen = 2068
          ...
          Plan.txt
          ...
          Source result set:
          ...
          Number of opens = 12
          Rows input = 11375
          ...
          Index Key Optimization = false
          ..
          Number of opens = 12
          Rows seen = 11375

          Knut Anders Hatlen added a comment -

          For completeness, I'm also attaching a copy of the expected plan (expected-plan.txt).

          Knut Anders Hatlen added a comment -

          I managed to get hold of the query plan that was chosen when the test failed. See the attached file plan.txt. I have no idea what's causing this instability, though.

          Dag H. Wanvik added a comment -

          Saw it again today, OpenSolaris snv_146, java version "1.6.0_16", Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)

          Knut Anders Hatlen added a comment -

          Saw this failure again on the following platform:

          OpenSolaris 2010.05 snv_134b X86
          java version "1.6.0_18"
          Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
          Java HotSpot(TM) Server VM (build 16.0-b13, mixed mode)


            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Henri van de Scheur
      • Votes: 0
        Watchers: 1
