Derby > DERBY-1902

Intermittent failures in predicatePushdown.sql

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.3.1.4
    • Fix Version/s: None
    • Component/s: SQL, Test
    • Labels:
      None
    • Environment:
      Seen on both Solaris 10 and Linux on 2-CPU Opteron boxes, disk cache off
    • Bug behavior facts:
      Regression Test Failure

      Description

      For the last week, there have been intermittent failures in the nightly tests in lang/predicatePushdown.sql. There is a plan diff which starts as follows:

          Diff file: derbyall/derbylang/predicatePushdown.diff
          *** Start: predicatePushdown jdk1.5.0_07 derbyall:derbylang 2006-09-29 00:39:36 ***
          4593 del
          < Hash Join ResultSet:
          4593a4593
          > Nested Loop Join ResultSet:

      I did not find any changes that seem relevant before the first failing nightly test.

      This test has not failed in the tinderbox test which runs on a computer with the disk cache on. For both computers where the failure is seen, the disk cache has been turned off. Hence, it may be that another plan is picked because of slower I/O.

      1. derbylang.zip (441 kB) - Øystein Grøvlen

        Issue Links

          Activity

          A B added a comment -

          Is it possible to post the files for the run when it fails? And do you have the svn number for the first time this failure was seen?

          Interesting observation about the disk cache--thanks for bringing that up. Is it feasible to turn disk cache on for one of the machines that's failing and run the tests for a couple of days to see if that makes a difference? Thanks Øystein...

          Øystein Grøvlen added a comment -

          derbylang.zip contains all files from one failing run of lang/predicatePushdown.sql.

          Øystein Grøvlen added a comment -

          The failure was first observed in the nightly run on svn revision 449444. It was not observed on 448277. However, the failure is intermittent so one cannot be sure that it was introduced between those revisions.

          Øystein Grøvlen added a comment -

          I turned the disk cache back on for the Solaris x86 box, and the test still fails. (The test duration became significantly shorter, so I think the disk cache setting was really changed.) Hence, the disk cache does not seem to be causing this. It is strange that the tinderbox test, which runs on a similar computer, never experiences this failure.

          A B added a comment -

          > I turned the disk cache back on the Solaris x86 box, and the test still fails.

          Thanks for trying that out and for reporting the results, Øystein. I took a look at the diffs and it basically comes down to 4 queries where we expect the optimizer to choose a hash join but it chooses a nested loop join instead. This is confusing to me since the derby.optimizer.noTimeout=true property is set, which theoretically means that the optimizer should be choosing the same plan every time.

          Unfortunately, I don't have much more to offer on this right now. I do know that during optimization a hash join can be skipped/rejected if the optimizer thinks that the resultant in-memory hash table would be too big. The notion of "too big" can be set by using the maxMemoryPerTable property, but since that isn't specified for the test, it shouldn't make a difference in this case.

          I did notice that, when trying to determine if the hash table is too large, the computation depends on the value returned from HashJoinStrategy.maxCapacity(), which makes a call to ClassSize.estimateHashEntrySize(), which in turn uses a variable "refSize" that is set based on the total and free memory available to the Java runtime:

          // Figure out whether this is a 32 or 64 bit machine.
          Runtime runtime = Runtime.getRuntime();
          long memBase = runtime.totalMemory() - runtime.freeMemory();
          // Allocate 10000 references and see how much used memory grows.
          Object[] junk = new Object[10000];
          long memUsed = runtime.totalMemory() - runtime.freeMemory() - memBase;
          // Average bytes per array slot, rounded, i.e. bytes per reference.
          int sz = (int) ((memUsed + junk.length/2) / junk.length);
          refSize = (4 > sz) ? 4 : sz;
          minObjectSize = 4 * refSize;

          It's quite a long shot, but maybe that has something to do with the different results on different machines...? Do the machines have the same amount of memory available to them when the test is run?

          I admit this is a pretty far-fetched guess, but that's all I can think of at the moment...

          Knut Anders Hatlen added a comment -

          What are the expected values for 32/64 bit machines? I get refSize=4 and minObjectSize=16 for both 64 and 32 bit JVMs (jdk1.5 and jdk1.6 on Solaris). In many cases I actually get sz=0 because the memory usage is identical before and after "new Object[10000]", and refSize just defaults to 4.
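          The probe can be run standalone. The following is a sketch mirroring the snippet quoted above (the class and method names here are invented for illustration; this is not the actual Derby ClassSize source):

          ```java
          // Standalone sketch of the ClassSize-style reference-size probe.
          public class RefSizeProbe {

              public static int probeRefSize() {
                  Runtime runtime = Runtime.getRuntime();
                  long memBase = runtime.totalMemory() - runtime.freeMemory();
                  Object[] junk = new Object[10000];
                  long memUsed = runtime.totalMemory() - runtime.freeMemory() - memBase;
                  // Average bytes consumed per array slot, i.e. per reference.
                  int sz = (int) ((memUsed + junk.length / 2) / junk.length);
                  // If the array was served from already-committed heap space, or a
                  // concurrent GC freed memory between the two readings, memUsed can
                  // be zero (or even negative), so sz collapses to 0 and the probe
                  // falls back to the minimum of 4.
                  return Math.max(4, sz);
              }

              public static void main(String[] args) {
                  System.out.println("refSize = " + RefSizeProbe.probeRefSize());
              }
          }
          ```

          Because of the fallback, the probe can never report less than 4, which is consistent with the sz=0 readings collapsing to refSize=4.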

          Knut Anders Hatlen added a comment -

          I tried to hard code refSize=8 (I assume this is the expected value for 64 bit machines?) and then predicatePushdown failed consistently on both 32 and 64 bit machines.

          Andrew McIntyre added a comment -

          Unsetting Fix Version on unassigned issues.

          Mike Matrigali added a comment -

          I saw this also on a 64-bit Windows 2003 machine running the first 10.3 release candidate against an IBM 1.4.2 JVM.

          Kathey Marsden added a comment -

          Is the changed plan really a product regression or just a RegressionTestFailure? Can the regression checkbox be unchecked?

          Kathey Marsden added a comment -

          Unchecking the Regression checkbox as this is not a confirmed product regression.

          Jørgen Løland added a comment -

          I saw this on Solaris 10/x86 on revision 635860:

              *** Start: predicatePushdown jdk1.5.0_06 derbyall:derbylang 2008-03-11 15:34:04 ***
              4639 del
              < Hash Join ResultSet:
              4639a4639
              > Nested Loop Join ResultSet:
              4718 del
              < Hash Table ResultSet (11):
              4718a4718
              > Sort ResultSet:
              4720 del
              < Hash table size = 100
              4721 del
              < Hash key is column number 0
              4722 del
              < Rows seen = 104
              4723 del
              < Rows filtered = 0

          ... and so on. I have the entire test directory intact if requested.

          Mamta A. Satoor added a comment -

          Saw this on trunk (revision 645005) with ibm16

          Kathey Marsden added a comment -

          This may be fixed by the addition of gc() to the reference size calculation in DERBY-3731. We should monitor and see if it comes up again soon. If not, perhaps this issue can be closed.
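          A rough sketch of that idea (hypothetical code, not the actual DERBY-3731 patch): take each memory reading only after requesting a collection, so transient garbage from earlier allocations does not skew the delta between the two readings:

          ```java
          // Hypothetical sketch of stabilizing the reference-size probe with gc(),
          // in the spirit of DERBY-3731; not the actual committed code.
          public class GcStabilizedProbe {

              // Request a collection before sampling, so objects that are already
              // garbage do not inflate the used-memory reading.
              private static long usedMemory(Runtime rt) {
                  rt.gc();
                  return rt.totalMemory() - rt.freeMemory();
              }

              public static int probeRefSize() {
                  Runtime rt = Runtime.getRuntime();
                  long memBase = usedMemory(rt);
                  Object[] junk = new Object[10000];
                  // junk is still strongly reachable here, so the second gc()
                  // cannot collect it; only unrelated garbage is removed.
                  long memUsed = usedMemory(rt) - memBase;
                  int sz = (int) ((memUsed + junk.length / 2) / junk.length);
                  return Math.max(4, sz);
              }

              public static void main(String[] args) {
                  System.out.println("refSize = " + GcStabilizedProbe.probeRefSize());
              }
          }
          ```

          Note that Runtime.gc() is only a hint to the JVM, so this makes the estimate more stable rather than fully deterministic.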

          Kathey Marsden added a comment -

          I wonder if it would make sense to change derby.language.maxMemoryPerTable to 2048 for this test to avoid this intermittent failure.
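          One way to pin the property for a test (a sketch; the actual harness may set it through a test-specific derby.properties file instead) is to set it as a JVM system property before the embedded engine boots. The value is in kilobytes:

          ```java
          // Sketch: pin the per-table hash-join memory ceiling before Derby boots,
          // so the optimizer's hash/no-hash decision no longer depends on the
          // probed reference size. Must run before the embedded driver loads,
          // e.g. before Class.forName("org.apache.derby.jdbc.EmbeddedDriver").
          public class PinMaxMemoryPerTable {
              public static void main(String[] args) {
                  System.setProperty("derby.language.maxMemoryPerTable", "2048");
                  System.out.println(
                      System.getProperty("derby.language.maxMemoryPerTable"));
              }
          }
          ```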

          Kristian Waagan added a comment -

          Before we change maxMemoryPerTable, I'd like to try another algorithm under DERBY-3731. I'll try to make a patch, hopefully tomorrow, by next week at the latest.

          Mike Matrigali added a comment -

          I am OK with upping the maxMemory to get this test to be reproducible on multiple systems.

          I also wonder if it would be OK to look at upping the default for this in the whole system. That default was chosen a number of years ago. Derby query processing performs WAY better whenever it can use a hash join vs. other types of join. These defaults are hard to choose, as there will always be someone for whom the default is not right. I would argue we should only change the default in a major new release, and never backport the change to an already released version.

          This is a standard memory-for-performance tradeoff. The memory is only used for the time it takes to process the query, and is then released.

          I am not sure of the right number; I am thinking about proposing 10 MB vs. 1 MB. The default buffer cache size used to be under 100 pages and has since been updated to 1000 pages. Should this be a hard vs. soft upgrade thing?

          This should probably be handled under a separate JIRA. Along with this, someone should look and see if this number should be coordinated with any other similar number in query processing.

          Ole Solberg added a comment -

          A slightly different signature has occurred on the 10.4 branch (Linux 2.6.18-53.1.14.el5, Sun JVM 1.6) since 2008-07-04 / 674085.
          See e.g.: http://dbtg.thresher.com/derby/test/10.4Branch/jvm1.6/testing/testlog/lin/685954-derbyall_diff.txt

          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2. Duplicate of DERBY-1720.


            People

            • Assignee: Unassigned
            • Reporter: Øystein Grøvlen
            • Votes: 0
            • Watchers: 1
