Derby / DERBY-2130

Optimizer performance slowdown from 10.1 to 10.2

    Details

    • Urgency:
      Normal
    • Issue & fix info:
      High Value Fix, Newcomer, Repro attached
    • Bug behavior facts:
      Performance, Regression

      Description

      Attached is 'repro.sql', an IJ script which demonstrates what I
      believe to be a serious performance issue in the Optimizer.

      I have run this script in a number of configurations:

      • 10.1.2.1: the script runs successfully. The 'prepare' statement
        takes about 90 seconds, on a fairly powerful Windows machine
      • 10.1.3.1: the script produces an NPE. I believe this is DERBY-1777
      • 10.2.1.8/trunk: the script runs successfully. The 'prepare' statement
        often takes about 220 seconds, on the same Windows machine

      Intermittently, on 10.2 and on the trunk, the prepare statement takes
      15+ minutes. I cannot reliably reproduce this; I run the same script
      several times in a row and I cannot predict whether it will take 220
      seconds or whether it will take 15+ minutes.

      I am quite motivated to work on this problem, as this is blocking me from
      using Derby for a project that I'm quite keen on, but I need some
      suggestions and ideas about how to attack it. From my perspective
      there are 3 primary topics:

      1) Why did optimizer performance for this query degrade so significantly
      from 10.1.2.1 to 10.2? For this particular query the optimizer seems to be
      at least 2.5 times slower in 10.2, and sometimes 10x slower.

      2) What is the source of the non-determinism? Why does the optimizer
      often take 4 minutes to optimize this query on the trunk, but sometimes
      take 15+ minutes? I don't believe that I'm changing anything from
      run to run.

      3) Can we improve the optimizer performance even beyond what it was
      for 10.1.2? I realize that this is an ugly query, but I was hoping to
      see an optimization time of 5-10 seconds, not 90 seconds (and certainly
      not 220 seconds).

      I have attempted to start answering some of these questions, with
      limited success. Here is some of what I think I've discovered so far:

      • the optimizer changes in 10.2 seem to have given the optimizer many
        more choices of possible query plans to consider. I think this means
        that, if the optimizer does not time out, it will spend substantially
        more time optimizing because there are more choices to evaluate. Does
        this by itself mean that the optimizer will take 2.5 times longer in
        10.2 than it did in 10.1?
      • something about this query seems to make the costing mechanism go
        haywire, and produce extreme costs. While stepping through the
        optimization of this query in the debugger I have seen it compute
        costs like 1e63 and 1e200. This might be very closely related to
        DERBY-1905, although I don't think I'm doing any subqueries here.
        But maybe I'm misunderstanding the term "subquery" in DERBY-1905.
        At any rate, due to the enormous estimated costs, timeout does not
        occur.
      • the WHERE clause in this query is converted during compilation to
        an equivalent IN clause, I believe, which then causes me to run into
        a number of the problems described in DERBY-47 and DERBY-713.
        Specifically, rather than constructing a plan which involves 4
        index probes for the 4 WHERE clause values, the optimizer decides
        that an index scan must be performed and that it will have to process
        the entire index (because the query uses parameter markers, not
        literal values). So perhaps solving DERBY-47 would help me.
      • the optimizer in fact comes up with a "decent" query plan quite quickly.
        I have experimented with placing a hard limit into the optimizer
        timeout code, so that I can force optimization to stop after an
        arbitrary fixed period of time (see the sketch after this list).
        Then I have been able to set that
        value to as low as 1 second, and the optimizer has produced plans
        that then execute in a few milliseconds. Of course, I have only tried
        this with a trivial amount of data in my database, so it's possible
        that the plan produced by the optimizer after just a second of
        optimizing is in fact poor, and I'm just not noticing it because my
        data sizes are so small.
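
      To make the timeout behavior in the last two bullets concrete, here is a
      minimal, self-contained sketch with invented names; it is not Derby's
      OptimizerImpl code. It assumes (and this is an assumption on my part)
      that the optimizer's timeout threshold tracks the best cost estimate
      found so far, read as milliseconds, which would explain why an enormous
      best cost (1e62 and up) effectively disables the timeout. It also shows
      the hard wall-clock cap I experimented with, layered on top.

          // Minimal sketch with invented names; not Derby's OptimizerImpl.
          public class TimeoutSketch {

              // Cost-based timeout as I understand it: stop optimizing once the
              // time already spent exceeds the best estimated cost found so far
              // (cost estimates read as milliseconds).
              static boolean costBasedTimeout(long elapsedMillis, double bestCostSoFar) {
                  return elapsedMillis > bestCostSoFar;
              }

              // The hard cap from my experiment: stop after a fixed wall-clock
              // budget, no matter what the cost estimates look like.
              static boolean hardCap(long elapsedMillis, long capMillis) {
                  return elapsedMillis >= capMillis;
              }

              public static void main(String[] args) {
                  double hugeBestCost = 5.6e62; // the kind of estimate seen for this query
                  long elapsed = 220_000;       // 220 seconds spent optimizing

                  // With a best cost of 5.6e62 the cost-based check never fires...
                  System.out.println(costBasedTimeout(elapsed, hugeBestCost)); // false
                  // ...but a 1-second hard cap stops the search immediately.
                  System.out.println(hardCap(elapsed, 1_000));                 // true
              }
          }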

      At this point, what would be really helpful to me would be some suggestions
      about some general approaches or techniques to try to start breaking down
      and analyzing this problem.

      1. repro.sql
        9 kB
        Bryan Pendleton
      2. plan10_1_2_1.txt
        79 kB
        Bryan Pendleton
      3. plan10_2.txt
        78 kB
        Bryan Pendleton
      4. plans.diff
        13 kB
        Bryan Pendleton
      5. jumpReset.patch
        2 kB
        A B
      6. jumpReset_10_11_trunk_diff.txt
        2 kB
        Kathey Marsden


          Activity

          Myrna van Lunteren made changes -
          Fix Version/s 10.10.2.0 [ 12326659 ]
          Fix Version/s 10.10.1.4 [ 12324247 ]
          Mike Matrigali made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 10.10.1.4 [ 12324247 ]
          Resolution Fixed [ 1 ]
          Mike Matrigali added a comment -

           Backported from trunk to 10.10. I do not plan on backporting any further for now, but I see no reason it could not be done.

          ASF subversion and git services added a comment -

          Commit 1572351 from mikem@apache.org in branch 'code/branches/10.10'
          [ https://svn.apache.org/r1572351 ]

          DERBY-2130 - Optimizer performance slowdown from 10.1 to 10.2

          backporting change #1534465 from trunk to 10.10

          Patch contributed by Army Brown

          Mike Matrigali made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Mike Matrigali added a comment -

          just reopening while working on backport to 10.10.

          Mike Matrigali added a comment -

          looking at backporting this to 10.10

          Mike Matrigali made changes -
          Assignee Mike Matrigali [ mikem ]
          Mike Matrigali added a comment -

          consider for 10.10 backport

          Mike Matrigali made changes -
          Link This issue is related to DERBY-6481 [ DERBY-6481 ]
          Kathey Marsden made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 10.11.0.0 [ 12324243 ]
          Resolution Fixed [ 1 ]
          Kathey Marsden added a comment -

          Fixing just in trunk for now to monitor any performance change. Fix should be suitable for backport.

          Kathey Marsden made changes -
          Link This issue relates to DERBY-6392 [ DERBY-6392 ]
          Kathey Marsden added a comment -

           I tried patched 10.2 and found the prepare was 13 seconds, so it seems this patch did resolve the original performance regression. On patched 10.3 the prepare was 69 seconds, so I guess there was an additional slowdown from 10.2 to 10.3 which we are still seeing on trunk 10.11. I will resolve this issue and open another one for the 10.2 to 10.3 performance slowdown.

          ASF subversion and git services added a comment -

          Commit 1534465 from Kathey Marsden in branch 'code/trunk'
          [ https://svn.apache.org/r1534465 ]

          DERBY-2130 - Optimizer performance slowdown from 10.1 to 10.2

          Patch contributed by Army Brown

          Bryan Pendleton added a comment -

          +1 to checking in the patch. It clearly makes this case better, and that's a good thing.

          I don't think it's worth any backporting, as nobody is specifically waiting for it
          that I know of. If we put the patch only in trunk for now we can continue to
          try to figure out how to test it better.

          Kathey Marsden added a comment -

          So should I go ahead and check this in as is if tests pass? If so, should it go back to 10.10? We aren't really set up for performance regression tests so I am not really sure what testing we could add.

          Bryan Pendleton added a comment -

          Three times faster is nothing to sneeze at!

          Kathey Marsden made changes -
          Attachment jumpReset_10_11_trunk_diff.txt [ 12609310 ]
          Kathey Marsden added a comment -

           I manually merged the changes to trunk (jumpReset_10_11_trunk_diff.txt) and find the query runs in 64 seconds on trunk vs. 13 seconds on 10.1, so about three times faster than the unpatched trunk but nowhere near the 10.1 performance. I am using
          java version "1.7.0"
          Java(TM) SE Runtime Environment (build pwi3270sr5-20130619_01(SR5))
          IBM J9 VM (build 2.6, JRE 1.7.0 Windows 7 x86-32 20130617_152572 (JIT enabled, AOT enabled)
          J9VM - R26_Java726_SR5_20130617_1436_B152572
          JIT - r11.b04_20130528_38954ifx1
          GC - R26_Java726_SR5_20130617_1436_B152572
          J9CL - 20130617_152572)
          JCL - 20130616_01 based on Oracle 7u25-b12

          I am not sure if I made an error with the merge or if there is some other regression that has been masked by this one.

          Kathey Marsden added a comment -

          Verified this is still a problem. On my machine the 10.1 query takes 13 seconds but 203 seconds on trunk.

          Kathey Marsden made changes -
          Issue & fix info High Value Fix,Repro attached [ 10422, 10424 ] High Value Fix,Newcomer,Repro attached [ 10422, 10423, 10424 ]
          Gavin made changes -
          Workflow jira [ 12391074 ] Default workflow, editable Closed status [ 12802014 ]
          Kathey Marsden made changes -
          Labels derby_triage10_5_2 derby_triage10_11
          Bryan Pendleton added a comment -

          I don't consider the patch high risk. I think it is ready to pursue. I think it's reasonable
          to mark this as a newcomer issue. I would be willing to help work with anybody who
          has the time/motivation to push this forward.

          Kathey Marsden added a comment -

          Is the patch considered high risk? Would it be appropriate to mark this newcomer if someone wants to try out Army's patch, run tests and drive this issue to resolution?

          Kathey Marsden made changes -
          Labels derby_triage10_5_2
          Bryan Pendleton made changes -
          Link This issue relates to DERBY-4818 [ DERBY-4818 ]
          Bryan Pendleton added a comment -

          Hi Rick,

          I think that Army's patch is a good one, but I'm afraid I don't have time right
          now to push this issue forward.

          Rick Hillegas added a comment -

          Hi Bryan,

          You contributed a patch for this issue more than 3 years ago but the patch probably needs to be updated before it can be applied now. Are you interested in pursuing this?

          Thanks,
          -Rick

          Knut Anders Hatlen made changes -
          Urgency Normal
          Issue & fix info [High Value Fix] [High Value Fix, Repro attached]
          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2.

          Dag H. Wanvik made changes -
          Bug behavior facts [Regression] [Performance, Regression]
          Issue & fix info [High Value Fix]
          Dag H. Wanvik made changes -
          Bug behavior facts [Performance, High Value Fix] [Regression]
          Dag H. Wanvik made changes -
          Derby Categories [High Value Fix] [High Value Fix, Performance]
          Dag H. Wanvik made changes -
          Component/s Performance [ 11709 ]
          Kathey Marsden made changes -
          Derby Info [Regression]
          Derby Categories [High Value Fix]
          Kathey Marsden added a comment -

          I wonder if the patch attached to this issue can be pursued as is or if there is more work that needs to be done. It seems it would be good to get this performance regression fixed.

          Hide
          Bryan Pendleton added a comment -

          With jumpReset.patch applied, I cannot reproduce the varying optimize times.
          The times are all in a tight range, after nearly 6x as many tests as produced
          the variable optimize times before.

          A B made changes -
          Attachment jumpReset.patch [ 12347032 ]
          A B added a comment -

          > That is, once I saw 500 seconds for the prepare on database 6, but I
          > reran DB 6 and saw 168 seconds. [...] I don't think that the non-
          > determinism is "tied" to a particular database and set of tables;
          > the same database and tables can sometimes lead to one prepare time,
          > and sometimes to a different prepare time.

           As I was looking at this problem I noticed a bug in the code that may be causing this variance. It looks like we are not resetting the permuteState field in OptimizerImpl at the beginning of each "round" of optimization. The result (in the cases I saw) was that the field has the value WALK_LOW at the start of the second round of optimization, which, in very timing-sensitive situations, can lead to a code-path that causes the bestCost for the round to remain UNSET--i.e. it stays at Double.MAX_VALUE. Then if we try to add anything to that cost, we end up with Infinity, which effectively ruins most subsequent cost estimates.
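
           As a tiny, self-contained illustration of why an UNSET bestCost is so
           damaging (plain Java arithmetic, not Derby code; the only assumption is
           that costs live in ordinary doubles, as described above):

               public class UnsetCostDemo {
                   public static void main(String[] args) {
                       // bestCost left "UNSET", i.e. sitting at Double.MAX_VALUE.
                       double bestCost = Double.MAX_VALUE;

                       // Adding another large cost estimate overflows to Infinity...
                       double total = bestCost + 1.0e300;
                       System.out.println(total);                  // Infinity

                       // ...and once Infinity (and then NaN) enters the arithmetic,
                       // later "is this plan cheaper?" comparisons become meaningless:
                       System.out.println(total - total);          // NaN
                       System.out.println((total - total) < 1e9);  // false: NaN comparisons are false
                   }
               }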

          I still have not been able to reproduce the variance that Bryan reports, so I cannot say for sure that this is the cause. I ran the repro 10 times against the same database and I consistently saw times of 550 to 700 seconds--and in each case I could see that the optimizer tried out all 40k permutations before completing. Then I added code to reset the permuteState variable (see attached jumpReset.patch) and ran the repro 10 more times against the same database. In this case every execution completed in about 200 to 220 seconds, and I could see that the optimizer timed out before trying out all permutations (which is more what I would expect).

           The reason I bring this up is that the compilation time without jumpReset.patch and the compilation time with it appear to correlate with the times that Bryan is seeing in his variance (based on his previous comment). So I'm wondering if the two are related...?

          Bryan, I know you are busy and that running these tests takes a while, but if you have the time/inclination, can you apply jumpReset.patch to your codeline and run your tests again? I'm wondering what effect (if any) the patch will have on the variance that you are seeing...

          Bryan Pendleton added a comment -

          > My guess is that if you created 10.1 database A, and ran 10.3 against
          > it, and then created 10.1 database B, and ran 10.3 against it, you
          > might see dramatic differences even in just those two runs.

          I was able to confirm this.

          I created 10 separate databases using 10.1.2.1, and ran the DDL portion of repro.sql
          against each of those 10 databases using 10.1.2.1.

          Then I ran just the PREPARE portion of repro.sql against each of those 10 databases,
          using the 10.3 trunk, and setting derby.database.allowPreReleaseUpgrade=true so
          that 10.3 would be willing to access the 10.1 database.

          Often, I observed 170-180 seconds for the prepare.

          But I also observed 500-650 seconds for the prepare.

          Furthermore, it wasn't on the same database each time!

          That is, once I saw 500 seconds for the prepare on database 6, but I reran DB 6
          and saw 168 seconds.

          And, the first time I did the prepare on DB 8, it was 170 seconds, but the second
          time I did the prepare on DB 8, it was 607 seconds.

          So my observations are that:

          • 10.3 is non-deterministic regarding the amount of time that the PREPARE takes
          • it happens whether the DB and schema are created using 10.1, or 10.3
          • the same DB, with the same DDL, can sometimes prepare the query in time X,
            and sometimes in time Y, and as far as I can tell nothing is changing
            between runs. I just run IJ with the "prepare" script twice in a row, and
            get two very different elapsed times.

          > I'm wondering if there is something going on, correlated to the DDL
          > statements executed by the repro script, which sometimes results in
          > tables that return "reasonable" estimates, and sometimes results in
          > tables that return "infinite" estimates.

          I now believe this is not an accurate theory, as the non-determinism does not
          seem to be correlated to a particular run of the DDL statements.

          So at this point I don't think that the non-determinism depends on whether the
          database and its tables were created by 10.1, or by 10.3, and I don't think that
          the non-determinism is "tied" to a particular database and set of tables; the
          same database and tables can sometimes lead to one prepare time, and sometimes
          to a different prepare time.

          A B added a comment -

          > Is it possible that the 10.1 vs 10.3 database is a red herring here,
          > and this is simply the same "non-determinism" that I was observing
          > in my initial tests?

          Good point, very well could be. That didn't occur to me.

          > My guess is that if you created 10.1 database A, and ran 10.3 against
          > it, and then created 10.1 database B, and ran 10.3 against it, you
          > might see dramatic differences even in just those two runs.

          Have you yourself seen this variance with 10.1 databases, as well? I was limiting myself to the details in the description of this issue, which only mention the variance with 10.2 and trunk. If you are also seeing the variance in 10.1, then can we assume that the non-deterministic variance aspect of this issue is not a regression, but is an existing problem with Derby since who-knows-when? That might correlate with Mike's earlier comments about that being a potential issue in the past (hash codes based on DDL)...

          > I'm wondering if there is something going on, correlated to the DDL
          > statements executed by the repro script, which sometimes results in
          > tables that return "reasonable" estimates, and sometimes results in
          > tables that return "infinite" estimates.

           This is a great question. A follow-up question is whether this DDL-based variance is a regression from 10.1 or has been there from the "beginning".

          As you can tell, my primary focus here has been on potentially regressed behavior. As I've been looking at this problem I have been asking myself "what changed from 10.1 to 10.2 to cause this sudden slow-down?" So that's where my mind (and all of my questions) have been leading...

          And on that point, I mentioned in my last comment that the JUMP behavior "does not take effect" for 10.3, but it does for 10.1. I spent some time tracing through the code and quickly realized that this change in behavior is an indirect result of DERBY-1357. Here's why.

          As explained on the wiki page that describes jumping (http://wiki.apache.org/db-derby/JoinOrderPermutations), the optimizer will find the first complete join order for a query before attempting to jump. In the repro query the first join order is simply "0 1 2 3 4 5 6 7". The associated cost is 5.607E62. Using the row counts calculated for this join order, the "target" join order then becomes "1 4 2 6 7 0 3 5". This means that the next 8 calls to "getNextPermutation()" will return partial join orders that build up to this target. i.e.

          1 -1 -1 -1 -1 -1 -1 -1
          1 4 -1 -1 -1 -1 -1 -1
          1 4 2 -1 -1 -1 -1 -1
          etc.

           Prior to DERBY-1357 the optimizer "short-circuit" logic was not working, so even if a partial join order is more expensive than the best complete join order so far, the optimizer would continue to waste time optimizing join orders that are always going to return an estimate that is higher than the best so far. So in 10.1 (which does not have the fix for DERBY-1357) we'll merrily continue on our way from [1 4 2 ... ] all the way up until we either time out or find a join order that is cheaper than 5.607E62. We eventually find a cheaper join order after about 2500 permutations, and that's the point at which optimization stops (see details from my previous comment).
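
           (For anyone not familiar with the "short-circuit" idea, it is the usual
           prune-as-soon-as-you-are-already-too-expensive check. Below is a toy
           sketch with invented names and a made-up cost model; it is not Derby's
           actual implementation, just the shape of the check being discussed.)

               import java.util.ArrayList;
               import java.util.List;

               public class ShortCircuitSketch {
                   static double bestCompleteCost = Double.MAX_VALUE;
                   static List<Integer> bestOrder = null;

                   // Stand-in for whatever per-table cost model the real optimizer uses.
                   static double costOf(int table, int position) {
                       return (table + 1) * Math.pow(10, position);
                   }

                   static void search(List<Integer> partial, double partialCost, int numTables) {
                       // Short-circuit: a partial join order that already costs at least
                       // as much as the best complete order cannot lead to anything cheaper.
                       if (partialCost >= bestCompleteCost) {
                           return;
                       }
                       if (partial.size() == numTables) {
                           bestCompleteCost = partialCost;
                           bestOrder = new ArrayList<>(partial);
                           return;
                       }
                       for (int t = 0; t < numTables; t++) {
                           if (partial.contains(t)) continue;
                           partial.add(t);
                           search(partial, partialCost + costOf(t, partial.size() - 1), numTables);
                           partial.remove(partial.size() - 1);
                       }
                   }

                   public static void main(String[] args) {
                       search(new ArrayList<Integer>(), 0.0, 4);
                       System.out.println("best order " + bestOrder + " cost " + bestCompleteCost);
                   }
               }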

          With 10.2 and trunk, though, the DERBY-1357 short-circuit logic has been corrected. And as it happens, the estimated cost for the partial join order "1 4 2 6 7 0 3" is "Infinity"; since that's greater than 5.607E62, we do not advance the join order and thus we effectively skip the rest of it (because we know it's going to return a cost that's worse than 5.607E62). That part is, I think, correct.

          But having chosen to NOT advance the join order, now we come to the following block of code in OptimizerImpl:

           if (permuteState == JUMPING && !joinPosAdvanced && joinPosition >= 0)
           {
               //not feeling well in the middle of jump
               // Note: we have to make sure we reload the best plans
               // as we rewind since they may have been clobbered
               // (as part of the current join order) before we gave
               // up on jumping.
               reloadBestPlan = true;
               rewindJoinOrder(); //fall
               permuteState = NO_JUMP; //give up
           }

          Since we're in the middle of jumping and since we did not advance the join order, we execute this if block--i.e. we rewind the join order back to the beginning and start all over. Or put another way, we gave up on the "JUMP" and went back to normal processing. From there we then have to go through thousands and thousands of permutations before we find a plan that's cheaper than 5.607E62. Depending on which database is in use, that can be ten thousand permutations or it could be forty thousand, which explains the variance in the time taken to complete optimization.

          I think the changes for DERBY-1357 are correct; what's not clear is whether the "if" block shown above is really necessary. The idea sort of makes sense to me: if we broke away from normal processing and jumped to a bad join order then we want to abort the jump as quickly as possible and return to normal processing. However, now that the optimizer "short-circuit" logic is fixed, I'm not so sure we need to revert back to "normal" processing when we jump to a bad join order. If the join order is bad then the short-circuit logic will make sure that we do not waste "too much" (yes, that's an awfully subjective term) time trying out bad join orders. So would it make sense to just "iterate" the target join order as usual without defaulting back to square one?

          As a sanity check I commented the above "if" block out in the trunk codeline and ran the repro; when I did so, the script executed in about the same time as it does on the latest 10.1 branch. So this seems to confirm my findings. FAR more testing is needed, though, to see if this simple removal of the "if" block is a viable solution.

          All of that said, I still think the BIGGER problems we are seeing are 1) DERBY-1905 and 2) as Bryan has said, the apparent variance in cost estimates based on DDL. DERBY-1905 is definitely playing a huge role here, what with the Infinity cost estimates followed by NaN. These kinds of estimates render the vast majority of the join order estimates useless, even if we do remove the "if" block mentioned above.

          Also note: when I remove the "if" block as mentioned above, I found that sometimes the cost estimate for the "best" join order ("2 0 1 3 4 5 6 7") was "14116", and sometimes it was "247". This was on the same database and so does not appear to be the result of DDL statements. I'm still looking into this little tidbit to see what might be going on there...

          As usual, some of this may be useful, some of it may not be. Please excuse me if I'm babbling...

          Bryan Pendleton added a comment -

          > When running against a 10.1 database 10.3 estimates a cost of 1736685, followed up by a
           > cost of 198589 for the next complete order, which leads to eventual timeout. But when
          > running against a 10.3 database, the cost estimate is Infinity and thus the timeout value
          > sits at 5E62, which we never reach.

          Is it possible that the 10.1 vs 10.3 database is a red herring here, and this is simply the
          same "non-determinism" that I was observing in my initial tests? That is, you ran against two
          different database, thus you ran all the CREATE TABLE/INDEX/VIEW statements twice,
          and got hugely different results. I definitely saw dramatically different results even in the
          10.3-only case from run to run, when running the entire script which recreates all the tables each time.

          My guess is that if you created 10.1 database A, and ran 10.3 against it, and then created
          10.1 database B, and ran 10.3 against it, you might see dramatic differences even in just
          those two runs.

          I'm wondering if there is something going on, correlated to the DDL statements executed
          by the repro script, which sometimes results in tables that return "reasonable"
          estimates, and sometimes results in tables that return "infinite" estimates.

          A B added a comment -

          Thanks for the query plans, Bryan. I took a quick look and nothing stood out, but I'll have to look more closely before I can say anything for sure. Thank you for taking the time to pull out the "noise"--I think that'll be helpful.

          On a different note, I added some System.outs to various places in the Optimizer to see what cost estimates the optimizer is calculating for different complete join orders in the top-level OptimizerImpl. I then ran the repro query against the latest 10.1 branch and the latest 10.3 trunk. The abbreviated output from what I saw is shown below.

          Some things to note:

          0. Lots of "Infinity" and "NaN" cost estimates, which are probably the result of DERBY-1905 and DERBY-1260, respectively. This pretty much renders all but a handful of the cost estimates useless.

          1. In 10.1.3 we try to jump to a target join order and we SUCCESSFULLY do so. In 10.3 we also try to jump to the same target join order, BUT for some reason the jump does NOT take effect. So it looks like Bryan's question about whether or not permutation jumping works correctly was indeed well-founded. This is, I think, the first thing to investigate.

          2. Because the 10.3 "jump" does not occur we spend a lot of time iterating through different permutations until we get to the "best join order" found by 10.1. Once we get there the estimated cost for 10.3 (on a 10.1 database) is finite but is much higher than the cost estimate of the same join order in 10.1 (namely, "247" in 10.1 vs "1736685" in 10.3). So this is another point of interest. This could again be the result of DERBY-1260, but it's hard to say at this point.

           3. This next observation is weird: when running 10.3 on a 10.1 database the query finishes after about 10,000 iterations (roughly 5 minutes on my laptop). But when running 10.3 on a 10.3 database the query runs until it has exhausted all possible permutations (over 40,000). The difference is in the calculated cost estimate for the "2 0 1 3 4 5 6 7" join order. When running against a 10.1 database 10.3 estimates a cost of 1736685, followed up by a cost of 198589 for the next complete order, which leads to eventual timeout. But when running against a 10.3 database, the cost estimate is Infinity and thus the timeout value sits at 5E62, which we never reach. I have absolutely no idea why this would be the case. Anyone out there have any theories? Bryan, can you perhaps try this out to see if you observe the same behavior? You can connect to a 10.1 database by setting the "derby.database.allowPreReleaseUpgrade" property to "true". Note that you should only use that on a test 10.1 database, though; I don't think it's intended for production db's.
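
           For reference, one way to set that property from a standalone JDBC test
           program is sketched below. The database path and URL are made up, and
           derby.jar is assumed to be on the classpath; passing
           -Dderby.database.allowPreReleaseUpgrade=true on the ij command line
           should have the same effect.

               import java.sql.Connection;
               import java.sql.DriverManager;

               public class PreReleaseUpgradeExample {
                   public static void main(String[] args) throws Exception {
                       // Let a newer (pre-release) engine open the older test database.
                       // Only for throwaway test databases, as noted above.
                       System.setProperty("derby.database.allowPreReleaseUpgrade", "true");

                       // Database path is illustrative.
                       try (Connection conn =
                                DriverManager.getConnection("jdbc:derby:/tmp/db10_1")) {
                           System.out.println("connected: "
                               + conn.getMetaData().getDatabaseProductVersion());
                       }
                   }
               }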

          Generally speaking lines in the output are of the form:

          -=> estimatedCost: j o i n O r d e r

          --------------------------
          10.1.3 on 10.1.3 database:
          --------------------------

          ij version 10.1
          -=> 5.607246651648704E62: 0 1 2 3 4 5 6 7
          -==> Just set bestCost to: 5.607246651648704E62
          => JUMPING to: 1 4 2 6 7 0 3 5
          -=> Infinity: 1 4 2 6 7 0 3 5
          -=> NaN: 1 4 2 6 7 0 5 3
          -=> NaN: 1 4 2 6 7 3 0 5
          -=> NaN: 1 4 2 6 7 3 5 0
          -=> NaN: 1 4 2 6 7 5 0 3
          -=> NaN: 1 4 2 6 7 5 3 0
          -=> NaN: 1 4 2 7 0 3 5 6
          -=> NaN: 1 4 2 7 0 3 6 5
          -=> NaN: 1 4 2 7 0 5 3 6
          -=> NaN: 1 4 2 7 0 5 6 3
          -=> NaN: 1 4 2 7 0 6 3 5
          -=> NaN: 1 4 2 7 0 6 5 3

          <snip ~2,600 permutations with cost estimates of "NaN">

          -=> NaN: 1 7 6 5 4 0 2 3
          -=> NaN: 1 7 6 5 4 0 3 2
          -=> NaN: 1 7 6 5 4 2 0 3
          -=> NaN: 1 7 6 5 4 2 3 0
          -=> NaN: 1 7 6 5 4 3 0 2
          -=> NaN: 1 7 6 5 4 3 2 0
          -=> 247.4647753453739: 2 0 1 3 4 5 6 7
          -==> Just set bestCost to: 247.4647753453739
          => TIMED OUT: exceeded 247.4647753453739 ms.
          -=> 247.4647753453739: 2 0 1 3 4 5 6 7
          ELAPSED TIME = 82961 milliseconds

          ------------------------
          10.3 on 10.1.3 database:
          ------------------------

          (on the same database that was created by 10.1)

          ij version 10.3
          -=> 5.607246651648704E62: 0 1 2 3 4 5 6 7
          -==> Just set bestCost to: 5.607246651648704E62
          => JUMPING to: 1 4 2 6 7 0 3 5
          -=> 1.8253980349686949E106: 0 1 2 3 4 5 6 7
          -=> 3.021440681134236E107: 0 1 2 3 4 5 7 6
          -=> 8.374147544754745E119: 0 1 2 3 4 6 5 7
          -=> 8.374147544754745E119: 0 1 2 3 4 6 7 5
          -=> 5.993308284393933E109: 0 1 2 3 4 7 5 6
          -=> 2.736734332816085E108: 0 1 2 3 4 7 6 5
          -=> 3.973494182787193E115: 0 1 2 3 5 4 6 7
          -=> 6.577018684224087E116: 0 1 2 3 5 4 7 6
          -=> 3.973494182787193E115: 0 1 2 3 5 6 4 7
          -=> 6.577018684224087E116: 0 1 2 3 5 7 4 6
          -=> 8.374147544754745E119: 0 1 2 3 6 4 5 7
          -=> 8.374147544754745E119: 0 1 2 3 6 4 7 5
          -=> 1.0852895218002162E119: 0 1 2 3 6 5 4 7
          -=> 8.374147544754745E119: 0 1 2 3 6 7 4 5
          -=> 5.993308284393933E109: 0 1 2 3 7 4 5 6
          -=> 2.736734332816085E108: 0 1 2 3 7 4 6 5
          -=> 1.2945545894290903E112: 0 1 2 3 7 5 4 6
          -=> 2.736734332816085E108: 0 1 2 3 7 6 4 5
          -=> NaN: 0 1 2 4 5 3 6 7
          -=> NaN: 0 1 2 4 5 3 7 6
          -=> NaN: 0 1 2 4 5 6 3 7
          -=> NaN: 0 1 2 4 5 6 7 3
          -=> NaN: 0 1 2 4 5 7 3 6
          -=> NaN: 0 1 2 4 5 7 6 3

          <snip ~10,000 permutations with cost estimates of "NaN">

          -=> NaN: 1 7 6 5 4 0 2 3
          -=> NaN: 1 7 6 5 4 0 3 2
          -=> NaN: 1 7 6 5 4 2 0 3
          -=> NaN: 1 7 6 5 4 2 3 0
          -=> NaN: 1 7 6 5 4 3 0 2
          -=> NaN: 1 7 6 5 4 3 2 0
          -=> 1736685.050617207: 2 0 1 3 4 5 6 7
          -==> Just set bestCost to: 1736685.050617207
          -=> 198589.53739726284: 2 0 1 3 4 5 7 6
          -==> Just set bestCost to: 198589.53739726284
          => TIMED OUT: exceeded 198589.53739726284 ms.
          -=> 198589.53739726284: 2 0 1 3 4 5 7 6
          ELAPSED TIME = 336370 milliseconds

          ------------------------
          10.3 on 10.3 database:
          ------------------------

          ij version 10.3
          -=> 5.607246651648704E62: 0 1 2 3 4 5 6 7
          -==> Just set bestCost to: 5.607246651648704E62
          => JUMPING to: 1 4 2 6 7 0 3 5
          -=> 1.8253980349686949E106: 0 1 2 3 4 5 6 7
          -=> 3.021440681134236E107: 0 1 2 3 4 5 7 6
          -=> 8.374147544754745E119: 0 1 2 3 4 6 5 7
          -=> 8.374147544754745E119: 0 1 2 3 4 6 7 5
          -=> 5.993308284393933E109: 0 1 2 3 4 7 5 6
          -=> 2.736734332816085E108: 0 1 2 3 4 7 6 5
          -=> 3.973494182787193E115: 0 1 2 3 5 4 6 7
          -=> 6.577018684224087E116: 0 1 2 3 5 4 7 6
          -=> 3.973494182787193E115: 0 1 2 3 5 6 4 7
          -=> 6.577018684224087E116: 0 1 2 3 5 7 4 6
          -=> 8.374147544754745E119: 0 1 2 3 6 4 5 7
          -=> 8.374147544754745E119: 0 1 2 3 6 4 7 5
          -=> 1.0852895218002162E119: 0 1 2 3 6 5 4 7
          -=> 8.374147544754745E119: 0 1 2 3 6 7 4 5
          -=> 5.993308284393933E109: 0 1 2 3 7 4 5 6
          -=> 2.736734332816085E108: 0 1 2 3 7 4 6 5
          -=> 1.2945545894290903E112: 0 1 2 3 7 5 4 6
          -=> 2.736734332816085E108: 0 1 2 3 7 6 4 5
          -=> NaN: 0 1 2 4 5 3 6 7
          -=> NaN: 0 1 2 4 5 3 7 6
          -=> NaN: 0 1 2 4 5 6 3 7
          -=> NaN: 0 1 2 4 5 6 7 3
          -=> NaN: 0 1 2 4 5 7 3 6

          <snip ~10,000 permutations with cost estimates of "NaN">

          -=> NaN: 1 7 6 5 4 2 0 3
          -=> NaN: 1 7 6 5 4 2 3 0
          -=> NaN: 1 7 6 5 4 3 0 2
          -=> NaN: 1 7 6 5 4 3 2 0
          -=> Infinity: 2 0 1 3 4 5 6 7
          -=> NaN: 2 0 1 3 4 5 7 6
          -=> NaN: 2 0 1 3 4 6 5 7
          -=> NaN: 2 0 1 3 4 6 7 5
          -=> NaN: 2 0 1 3 4 7 5 6
          -=> NaN: 2 0 1 3 4 7 6 5
          -=> NaN: 2 0 1 3 5 4 6 7

          <snip ~30,000 permutations, most with cost estimates of "NaN", a few with estimates of Infinity>

          -=> NaN: 7 6 5 4 3 1 0 2
          -=> NaN: 7 6 5 4 3 1 2 0
          -=> NaN: 7 6 5 4 3 2 0 1
          -=> NaN: 7 6 5 4 3 2 1 0
          => ENDED because ran out of permutations.
          -=> 5.607246651648704E62: 0 1 2 3 4 5 6 7
          ELAPSED TIME = 1145530 milliseconds

          I'm still trying to make sense of this data, but perhaps this info will be helpful in tracking down the issue(s) here...

          Bryan Pendleton made changes -
          Attachment plan10_2.txt [ 12346469 ]
          Attachment plan10_1_2_1.txt [ 12346468 ]
          Attachment plans.diff [ 12346470 ]
          Bryan Pendleton added a comment -

          Attached are the query plans that I get when running with 10.1.2.1 and with 10.2.

          I also attached 'plans.diff', which is a diff of the plans after editing the plans to remove "noisy" differences such as the floating point values of the estimated costs.

          The plans are very similar, but not identical.

          It seems that 10.2 is, in several cases, preferring indexed access to tables, which I think shows that the predicate pushdown changes (DERBY-805 and friends) are working.

          Other than that, my extremely coarse reading of the plans is that the primary difference between the two plans is the estimated costs.

          Bryan Pendleton added a comment -

          > So far as I know, the jumping code is working properly.

          Thanks, that's good to know. It actually was an innocent question; I'm
          not holding anything up my sleeve. My train of thought was something like:

          • 7+ minutes of CPU time is a lot on today's machines
          • therefore, we're either evaluating an astounding number of query plans,
          • or, we're burning a lot of CPU evaluating the plans,
          • or both
          A B added a comment -

          Everything written in Bryan's preceding comment sounds correct to me, so I won't dwell. As for the specific questions:

          > I've seen almost no discussion of the first point (that the cheapest actual plan
          > should have the cheapest estimated cost); does this mean that we are pretty
          > confident about this aspect of cost estimation at this point? That is, the cost
          > estimating may be off, but it seems to be off for all queries equally?

          This has been my general assumption about the code, yes, at least after DERBY-1007 was resolved. It seems to me that at some point in the last year I found a scenario where the optimizer's "best" cost estimate did not (appear to) correspond to the best query plan, but I don't remember the details and it may have ended up being correct after all. In any event, in all of the discussion that I've had/written, my general assumption has been that "Yes", the relative accuracy of the cost estimates is correct, i.e. that better plans have lower cost estimates. Note, though, that this is just an assumption of mine which I have not bothered trying to debunk; if you find info to the contrary, please say so!

          > DERBY-1907 is certainly relevant here; are there other issues logged like this?

          None come to mind, no. But I admit that's an answer based strictly on memory; I didn't actually do any searching...

          > I guess I'm wondering (out loud) whether it is worth investigating a simple tuning
          > of the cost estimation algorithm. If the optimizer was much faster at generating
          > and estimating possible plans, wouldn't that be a big benefit?

          Yes, definitely!

          > Also, how confident are we that permutation jumping (as described in
          > http://wiki.apache.org/db-derby/JoinOrderPermutations) is working properly?

          I had to laugh out loud when I read this question. It sounds to me like the kind of question someone asks when they've found a somewhat serious bug but don't want to rock the boat.

          So far as I know, the jumping code is working properly. But if you told me there was a problem with the code, I think I'd assume you were right. Is that the case?

          Bryan Pendleton added a comment -

          Just speaking abstractly here for a minute, I want to make sure I'm following the discussion.

          As I understand it, the Optimizer's overall top-level algorithm is (a rough code sketch follows the list below):

          • Consider every possible query plan for executing this query (modulo permutation "jumping")
          • Evaluate the cost of each such plan.
          • Select the cheapest.
          • If at any time we determine that we've now spent longer evaluating plans
            than it would have taken to execute the cheapest plan seen so far, then
            quit and go run that plan (this is known as OptimizerTimeout)
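
          To make sure I'm reading that loop correctly, here is a rough sketch of it in Java. All of the
          names and types below are invented for illustration; this is not the actual OptimizerImpl code,
          and it assumes the cost estimates are meant to be "estimated milliseconds to execute the plan":

            import java.util.List;

            class ToyOptimizer {

                interface CostModel {
                    double estimate(int[] joinOrder);
                }

                static int[] chooseJoinOrder(List<int[]> joinOrders, CostModel costModel) {
                    long start = System.currentTimeMillis();
                    int[] best = null;
                    double bestCost = Double.MAX_VALUE;

                    for (int[] order : joinOrders) {
                        double cost = costModel.estimate(order);

                        // Keep the cheapest plan seen so far. A NaN estimate never compares
                        // lower than bestCost, so a NaN-costed plan can never become "best".
                        if (cost < bestCost) {
                            bestCost = cost;
                            best = order;
                        }

                        // OptimizerTimeout: quit once we have spent longer optimizing than the
                        // cheapest plan found so far is estimated to take. If bestCost has blown
                        // up to something like 1e62 or Infinity, this check can never fire and we
                        // grind through every remaining permutation.
                        if (System.currentTimeMillis() - start > bestCost) {
                            break;
                        }
                    }
                    return best;
                }
            }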

          Is that an accurate summary? If so, it seems that:

          1) The most important thing about the Optimizer's cost estimation algorithm
          is that it be "relatively accurate": the cheapest actual plan should
          have the lowest estimated cost.

          2) The second most important thing is that the algorithm should strive to
          be accurate, so that the optimizer timeout works effectively.

          3) Another important thing is that the algorithm should be as rapid as
          possible, so that the optimizer can evaluate as many plans as possible
          prior to timing out.

          It seems that most of the discussion so far has been around item (2), regarding
          the accuracy and interpretation (units of measurement) of the cost estimate.
          DERBY-1905, DERBY-1906, DERBY-1908, DERBY-1259, DERBY-1260 all
          have to do with this topic.

          I've seen almost no discussion of the first point (that the cheapest actual plan
          should have the cheapest estimated cost); does this mean that we are pretty
          confident about this aspect of cost estimation at this point? That is, the cost
          estimating may be off, but it seems to be off for all queries equally?

          And I haven't seen much discussion of the third point (that the Optimizer should
          strive to evaluate plans as rapidly as possible, so as to increase its chances of
          coming across a decent plan prior to timeout). DERBY-1907 is certainly relevant
          here; are there other issues logged like this?

          I guess I'm wondering (out loud) whether it is worth investigating a simple tuning
          of the cost estimation algorithm. If the optimizer was much faster at
          generating and estimating possible plans, wouldn't that be a big benefit?

          Also, how confident are we that permutation jumping (as described in
          http://wiki.apache.org/db-derby/JoinOrderPermutations) is working properly?

          Bryan Pendleton added a comment -

          Army, thank you for the hint about the optimizer overrides. I will investigate this
          as a possible workaround for the short term.

          A B added a comment -

          Just a random thought that occurred to me...

          The query in question is written as:

          prepare p1 as '
          select count(distinct t1.c5)
          from table19 t1
          inner join table05 t3 on ( t1.c5 = t3.c2)
          inner join table06 t4 on ( t3.c33 = t4.c1)
          inner join v3 t5 on ( t4.c34 = t5.c1)
          inner join table20 t2 on ( t5.c3 = t2.c5)
          inner join v1 t6 on ( t2.c35 = t6.c1)
          inner join v2 t7 on ( t6.c3 = t7.c5)
          inner join v1 t8 on ( t7.c22 = t8.c1)
          where (((t8.c3 = ? or t8.c3 = ? or t8.c3 = ? or t8.c3 = ?)))';

          Is the use of the explicit "inner join" syntax an intentional decision? Or more specifically, is the intention that the tables are to be joined in a specific order? From the Optimizer standpoint this query becomes one large outer query with 8 Optimizables in the from list, so the Optimizer will attempt to evaluate all combinations of the 8 Optimizables before quitting--which takes a (long) while. If, however, the query was written with a specific join order in mind, you could use Derby optimizer overrides to fix the join order of the 8 Optimizables to be the join order in which they are specified:

          prepare p1 as '
          select count(distinct t1.c5)
          from --DERBY-PROPERTIES joinOrder=FIXED
          table19 t1
          inner join table05 t3 on ( t1.c5 = t3.c2)
          inner join table06 t4 on ( t3.c33 = t4.c1)
          inner join v3 t5 on ( t4.c34 = t5.c1)
          inner join table20 t2 on ( t5.c3 = t2.c5)
          inner join v1 t6 on ( t2.c35 = t6.c1)
          inner join v2 t7 on ( t6.c3 = t7.c5)
          inner join v1 t8 on ( t7.c22 = t8.c1)
          where (((t8.c3 = ? or t8.c3 = ? or t8.c3 = ? or t8.c3 = ?)))';

          When I did this, the 'prepare' finished in about 3 seconds on the same Red Hat Linux machine with the latest trunk (sane build).
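
          In case it's useful, here is roughly what that prepare might look like from JDBC. This is just a
          sketch: the connection URL is a placeholder for the repro database, and the one detail worth
          noting is that --DERBY-PROPERTIES is a SQL line comment, so the newline after it must be kept:

            import java.sql.Connection;
            import java.sql.DriverManager;
            import java.sql.PreparedStatement;

            public class FixedJoinOrderPrepare {
                public static void main(String[] args) throws Exception {
                    // Placeholder URL; point this at the database created by repro.sql.
                    Connection conn = DriverManager.getConnection("jdbc:derby:reproDB");

                    // The optimizer override lives in a line comment, so each line of the
                    // statement ends with an explicit newline.
                    String sql =
                        "select count(distinct t1.c5)\n" +
                        "from --DERBY-PROPERTIES joinOrder=FIXED\n" +
                        "table19 t1\n" +
                        "inner join table05 t3 on (t1.c5 = t3.c2)\n" +
                        "inner join table06 t4 on (t3.c33 = t4.c1)\n" +
                        "inner join v3 t5 on (t4.c34 = t5.c1)\n" +
                        "inner join table20 t2 on (t5.c3 = t2.c5)\n" +
                        "inner join v1 t6 on (t2.c35 = t6.c1)\n" +
                        "inner join v2 t7 on (t6.c3 = t7.c5)\n" +
                        "inner join v1 t8 on (t7.c22 = t8.c1)\n" +
                        "where t8.c3 = ? or t8.c3 = ? or t8.c3 = ? or t8.c3 = ?";

                    long start = System.currentTimeMillis();
                    PreparedStatement ps = conn.prepareStatement(sql); // optimization happens here
                    System.out.println("prepare took "
                            + (System.currentTimeMillis() - start) + " ms");

                    ps.close();
                    conn.close();
                }
            }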

          Of course, I still think this issue is one that needs to be resolved. But if a) you're looking for a temporary workaround and b) you know the order in which you would like the tables to be joined, maybe the optimizer override can help in the short term.

          Please note, though, that Derby optimization as a whole would benefit a lot more if you continued to help resolve this problem.

          A B added a comment -

          > Some items might be:

          > 1) update the store costs to reflect current machines
          > o update the store costs to reflect "actual" machine it is running on
          > o figure out why there is such variance in DERBY-2130
          > o investigate why optimizer costs are so out of line - is it just
          > wrong logic, is it lack of some sort of cost estimate so one bad
          > guess on top of another.
          > o DERBY-47

          > I am interested in doing 1, but I don't think it is going to help
          > with infinite costs.

          > Is anyone else interested in looking at this?

          I am interested both in this particular issue (DERBY-2130) and also in #3: why the optimizer cost estimates are so out of line, i.e. DERBY-1905. I haven't started just yet, but I would like to look more into DERBY-1905 to see if I can pull anything up. I hope to start looking at this very soon...

          For what it's worth, I ran the repro attached to this issue with various sets of Derby insane jars, and the timings I saw were as follows. I ran the repro 3 times for each set of jars on Red Hat Linux using ibm142. Unit of measurement is milliseconds (returned via the "elapsedtime on" ij command).

          10.1 Pre-805:

          • 141570
          • 142463
          • 140686

          10.2 Pre-805:

          • 143794
          • 139213
          • 138892

          10.1 latest branch:

          • 43494
          • 37229
          • 37532

          10.2 Post-805, Pre-781 (10.2.0.4, svn 423199):

          • NPE (caused by DERBY-1777)

          10.2 latest branch:

          • 470067
          • 438084
          • 433038

          Things to notice:

          • It looks like the slow-down is not caused by the DERBY-805 changes--at least not directly. On the contrary, the query actually runs quite a bit more quickly with the DERBY-805 changes (and all subsequent patches, esp. DERBY-1777) than it does without. Or put another way, the query runs faster with the latest 10.1 codeline than it does with 10.1.2.1. Of course, it's possible that changes for DERBY-805 are mingling with more recent changes to cause the slowdown, but we at least know that the DERBY-805 changes alone are not the problem.
          • The run at revision 423199 fails with an NPE, which corresponds to Bryan's results when he ran with 10.1.3.1. In both cases the DERBY-805 changes have been committed but the subsequent regression fixes (including DERBY-1777) have not, hence the NPE.
          • I did not see the non-deterministic behavior that Bryan did with the 10.2 trunk--instead, I consistently saw 7+ minutes. However, I only ran the query 3 times, which is probably not enough to have seen what Bryan saw. Also note that I only ran the DDL once and then re-ran the query for the second and third runs. So, given Mike's previous comment, this may be why I didn't see the variance that Bryan did.

          Not sure how useful this information is, but that's what I've found so far. Still scratching my head and poking around, though, so maybe something will come up. (My inclination is toward DERBY-781, but I don't really know how just yet...)

          Mike Matrigali added a comment -

          The variance in optimizer times is puzzling, as I thought the order of plans picked was pretty much
          the same. I think 2 things would help in understanding this:
          1) Have a trace flag that tells us if the prepare ended because of timeout or because of plan exhaustion.
          2) Can you reproduce the variance if you only do the prepare on the same db, without redoing the ddl?
          I think I remember some issue with the order that tables are looked at being related to hash codes,
          which might vary from ddl to ddl.
          Is this an issue only if the entire optimizer is timing out, or are there any "sub-timeouts" where we
          might short circuit a list of subquery options?
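
          To illustrate the kind of hash-code dependence I mean, here is a toy example (not Derby code):
          iterating a hash-based collection keyed by objects that fall back on identity hash codes can
          produce a different order on every JVM run, so anything seeded at DDL time could plausibly see
          the tables in a different order after the DDL is redone:

            import java.util.HashSet;
            import java.util.Set;

            public class HashOrderDemo {
                // No hashCode()/equals() override, so identity hash codes are used.
                static class TableDescriptor {
                    final String name;
                    TableDescriptor(String name) { this.name = name; }
                    public String toString() { return name; }
                }

                public static void main(String[] args) {
                    Set<TableDescriptor> tables = new HashSet<TableDescriptor>();
                    tables.add(new TableDescriptor("table19"));
                    tables.add(new TableDescriptor("table05"));
                    tables.add(new TableDescriptor("table06"));
                    tables.add(new TableDescriptor("table20"));

                    // The printed order depends on this run's identity hash codes, so two
                    // runs of the same program may print the tables in different orders.
                    System.out.println(tables);
                }
            }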

          Mike Matrigali added a comment -

          What continues to puzzle me about these issues is that the
          estimated costs are so out of control. Without a major
          rewrite of the optimizer, the basic assumption is that the
          estimated costs are somewhat close to the actual elapsed
          time of the query. Timeout depends on this, and thus
          optimizer "performance" depends on this as we tend to
          loop through ever increasing number of plans as each
          variable is added.

          I think the key is to get the optimizer to time out correctly.
          DERBY-1907 would be nice but the whole system is just broken
          if estimates are so wrong.

          Army posted some "simple" problem cases, that is where I
          would start. It seems like all of the following may apply
          to your case, probably good to validate there is not something
          new going on with your query.:

          DERBY-1908 Investigate: What's the "unit" for optimizer cost estimates?
          DERBY-1907 Improve optimizer to prune query plans more aggressively.
          DERBY-1906 Investigate appropriateness of current optimizer timeout mechanism.
          DERBY-1905 Optimizer cost estimates for subqueries are way (way) too high.
          DERBY-47 IN list optimization

          One approach that would keep the current basic optimizer technology would be to see if we can fix the estimates. Some items might be:

          1) update the store costs to reflect current machines
          o update the store costs to reflect "actual" machine it is running on
          o figure out why there is such variance in DERBY-2130
          o investigate why optimizer costs are so out of line - is it just
          wrong logic, is it lack of some sort of cost estimate so one bad
          guess on top of another.
          o DERBY-47

          I am interested in doing 1, but I don't think it is going to help
          with infinite costs.

          Is anyone else interested in looking at this? The repro for
          2130 is pretty ugly; I wonder if we could agree on a smaller problem
          to work on that would hopefully also help with the harder problem.

          Bryan Pendleton added a comment -

          One experiment I tried was to remove all the secondary
          indexes. I thought that that might shrink the overall "search
          space" that the optimizer was dealing with. However, it
          actually seemed to make matters worse, not better.

          Bryan Pendleton made changes -
          Attachment repro.sql [ 12346070 ]
          Bryan Pendleton added a comment -

          The reproduction script is 'repro.sql'.

          Bryan Pendleton created issue -

            People

            • Assignee:
              Mike Matrigali
              Reporter:
              Bryan Pendleton
            • Votes:
              1
              Watchers:
              3
