DERBY-1905

Optimizer cost estimates for subqueries are way (way) too high.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.1.3.2, 10.2.1.6, 10.3.1.4
    • Fix Version/s: None
    • Component/s: SQL
    • Urgency: Normal
    • Issue & fix info: High Value Fix, Repro attached
    • Bug behavior facts: Performance

      Description

      If I run the attached repro (pulled from DERBY-1866) with derby.optimizer.noTimeout=true (meaning the optimizer should take all the time it needs to try out all possible plans within the restrictions of the optimizer overrides), the estimated cost shown in derby.log (if I log the query plan) is over 600k and the estimated row count is over 520k.

      If, as the code in OptimizerImpl seems to expect, the unit for a cost estimate is milliseconds, then the optimizer here is guessing that the query will take over 10 minutes to execute and will return a half-million rows. But in truth the combined time for compilation AND execution is just a couple of seconds, and only 15 rows are returned.
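
      For reference, the setup that produces this logging can be reproduced with two standard Derby diagnostic properties (assuming they are placed in derby.properties, or set as system properties, before running the repro through ij):

          # derby.properties -- let the optimizer consider all candidate plans
          # and write the chosen plan, with its estimated cost and estimated
          # row count, to derby.log
          derby.optimizer.noTimeout=true
          derby.language.logQueryPlan=true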

      That suggests to me a rather serious problem with the optimizer cost estimates for subqueries.

      Among other things this can have a major impact on the optimizer's timeout mechanism for very deeply-nested queries. The optimizer will continue to try out different combinations of indexes/joinOrders/joinStrategies until it either exhausts them all or until the number of milliseconds spent optimizing is greater than the "best cost" estimate so far. In the case of the repro for this issue, the optimizer quickly exhausts all of the options and so it finishes in a reasonable amount of time.
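
      As a rough, self-contained sketch of the timeout rule described above (plain Java with invented names; this is not the actual OptimizerImpl code), the interaction with inflated estimates looks like this:

          import java.util.List;

          // Illustrative only: models "stop optimizing once the milliseconds spent
          // exceed the best cost estimate found so far".  All names are invented.
          public class TimeoutSketch {
              static double pickBestCost(List<Double> candidateCosts) {
                  long start = System.currentTimeMillis();
                  double bestCost = Double.MAX_VALUE;
                  for (double cost : candidateCosts) {
                      if (cost < bestCost) {
                          bestCost = cost;
                      }
                      // If the unit really is milliseconds, a "best" estimate of
                      // 600,000 lets this loop run for ten minutes before the
                      // timeout check ever fires.
                      if (System.currentTimeMillis() - start > bestCost) {
                          break;
                      }
                  }
                  return bestCost;
              }
          }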

      But in larger queries where there are far more combinations to try (see esp. queries attached to DERBY-1205, DERBY-1777), these severely inflated cost estimates get very large very quickly (sometimes even reaching infinity--see DERBY-1259, DERBY-1260) and so the optimizer just keeps on optimizing and never times out. The result is that for queries like those in DERBY-1777, the optimizer can spend literally hours optimizing a query which then executes in a matter of seconds.

      I'm still investigating this, but preliminary examination suggests that part of the problem is in some way related to the treatment of "outer costs" by the optimizer--and in particular, it looks like the optimizer is perhaps too aggressive in multiplying cost estimates by row counts pulled from "outer costs". That's just my first guess at the problem, though; there could of course be far more to it than that...


          Activity

          A B added a comment -

          Attaching file referenced in the description for this issue.

          Bryan Pendleton added a comment -

          Can the code detect, at execution time, how accurate its estimate was?

          If so, it strikes me that it would be interesting to instrument the code, perhaps under
          SanityManager control, so that if it finds out at execution time that it made a
          particularly poor estimate, it dumps some information such as:

          • here's the query plan I chose
          • here's what I estimated it would take
          • here's what it actually took

          Then we could run a variety of tests (to start with, just run derbyall with this instrumentation)
          and see if that gives us some nice examples of estimation logic that we can fix.
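
           A rough sketch of the kind of check being suggested (invented names, written purely for illustration; this is not existing Derby code and the real hook would live wherever the plan text is logged):

               // Compare the optimizer's estimates against what actually happened
               // at execution time and complain when they are far apart.
               public class EstimateChecker {
                   private static final double SLOP = 10.0; // flag estimates off by >10x

                   public static void check(String plan,
                                             double estimatedRows, double actualRows,
                                             double estimatedCostMillis, double actualMillis) {
                       if (actualRows <= 0 || actualMillis <= 0) {
                           return;
                       }
                       if (estimatedRows / actualRows > SLOP
                               || estimatedCostMillis / actualMillis > SLOP) {
                           System.err.println("Poor estimate for plan:\n" + plan);
                           System.err.println("  estimated rows = " + estimatedRows
                                   + ", actual rows = " + actualRows);
                           System.err.println("  estimated cost = " + estimatedCostMillis
                                   + " ms, actual time = " + actualMillis + " ms");
                       }
                   }
               }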

          A B added a comment -

          > Can the code detect, at execution time, how accurate its estimate was?

          Good question. I ran the query attached to this issue using the ij "cursor" command and based strictly on that, it looks like the query plan is written to log after the query executes and the cursor has been closed.

           This seems to indicate that we still have the query plan information--including the estimated row count and the estimated cost--somewhere accessible after the query has executed. So if there's an easy way to figure out how many rows were actually returned by the cursor and also the time it took to retrieve all of the rows, then it seems like it would indeed be possible to dump some information about the accuracy of the optimizer's cost estimates.

          I don't know enough about when/how the query plan is logged nor about the state of things at that time in processing to say for sure how hard it would be for the code to determine the accuracy of the optimizer's estimates, but I agree that this is something which at least seems feasible and which could be very useful as a first step in tracking down problems with cost estimates...

          A B added a comment -

          Attaching an incomplete and potentially incorrect patch that should NOT be committed. This is just the result of some playing around that I was doing with cost estimates in the Optimizer in an attempt to resolve this issue. It's possible that the premise/theory behind the changes is incorrect and thus that this approach is not a viable solution. However, I am seeing considerable improvement in the compilation AND execution times for many large queries after applying this patch.

          As one example, I'm seeing the following times before and after applying this patch when running the DERBY-1777 repro script:

           [ Compile times only (repro does not execute the queries) ]

           Query                       Before       After
           ViewerRoot                  3.709s      3.782s
           ViewerSanL0               156.900s      9.387s
           ViewerOtherL0               0.204s      0.374s
           ViewerServerL0              0.093s      0.062s
           ViewerStorageL0             0.186s      0.145s
           ViewerSanL1               574.928s     18.915s
           ViewerOtherL1             163.653s      2.793s
           ViewerServerL1            368.006s      5.212s
           ViewerSubsystemL1         373.940s      2.862s
           ViewerTapeLibraryL1        10.633s      1.693s
           ViewerOtherL2            1368.968s     64.909s
           ViewerServerL2           2223.887s     64.595s
           ViewerPropagatedStatus     82.467s      1.906s
           ViewerSubsystemL2        1790.749s     70.922s
           ViewerTapeLibraryL2      1738.614s     67.455s
           ViewerSwitchL2           2098.310s     38.211s

           The difference in some cases is as much as 35x.

           I also ran the queries attached to DERBY-1205 before and after applying the patch. Without the patch two queries in particular take 40 and 3 seconds to compile, respectively, and execute in 3 to 5 seconds (these numbers are rough; I don't have the exact figures). With the patch they compile in 10 and 3 seconds respectively and execute in under a second.

          And finally, I ran the script attached to DERBY-2130 before and after the patch. Before the patch the query runs in about 550 to 700 seconds; after the patch, it compiles in 5 seconds and executes in less than a second.

          BUT the patch is not by any means complete. When I ran derbyall I saw at least 20 failures, many of which are execution-time NPEs that, oddly enough, result from the changes to the currentCost (which I never would have expected). Also:

          1. I don't really know anything about "sort avoidance costs", so I have no idea if my changes make sense in that regard (I sort of doubt it).

          2. As described more in the discussions for DERBY-2130, I removed an "if-block" from OptimizerImpl that actually causes an infinite loop in lang/innerjoin.sql (I hacked around that before running derbyall). So that has to be resolved.

          3. I don't know if the "theory" behind my changes makes sense.

          Generally speaking there are two big "theories" behind these changes.

           A. In the optimizeIt() methods of UnionNodes, FromTables, and JoinNodes we are multiplying the cost estimates by the row count of the "outerCost" more times than we should. This part of the patch makes sense to me and I think is probably mostly correct.

          B. We shouldn't be adding up the costs of the optimizables for a join order. At least, not the way we currently do. If I have a join order [ 1 3 2] then the join tree looks like this:

                    JOIN[0]
                   /       \
               JOIN[1]    Opt[2]
              /       \
           Opt[1]    Opt[3]

          The cost of Opt[2] is actually the cost of JOIN[0], which is the cost of joining JOIN[1] with Opt[2]. So why are we adding the cost of JOIN[1] to the cost of JOIN[0]? Isn't the cost of JOIN[1] already factored into the cost of JOIN[0]? So the changes in this patch basically remove the addition of JOIN[1] to JOIN[0] and instead say that the final cost of the join order is simply the cost of JOIN[0].

           This could very well be incorrect thinking. For one thing, I think it ignores the cost of actually scanning the rows from Opt[1] (and potentially the other opts as well?), which may be a problem. That's why this patch should not be committed, and why further discussion/investigation is required. But for what it's worth, it really speeds things up.
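
           To make the two theories concrete, here is a toy numeric sketch (invented numbers and formulas, purely for illustration; this is not how Derby actually computes costs):

               // Toy illustration of theories A and B above.
               public class CostSketch {
                   public static void main(String[] args) {
                       // Theory A: the per-probe cost should be scaled by the outer
                       // row count once...
                       double outerRows = 100;
                       double innerCostPerProbe = 5;
                       double scaledOnce = innerCostPerProbe * outerRows;              // 500
                       // ...but if several levels each multiply by the outer row
                       // count, the estimate explodes.
                       double scaledTwice = innerCostPerProbe * outerRows * outerRows; // 50,000

                       // Theory B: if JOIN[1]'s work is already included in JOIN[0]'s
                       // cost, adding JOIN[1] in again double-counts it.
                       double costJoin1 = 1000;
                       double costJoin0 = costJoin1 + 200;       // already contains JOIN[1]
                       double summed = costJoin0 + costJoin1;    // 2200 (double-counted)
                       double topOnly = costJoin0;               // 1200 (patch's approach)

                       System.out.println("A: once=" + scaledOnce + ", twice=" + scaledTwice);
                       System.out.println("B: summed=" + summed + ", top-only=" + topOnly);
                   }
               }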

          YMMV. If anyone has the time/inclination to pick this up and continue working with it (I am not actively pursuing this at the moment), please feel free!

          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2.


            People

    • Assignee: Unassigned
    • Reporter: A B
    • Votes: 1
    • Watchers: 0
