|
[
Permlink
| « Hide
]
Bryan Pendleton added a comment - 04/Apr/07 11:24 PM
derby-2526.sql is a test script which demonstrates the problem. Both SELECT statements should return 1 row, but in fact the first statement returns 1 row and the second returns 0 rows.
Attached is the output of derby.language.logQueryPlan for both queries.
goodQuery.log is the query plan for the query that returns 1 row, correctly I believe. badQuery.log is the query plan for the query that returns 0 rows, wrongly I believe. The two query plans are *substantially* different. They seem to show that the second query is
executed in a totally different fashion than the first. This would seem to rule out a variety of theories and would seem to focus attention on either: - optimizer chose an invalid query plan (one which could not produce the correct result) - or optimizer chose a valid, but different query plan, then something went wrong while executing it In the "good" query plan, table B4 is joined against the UNION, and the result
is then joined against B3. This makes sense, and seems like a reasonable plan. In the "bad" query plan, table B4 is joined against table B3, and the result is then joined against the UNION. This does *not* make sense. The query does not specify any criteria which would support such a join (unless it were a cartesian product I guess, but still I don't understand this join). In this query plan, here's an excerpt from the portion of the plan where table B4 is joined against table B3: Right result set: Table Scan ResultSet for B3 at read committed isolation level using instantaneous share row locking chosen by the optimizer Rows seen = 0 Rows filtered = 0 scan information: Bit set of columns fetched=All Number of columns fetched=4 Number of pages visited=1 Number of rows qualified=0 Number of rows visited=1 The "bad" query plan visits the B3 row, but the B3 row does not qualify, so apparently the query plan specified some sort of strange join criteria for this join between table B4 and table B3, and according to that criteria no rows qualified, causing the end result to be no rows found by the query. Thank you for your investigation of this issue, Bryan. I've been a bit busy so I haven't had time to look at this more, but it sounds like you're doing a great job of narrowing this down.
One suggestion (without having actually run done any investigation of my own): does this problem reproduce if you set "derby.optimizer.noTimeout=true"? If it does, then you could set that property and then step through the code to see if there are any obvious differences in the optimization paths for the two statements. I usually start by putting a break point in the "getNextPermutation()" method of OptimizerImpl and then doing a quick step through of that method for the different queries, to see if any there are any obvious differences in code flows (esp. number of times the method is called). Another technique I use (which is a tad less scientific) is to put System.outs in the "rememberBestCost()" method of OptimizerImpl to print out: a) the join order that is being saved as "best", and b) the cost of that join order. Then add a similar System.out to the "modifyAccessPaths()" method of OptimizerImpl and print out the final value of bestJoinOrder, along with the final cost estimate. If you do that and then run the two different queries, you can often see a difference in plans/costs that helps track down the issue at hand... Hi Army, thanks for the ideas and the encouragement! The problem does appear to
reproduce with derby.optimizer.noTimeout=true, so I'll try some of your suggestions and see how far I get. The query that finds 1 row prints:
Modifying access paths to use join order: 2,1,0, with cost Level2CostEstimateImpl: at 602878, cost == 216.069408, rowCount == 0.25920000000000015, singleScanRowCount == 0.6000000000000001 The query that finds 0 rows prints: Modifying access paths to use join order: 2,0,1, with cost Level2CostEstimateImpl: at 17715159, cost == 186.67111680000002, rowCount == 0.1555200000000001, singleScanRowCount == 1.2000000000000002 I believe that optimizable 2 is table b4, optimizable 1 is the UNION view, and optimizable 0 is table b3, so these results match the other info I was seeing in the query plan dumps. Time to fire up the debugger... One thing that stands out here: estimated row count that is less than 1.0. This means that whenever the row counts are multiplied through, we'll actually estimate fewer rows than we should., which could throw off estimates. I've noticed this oddity a couple of times myself when working with the optimizer--most recently while looking at DERBY-1905.
I've always thought this was strange. I wonder if you put a check for this at the end of FromBaseTable.estimateCost()--ex. set anything with a row count less than 1 to be 1.0--what would happen? Could be a red herring, but should be easy enough to check... @@ -1832,6 +1832,12 @@ /* Put the base predicates back in the predicate list */ currentJoinStrategy.putBasePredicates(predList, baseTableRestrictionList); + +if (costEstimate.rowCount() < 1.0) +{ +// System.out.println("-=- Oops! " + costEstimate.rowCount() + " -- " + costEstimate.singleScanRowCount()); +costEstimate.setCost(costEstimate.getEstimatedCost(), 1.0d, costEstimate.singleScanRowCount()); +} return costEstimate; } Proverbial (and potentialy irrelevant) 2c. With those lines in FromBaseTable, the cost estimates are different, but the bottom line is the same:
Good query plan: Modifying access paths to use join order: 2,1,0 for 3 optimizables, with cost Level2CostEstimateImpl: at 7363880, cost == 401.716, rowCount == 7.2, singleScanRowCount == 0.6000000000000001 Bad query plan: Modifying access paths to use join order: 2,0,1 for 3 optimizables, with cost Level2CostEstimateImpl: at 18796902, cost == 401.616, rowCount == 12.0, singleScanRowCount == 12.0 I'll look into this more later; I've got a few other things to do first. For both the "good" and "bad" queries, the optimizer examines the complete join orders:
- 0,1,2 - 2,0,1 - 2,1,0 In the "good" case, the optimizer assigns the costs: - 1405 - 1405 - 401 And thus selects the join order 2,1,0 In the "bad" case, the optimizer assigns the costs: - 1405 - 401 - 401 And thus selects the join order 2,0,1 This leaves me with two "next questions" to study: 1) Why does the optimizer compute a different cost for the join order 2,0,1 in the two cases? 2) If the only problem is that the optimizer gets the costs wrong, it might pick a poorer query plan, but it still seems like it should pick a *valid* query plan, right? So, why, in the "bad" case where the optimizer uses the join order 2,0,1, does it actually get *incorrect results*? That is, is 2,0,1 actually a valid plan for executing this query? If so, why do we get wrong results? If it's *not* valid, why does the optimizer proceed to fully cost it out, rather than dismissing it during optimization? The optimizer also examines the join orders [0,2,1], [1,0,2], and [1,2,0], but all of those have cost 401 also. I'm just noting this for completeness, I don't think it's relevant to what's going on with this issue.
In the "good" case, the OptimizerImpl's predicateList contains two predicates:
[0]: bvw.c1 = b4.c7 [1]: b3.c8 = bvw.c5 which seems all well and good. But in the "bad" case, the predicateList contains: [0]: b3.c8=b4.c7 [1]: bvw2.c1 = b4.c7 [2]: b3.c8 = bvw2.c5 and while the second and third predicates are fine, that first one is completely bogus. So now I will try to figure out where that extra erroneous predicate is coming from. The erroneous predicate is coming from PredicateList.joinClauseTransitiveClosure,
called by SelectNode.performTransitiveClosure, called by SelectNode.preprocess. PredicateList.joinClauseTransitiveClosure looks meaty, so I'll take a break for a bit and investigate this code anon. joinClauseTransitiveClosure is examining the two predicates:
b3.c8 = bvw2.c5 and bvw2.c1 = b4.c7 However, in the second predicate, the ColumnReference for bvw2.c1 has columnNumber = 5, so joinClauseTransitiveClosure, which matches up the predicates based on tableNumber and columnNumber, thinks that this is the predicate bvw2.c5 = b4.c7 And, therefore, it computes the transitive closure to include the predicate b3.c8 = b4.c7 So next I need to figure out why the columnNumber is wrong in the leftOperand ColumnReference in the BinaryRelationalOperatorNode for bvw2.c1 = b4.c7 In ColumnReference.bindExpression, we fetch and set the column's columnNumber
from the ResultColumn "matchingRC", with the comment: /* Set the columnNumber from the base table. * Useful for optimizer and generation. */ columnNumber = matchingRC.getColumnPosition(); At this point, the ResultColumn instance for column C1 from view BVW2 has a null ColumnDescriptor, but has a virtualColumnId of 5. So next I need to figure out why the ResultColumn instance for BVw2.C1 is getting a virtualColumnId of 5. The ResultColumn instance for BVW2.C1 with virtualColumnId = 5
comes from a 9-element ResultColumnList which is built for the intermediate "table" that is built by JoinNode to represent the joining of B3 with BVW2. This ResultColumnList has the following columns: Name VirtualColumnId b3.c8 1 b3.c9 2 b3.c5 3 b3.c6 4 bvw2.c1 5 bvw2.c2 6 bvw2.c3 7 bvw2.c4 8 bvw2.c5 9 In this context, the use of virtualColumnId=5 to represent BVW2.C1 makes complete sense. So it appears that I can now finally state the problem with some clarity: The problem involves an inconsistency in the handling of ColumnReference objects during PredicateList.joinClauseTransitiveClosure. In this code, we are working with ColumnReference objects whose columnNumber values refer to intermediate join tables, but the code compares those ColumnReference objects against each other as though the columnNumber values applied to the base tables. Note that in the "good" case, the code also mixes join-table-based column numbers
and base-table-based column numbers, but since the columns in the view are in a different order, it doesn't accidentally see a match where the other case did. That is, when the "good" case is being examined in PredicateList.joinClauseTransitiveClosure, it sees b3.c8 = bvw.c5 and bvw.c1 = b4.c7 but in this case the ColumnReference for bvw.c1 is 6, because it was drawn from the 9-element ResultColumnList: [b3.c8,b3.c9,b3.c5,b3.c6,bvw.c5,bvw.c1,bvw.c2,bvw.c3,bvw.c4] So the transitive closure code does not think that bvw.c5 and bvw.c1 refer to the same column, and thus does not "close" the two predicates together. This investigation and analysis has been fantastic, Bryan. I don't have anything to add--it sounds like you've nailed down the issue and have a clear statement of the problem. That's excellent news.
I like the regular updates as you make progress--they are very helpful and also demonstrate the kind of steps developers in general can take to track down problems like this. Thanks a ton for your continued work on this! Thanks much for the encouragement and kind words Army! They are much appreciated.
My observation for tonight is as follows: *First*, we have the following: CursorNode.bindStatement calls SelectNode.bindResultColumns calls FromList.bindResultColumns calls JoinNode.bindResultColumns calls JoinNode.deferredBindExpressions Then, *later*, we have: CursorNode.optimizeStatement calls DMLStatementNode.optimizeStatement calls SelectNode.preprocess calls SelectNode.performTransitiveClosure calls PredicateList.joinClauseTransitiveClosure I'm wondering whether it would be possible to move the call to SelectNode.performTransitiveClosure out of SelectNode.preprocess and into the very start of SelectNode.bindResultColumns. That is, do the transitive closure processing earlier, during bind processing, rather than at the start of optimizing the statement. I'll give this a try and see what my results are... Moving the transitive closure processing earlier fixes this test case, so that
seems promising. I tried running the entire test suite and got about a dozen test failures: - lang/grantRevokeDDL.sql - lang/outerjoin.sql - lang/subqueryFlattening.sql - lang/wisconsin.java - lang/GroupByExpressionTest - lang/DistinctTest - derbynet/csPrepStmt.java - derbynet/testconnection.java - SecurityPolicyReloadingTest - lang/UpdateCursorTest I'm not sure that all of these are related to this change, but the "lang" suite failures are definitely connected. Still, the overall results seem promising, so I think I shall analyze the test differences in more detail and try to understand how the change provoked those failures. I'll attach this first try patch to share the ideas. It is not for commit. I see the csPrepStmt test was just changed, so I think I can probably
remove that from the list of problems caused by my change. I'll focus on the lang suite test diffs to start. Thanks Myrna for helping me sort out those test failures somewhat. Now I think I am down to:
- lang/outerjoin.sql - lang/subqueryFlattening.sql - lang/wisconsin.java - lang/DistinctTest - lang/UpdateCursorTest The first 3 tests contain query plans in their output, and the query plans have changed as a result of my patch. This requires some study, but it is possible that the new results are correct. DistinctTest and UpdateCursorTest appear to be getting assertion failures. I'll continue to study these failures and see what I can learn. I analyzed the DistinctTest failure in more detail, and wrote up my notes in the attached 'DistinctTestNotes.txt'. I'm still not yet sure whether this test failure indicates a problem in my proposed patch, or a weakness in the test.
The UpdateCursorTest failure appears to be in a test (testVirtualMemoryHeap) which is intimately tied to a particular query execution plan, but I'm not sure how the test is forcing that particular query plan to be chosen. I sent a message about that test to the derby-dev list to see if I could get some help.
Just a knee-jerk reaction to the DistinctTest failure: I wonder if the transitive closure is leading to alteration or removal of predicates that break the conditions described here:
http://db.apache.org/derby/docs/dev/tuning/ctuntransform16279.html For example, there is some talk about how Derby will remove equijoin predicates as part of transitive closure at the bottom of this page: http://db.apache.org/derby/docs/dev/tuning/rtuntransform866587.html Maybe one of the equijoin predicates that is removed breaks the condition for DISTINCT elimination? As I said, that's just knee-jerk feedback. Apologies if it proves irrelevant... It's interesting that in all three of the outerjoin.sql, subqueryFlattening, andDistinctTest
test failures, the queries that get different query plans have some similarities: for one thing, the failed query involves a case where there is both a join condition and a simple restriction condition on the same column. Here's the query from outerjoin.sql which gets a different query plan with the patch: select * from tt1 left outer join tt2 on tt1.c1 = tt2.c2 where tt2.c2 = 3 And here's the query from DistinctTest.java which gets a different query plan with the patch: select distinct a.c1 from two a, four b where a.c1 = b.c1 and a.c3 = b.c3 and b.c3 = 1; And here's the query from subqueryFlattening.sql which gets a different query plan with the patch: select * from outer1 o where exists (select * from idx2 i, idx1 where o.c1 = i.c1 and i.c2 = idx1.c1 and i.c2 = 1) Note that in all 3 cases, the column that is being compared against a constant value is also part of the join criteria. I thought that the transitive closure processing was only looking at join conditions, not at simple restriction conditions ("tt2.c2 = 3"), but perhaps I was wrong. > I thought that the transitive closure processing was only looking at join conditions,
> not at simple restriction conditions ("tt2.c2 = 3"), but perhaps I was wrong. There are two types of transitive closure: http://db.apache.org/derby/docs/dev/tuning/ctuntransform37032.html The first is join conditions, as you've noticed. The second is "search conditions" and does indeed take into account predicates like "tt2.c2 = 3", so maybe this is why you are seeing the different plans? Thanks Army and Oystein for the help! The status now seems to be that:
- the UpdateCursor test is unrelated - the other 4 failures all have the property that - with the patch, we get the right results for all the queries in the test suite - but some of the query plans have changed I will study the docs that Army noted, and look closely into the query plans before and after the patch, and see what I can learn. My firstTryPatch is just flat-out wrong. The patch attempted to fix the problem by
move the transitive closure processing from SelectNode.preprocess to SelectNode.bindResultColumns. However, the transitive closure processing tries to process the predicates in the predicate list held in SelectNode.wherePredicates, and that field isn't set up until mid-way through SelectNode.preprocess, by the code: /* Pull apart the expression trees */ if (whereClause != null) { wherePredicates.pullExpressions(numTables, whereClause); whereClause = null; } So until that code has been run, we can't do the transitive closure processing. Now I am contemplating three alternate ideas: 1) move the code which pulls the expressions out of the whereClause into the wherePredicates list into SelectNode.bindResultColumns, though that seems increasingly risky to me. 2) delay the call to JoinNode.deferredBindExpressions, which is where the column renumbering is done, until after the transitive closure work has occurred (that is, instead of moving transitive closure processing earlier, move column renumbering later) 3) rework the code in the transitive closure processing so that it isn't confused by the column renumbering, presumably by working with the original column references, not by the renumbered bound references. So, it's somewhat back to the drawing board, but at least I understand why the original patch proposal failed, which is good progress. I tried various experiments toward alternates #1 and #2 from the previous
comment, and only succeeding in creating non-operational code. These sections of the SelectNode code seem delicate, and the code can't be re-arranged easily; it all seems to work in exactly this order. Now I'm thinking I should investigate the transitive closure code more carefully, to see if there is a way to revise it so that it isn't so sensitive to the column re-ordering but rather can recognize transitive closures based on the original expressions. > Now I'm thinking I should investigate the transitive closure code more carefully, to see
> if there is a way to revise it so that it isn't so sensitive to the column re-ordering ... Hi Bryan, I noticed that it's been a while since you posted anything to this issue. Are you still actively looking into this one? Have you had any luck with your attempts to revise the transitive closure code? I started looking at this issue and, using Bryan's excellent analysis of the problem thus far, was able to track down what I believe is the underlying cause. After reading all of Bryan's comments (thanks Bryan!), I found myself wondering why it was that "bvw2.c1" in the predicate "bvw2.c1 = b4.c7" had a column number w.r.t to the JoinNode while "bvw.c5" in the predicate "b3.c8 = bvw2.c5" had a column number w.r.t to the view (BVW) itself. Given that both column references have the same tableNumber (by the time we get to transitive closure computation), I was expecting that their column numbers would be w.r.t to the same object--i.e. to "BVW". But that did not appear to be the case (as Bryan discovered).
So I did some tracing through the "preprocess()" code and was able to come up with what I think is a reasonable explanation of the problem. It also agrees with (and builds on) all of Bryan's previous findings (I think). I'm attaching the writeup as "d2526_v1.html". If this explanation is correct, the fix is a two-liner--which I'm attaching as d2526_v1.patch. I ran derbyall and suites.All with this change and there were no failures. I also verified that both of the queries in derby-2526 now return the same result (1 row). I have not added any tests for this problem yet, as I'm just looking for validation of the explanation and proposed fix from anyone who might have the time to read it over. The fact that derbyall/suites.All both ran cleanly is encouraging. Many many thanks again to Bryan for all of his effort and analysis. Regardless of whether or not the attached writeup is an accurate assessment of the problem, Bryan's comments have been very helpful. Army, thank you very much! Your description is very clear. It describes exactly the behavior that I was seeing, and your analysis of the remapping feels spot-on. I agree with your fix, and your patch fixes the problem in my environment, too.
Nothing like having another pair of eyes on things! Very good work. I re-assigned the issue to you, assuming that you'll take it the rest of the way at this point. Is that OK? Thank you for reading the write-up and verifying the fix, Bryan. I'm attaching a second version of the patch, d2526_v2.patch, which adds some relevant test cases to the lang/joins.sql test. That test is already run as part of JUnit via the lang.LangScripts class, so no additional JUnit work is required (I don't think).
I re-ran derbyall and suites.All on SUSE Linux with ibm142 as a sanity check, and everything passed. I also verified that if the engine change is backed out, lang/joins.sql will now fail due to incorrect results. Thanks again to Bryan for his extensive analysis of the problem. The _v2 patch is, I think, ready for commit. Patch2 looks great! The test cases are clear and make sense to me.
The patch applies without problems and ran as expected: failed without the code change, and passes with it. +1 for commit from me. Thanks, as always, for your willingness to review, Bryan. I committed the _v2 patch with svn # 542867:
URL: http://svn.apache.org/viewvc?view=rev&rev=542867 Marking issue as resolved for 10.3, though I think it still needs a release note... Release note looks great! Thanks.
A user hit this bug in 10.1 but the symptom was different. Instead of wrong results, they got a message: The repro is attached as TestDerby.java
Exception in thread "main" ERROR 42818: Comparisons between 'CHAR () FOR BIT DATA' and 'CHAR' are not supported. at org.apache.derby.iapi.error.StandardException.newException(Unkno wn Source) at org.apache.derby.impl.sql.compile.BinaryComparisonOperatorNode.b indComparisonOperator (Unknown Source) at org.apache.derby.impl.sql.compile.PredicateList.searchClauseTran sitiveClosure(Unknown Source) at org.apache.derby.impl.sql.compile.SelectNode.performTransitiveCl osure(Unknown Source) at org.apache.derby.impl.sql.compile.SelectNode.preprocess(Unknown Source) at org.apache.derby.impl.sql.compile.DMLStatementNode.optimize(Unkn own Source) at org.apache.derby.impl.sql.compile.CursorNode.optimize(Unknown Source) at org.apache.derby.impl.sql.GenericStatement.prepMinion(Unknown Source) at org.apache.derby.impl.sql.GenericStatement.prepare(Unknown Source) at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext. prepareInternalStatem ent(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.executeQuery(Unknown Source) I'll look into backporting this fix to 10.2 and 10.1 > I'll look into backporting this fix to 10.2 and 10.1
For what it's worth, there was another bug in a similar area that also affected 10.1 and 10.2, namely, "It turns out that the introduction of these invalid predicates stems from a problem that is almost identical to the one described in to that issue, since pretty much everything in that document applies to this issue, as well." So I wonder if it'd be worth it to port Thanks Army. I'll look into backporting
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||