|
Hallo!
This bug does affect 10.1.2.1. Even 10.1.1.0 is affected. I tested with some of the Derby releases and all have the exactly the same (wrong) resultset depending on the join order: Statement10.1.1.0 - (208786)-j1.4.2_10 Statement10.1.2.1 - (330608)-j1.4.2_10 Statement10.1.2.2 - (349064)-j1.4.2_10 Statement10.1.2.2 - (370021)-j1.4.2_10 Statement10.2.2.0 - (485682)-j1.4.2_10 Statement10.3.1.4 - (561794)-j1.4.2_10 In addition I mixed our Joins to all possible combinations and found out e.g. The join order NA0-P00-NA4-N01-N03-NE1-N04-N08-NB0-Q00 gives 924 rows, but NA0-P00-NA4-N01-N03-NE1-N04-N08-Q00-NB0 only 0 rows. (Only last to tables are switched) Comparing the statistics the sucessful SQL with Q00 as right join (Rows seen = 924) Right result set: Hash Scan ResultSet for ESVQ00 at read committed isolation level using instantaneous share table locking: Number of opens = 924 Hash table size = 60 Hash key is column number 0 Rows seen = 924 Rows filtered = 0 constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 next time in milliseconds/row = 0 scan information: Bit set of columns fetched={0, 6} Number of columns fetched=2 Number of pages visited=3 Number of rows qualified=60 Number of rows visited=60 but on the other hand (Rows seen = 0) with NB0 as most right join: Right result set: Hash Scan ResultSet for ESVNB0 at read committed isolation level using instantaneous share table locking: Number of opens = 924 Hash table size = 0 Hash keys are column numbers (1,2) Rows seen = 0 Rows filtered = 0 constructor time (milliseconds) = 0 open time (milliseconds) = 0 next time (milliseconds) = 0 close time (milliseconds) = 0 scan information: Bit set of columns fetched={0, 1, 2, 3, 6, 8} Number of columns fetched=6 Number of pages visited=8 Number of rows qualified=0 Number of rows visited=240 Maybe that helps. I attach the complete results from all versions I tested as derby-02-search-joins2.zip I did some tracing through the execution-time result set processing for "new-style-sql.txt" and while I still do not understand what is going on, I *think* that part of the problem is that the two predicates for the outer-most query, i.e.:
WHERE O4Work.ESVN01.NU_BUY_CPY = 99 AND O4Work.ESVNA0.NU_CPY_GRP = 0 are both being applied as "scan qualifiers" to the initial scans for ESVNB0 and ESVNE1, in addition to being applied to their respective target tables. That is to say, "NU_BUY_CPY = 99" translates into a qualifier on the "first column" in ESVN01 (because NU_BUY_CPY is the first column in that table); but further up the result set tree, that same qualifier is being applied to the "first column" of ESVNB0 and (still later) to the "first column" of ESVNE1--which, if that's what is actually happening, would be wrong... Similarly, "NU_CPY_GRP = 0" translates into a qualifier on the "fourth column" of ESVNA0 (because NU_CPY_GRP is the fourth column in that table); but later that same qualifier is being applied to the "fourth column" of ESVNB0 and (still later) the "fourth column" of ESVNE1. The reason the query returns the correct results (924 rows) if there is only one predicate, "ESVN01.NU_BUY_CPY = 99", is because, as luck would have it, the first column in both ESVNB0 and ESVNE1 is *also* NU_BUY_CPY, and it, like ESVN01, has the value 99 for all rows. So even though the predicate is (apparently) mis-applied, it doesn't affect the results because it doesn't actually filter any rows. But the second predicate, "ESVNA0.NU_CPY_GRP = 0", does affect the results because the fourth column of ESVNB0 and ESVNE1 is "NU_MT1_CPY", and that column does not have any rows with value "0". So when the predicate is mis-applied to ESVNB0, it eliminates all of ESVNB0's rows, thus causing the query to return no results. In an attempt to validate this somewhat bizarre theory, I updated a few of the rows in ESVNB0 to have the value "0" for column NU_MT1_CPY: update o4work.esvnb0 set nu_mt1_cpy = 0 where nu_mt1_cpy = 2; When I re-ran "new-style-sql.txt", the query still returned 0 rows. Further tracing showed that we were now getting some rows back from ESVNB0, but then we were joining them with ESVNE1--and since ESVNE1 still didn't have any NU_MT1_CPY columns with value 0, we were filtering out all of ESVNE1's rows, leading to an empty result set. So I then updated ESVNE1 in a way similar to ESVNB0, ie: update o4work.esvne1 set nu_mt1_cpy = 0 where nu_mt1_cpy = 2; When I did that, "new-style-sql.txt" returned 906 rows. This seems to suggest (though it's far from conclusive) that the NU_CPY_GRP qualifier is being mis-applied in the hash scans for ESVNB0 and ESNB1. A look at the query plans for "new-style-sql_without-condition.txt" and "new-style-sql.txt" seems to suggest a similar thing. Note in particular the scan qualifiers for the hash scans on ESVNB0 and ESVNE1. For "new-style-sql_without-condition", we see a single scan qualifier on the first column: scan qualifiers: Column[0][0] Id: 0 Operator: = Ordered nulls: false Unknown return value: false Negate comparison result: false This corresponds to "NU_BUY_CPY = 99" (verified by tracing, where the qualifier value is '99'), which is always true for all rows of ESVNB0 and ESVNE1. But for "new-style-sql.txt", we see *two* scan qualifiers: scan qualifiers: Column[0][0] Id: 0 Operator: = Ordered nulls: false Unknown return value: false Negate comparison result: false Column[0][1] Id: 3 Operator: = Ordered nulls: false Unknown return value: false Negate comparison result: false The second qualifier is on the fourth column and has a value of '0', which appears to be coming from the "NU_CPY_GRP = 0" predicate. That qualifier is eliminating all rows from the scans (for both ESVNB0 and ESVNE1), leading to zero results. Given this potentially erroneus theory, it appears that somehow the qualifiers for the outer-most WHERE clause are being mapped incorrectly onto ESVNB0 and ESVNE1. Maybe this is a transitive closure computation problem, maybe it's something else entirely. At this point I don't know, I'm just dumping what I've seen thus far... I did some further tracing/investigation of this issue and confirmed that search transitive closure is indeed adding incorrect predicates to the predicate list. In the case of "new-style-sql.txt", we end up with four additional predicates:
ESVNB0.NU_BUY_CPY = 99 ESVNE1.NU_BUY_CPY = 99 ESVNB0.NU_MT1_CPY = 0 ESVNE1.NU_MT1_CPY = 0 All of these predicates are incorrectly inferred from two predicates in the outermost WHERE clause, namely O4Work.ESVN01.NU_BUY_CPY = 99 AND O4Work.ESVNA0.NU_CPY_GRP = 0 It turns out that the introduction of these invalid predicates stems from a problem that is almost identical to the one described in In fact, the fix for that issue nearly solved this one, too. More specifically, the fix for /* It's not enough to just set the table number. Depending * on the original query specified and on whether or not * subquery flattening has occurred, it's possible that * the expression to which we're remapping has a different * result column list than the one to which we were mapped * before we got here. In that case we also need to update * the columnNumber to point to the correct column in "ft". * That's what the following line does. See * details. */ ResultColumn ftRC = ft.getResultColumns().getResultColumn(columnName); columnNumber = ftRC.getColumnPosition(); That code depends on the "getColumnPosition()" method of ResultColumn, which looks as follows: public int getColumnPosition() { if (columnDescriptor!=null) return columnDescriptor.getPosition(); else return virtualColumnId; } The document d2526_v1.html mentioned two things regarding this code: 1) if "columnDescriptor" is null, the method will return virtualColumnId; 2) columnDescriptor will *not* be null if the ResultColumn (ftRC in this case) points directly to a base table. In that case the method will return the position as stored in the column descriptor. Both of those statements are still true. But there's another fact about this code that 3) columnDescriptor may be NON-null *even if* ftRC points to a virtual column. In the case of "new-style-sql" we'll end up calling the above code for both sides of any join predicates that belong to JoinNodes which are flattened (again, see the document attached to In the above code, then, when we call: columnNumber = ftRC.getColumnPosition(); we'll see that columnDescriptor is non-null and thus columnNumber will become the column position w.r.t. the *base* table. But that's not quite correct: ftRC points to a virtual result column list (i.e. the RCL for HalfOuterJoinNode), so columnNumber has to reflect the correct position with respect to that virtual column list. In other words, we need to retrieve the *virtualColumnId*, not the base column position. When we mis-assign the columnNumber for the join predicates, we end up confusing the transitive closure code, which leads to the eventual addition of incorrect predicates to the WHERE list, and therefore we end up with missing rows. I made a slight tweak to the code added by - columnNumber = ftRC.getColumnPosition(); + /* Use the virtual column id if the ResultColumn's expression + * is a virtual column ( + */ + columnNumber = + (ftRC.getExpression() instanceof VirtualColumnNode) + ? ftRC.getVirtualColumnId() + : ftRC.getColumnPosition(); With this one-line change, "new-style-sql" returns the same results as "new-style-sql_without-condition". So that seems like a step in the right direction for this issue. Also with that change I ran the repro program, DerbySearchJoins.java, and all but two of the queries now return 924 rows--another good sign. The two queries that still do not return 924 rows are: old-style-sql.txt (returns 889 rows) new-style-sql-only-inner.txt (returns 2 rows) If it's true that those queries are supposed to return 924 rows, as well (Stefan, can you confirm that? Is that what DB2 shows, as well?), then further investigation is still needed before this issue can be resolved. But at least we're a step closer... I'm attaching a simplified repro for the problem that I just described, along with a patch that includes the one-line change and a corresponding test case for the nightlies. This patch is NOT for commit, though, as I have not yet run derbyall. I did run suites.All and saw 92 failures along with 24 errors, all of which were in upgrade tests. When I re-ran the suite without my changes to ColumnReference the same tests failed, so I'm hoping it's just an environment issue. I ran derbyall on Red Hat Linux with ibm15 and there were no failures.
I realized that the reason the UpgradeTests were failing over the weekend was that I was disconnected and didn't have old release jars on my machine. When I re-ran suites.All (via "ant junit-all") today all tests passed. So I'm re-attaching the changes as "d3023_v2.patch", which is identical to "d3023_notTested_v1.patch"--it just has a different name because the patch has now been tested. Note that committing the patch for this issue will lead to an ASSERTion failure in the repro query "new-style-sql" due to I'm *not* checking the "Patch Available" flag as I will be heading of town this evening and will be unable to handle follow-up comments until I return. But I thought I'd post the patch and the test results now in case someone else wants to drive this to commit in the interim. Also note: still waiting to here back from Stefan on whether "old-style-sql" and "new-style-sql_only-inner" are supposed to return 924 rows, as well. If so, then there's more work to be done here (d3023_v2.patch would be an incremental step in that case). No, old-style-sql and new-style-sql_only-inner are *not* expected to have 924 rows.
(In the old-style-sql is a "LEFT OUTER JOIN O4Work.ESVN07" increases duplicate rows and NU_COT_TYP = 1 decreases them. After removing NU_COT_TYP = 1 and adding a distinct I get 924 rows for old-style-sql ) (In the new-style-sql_only-inner only inner joins forces the less rows). I adapted the program to give more hint about what expected: Test old-style-sql ok: 924 rows Test old-style-sql-without-condition ok: 924 rows Test new-style-sql ERROR: Expected 924 rows but were 0 rows Test new-style-sql-only-inner ok: 2 rows Test new-style-sql_first-inner-joins ok: 924 rows Test new-style-sql_without-condition ok: 924 rows Here it is: DerbySearchJoins.java
Marking "Patch Available" as I am now available to respond to review comments/suggestions on the description of the problem and/or on the suggested d3023_v2.patch. From what I can tell, that patch resolves the issue and causes the various queries posted by Stefan to return the correct number of rows. Stefan, please correct me if I'm wrong...
I did not test the patch but if new-style-sql returns 924 rows it's solved for me.
Having heard no objections to the patch, and based on Stefan's feedback, I committed d3023_v2.patch with svn # 612504:
URL: http://svn.apache.org/viewvc?rev=612504&view=rev Marking the issue as resolved for 10.4. If the tinderbox tests run cleanly over the next couple of days, I will look at porting this change back to 10.3. Tinderbox runs on trunk ran for several days with no apparent fallout from the changes for this issue, so I ported back to 10.3 with a simple merge command:
svn merge -r 612503:612504 https://svn.apache.org/repos/asf/db/derby/code/trunk I ran derbyall and suites.All with ibm142 and there were no new failures. So I committed with svn # 614046: URL: http://svn.apache.org/viewvc?rev=614046&view=rev Updating fix-in to reflect the fact that this is now in 10.3. This issue has fix version 10.4 and is marked with either 'Release note needed' or 'Existing application impact', but does not have a releaseNote.html attached to it. Should it?
Attaching a release note for this issue. I had to use MS Word to create the file and as a result it doesn't pass the "ReleaseNoteReader" tool check--i.e. it's going to have to be "scrubbed" in order for the release note generator to work correctly. I apologize in advance, but I'm hoping someone else can do the scrubbing as it was not as straightforward as I had hoped it would be, and I don't really have time to figure it all out...
Thanks for the release note, Army. I have attached a scrubbed version which passes the checks performed by the ReleaseNoteReader.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
If so, then that's useful information:
DERBY-1681andDERBY-1633were regressions caused by changes forDERBY-805, and the first commit forDERBY-805(there were many) went into 10.2 with svn #381859. The firstDERBY-805changes were then ported back to the 10.1 branch with svn #396211. But that's *after* the 10.1.2.1 release was cut--10.1.2.1 was released at svn # 330608, per:http://db.apache.org/derby/derby_downloads.html#Archived+Official+Releases
So if this bug does actually affect 10.1.2.*1*, then we can say with some degree of certainty that this particular Jira is *not* a regression caused by
DERBY-805(nor any of theDERBY-805follow-up work). That's a good indication that the problem has been around for a while, which helps narrow down the searching field. It is quite possible that the problem resides in the code that was changed forDERBY-1681and/orDERBY-1633, but it would be good to know for sure whether or not those Jiras are actually the cause, or just different problems in the same area of code.I do not have time to try to run the repro against 10.1.2.1 right now, but if you could confirm that the problem does exist there, I do think that would be helpful...