|
1. copy the attached "viewer" folder to your derby folder
2. launch derby on port 9002 3. java -cp .;./derbyclient-10.2.2.0.jar SkipScan You will see the exception in the console. The exception will NOT be thrown if you modify the java file to hard-code those numbers (4 and 100) in the query or if you run the query from a sql console window. Also, if you add an index on dicom_send_requests.send_date and run the same program, the exception will not be thrown. If you turn on execution plan logging, you will see that the exception scenario has a different execution plan from all other scenarios. Thank you, Haolan! I was able to reproduce the error with your repro!
More data points:
- happens with embedded driver as well - happens when database is running in soft upgrade to 10.3 - happens in 10.3 after hard upgrade - happens when using ij/embedded with prepared statement thus: ij> connect 'jdbc:derby:viewer;user=view;password=view'; ij> set schema viewer; ij> prepare foo as 'select study_id, number_of_images from (select distinct st.study_id, st.number_of_images, dsr.priority, st.creation_datetime from dicom_send_requests dsr, studies st where dsr.send_date is null and dsr.workstation_id = ? and dsr.study_id = st.study_id and not exists ( select 1 from dispatcher_locks where dispatcher_locks.study_id = st.study_id and dispatcher_locks.workstation_id = ? and dispatcher_locks.dispatcher_id = ? )) temp'; ij> execute foo using 'values (4,4,100)'; ERROR XJ001: Java exception: ': java.lang.NullPointerException' Without prepared statement (direct execution) there is no error. I will leave this one the the execution experts ;-) Hi Dag, since you have it reproduced, can you re-post the stack trace with the actual line numbers, and maybe also post the line of source code where the NPE actually occurs? Thanks!
Sure, Bryan!
As far as I can the offending line is is NoPutResultSetImpl.java:392 at head of 10.3 branch, (svn 571351): 392: if (startPosition.getColumn(position + 1).isNull()) Uploading stack trace "d3033-sane-ij-session-10.3.1.5.txt" with sysinfo and ij session showing the repro with trace w/line numbers. Thanks Dag. This is an interesting line of code, and an interesting stack trace, particularly given the observation that when the original reporter added an index, the problem went away. I'm wondering if this problem is related to DERBY-2916 and DERBY-2775.
Does the query in the bug involve an OUTER JOIN? or contain an IS [NOT] NULL condition? And is this condition related to the index that was added which causes the problem to go away? At the point where TableScanResultSet.openCore calls NoPutResultSetImpl.skipScan, there is the following comment: // Check whether there are any comparisons with unordered nulls // on either the start or stop position. If there are, we can // (and must) skip the scan, because no rows can qualify It seems like DERBY-2775 may have caused us to take some new paths through the code involving "ordered null semantics", and the skipScan method is at the core of that code (see the DERBY-2916 comments for more discussion of this), so I have an intuition that this problem is related to those changes. For now, I'll link this issue to those issues, in the hope that my intuition is well-founded. Yes, you are correct. The query contains this predicate
"where dsr.send_date is null and .." and column "send_date" is the column for which the index was added to make the problem go away. I don't have a tool to open the rar file for this issue. I was wondering if someone could repackage it as a zip or jar file. I'd like to try it on 10.1 to see if it is a regression.
Repacked to test.zip.
I think this is a 10.2 database, so you may need to recreate the database with dblook and export import to get a repro for 10.1, not sure. Deleted first attchement. Reattaching without ASF license since original was without.
Looks like this is a regression in 10.2. The results with 10.1.2.1 are
> java SkipScan dfdfddd I will attach the 10.1 database which I migrated back using ddlutils v 10.1 database. Issue does not reproduce with 10.1.2.1
This regression occurred with revision 423754 from
A little more historical information on this issue.
1) Army suggested I try backing all the way back to revision 81858 before 2) I printed the plans before/after I was wondering in the skipScan code, should the column values of startPosition ever be null? In this case column[0] is null, but I was wondering if that is ever expected.
startPosition IndexRow (id=136) column DataValueDescriptor[1] (id=149) [0] null ncols 1 orderedNulls boolean[1] (id=150) [0] false Just to try to understand the impact I tried this small change, which interestingly caused the test case and suiltes.All to pass, but I don't really understand what it means for the column value to be null. I am guessing this is not the correct solution but rather there is some sort of problem in generating the column value for startPosition. Just want to check though in case there is a circumstance where the column value should be null and this change is ok. Index: java/engine/org/apache/derby/impl/sql/execute/NoPutResultSetImpl.java =================================================================== --- java/engine/org/apache/derby/impl/sql/execute/NoPutResultSetImpl.java (revision 574932) +++ java/engine/org/apache/derby/impl/sql/execute/NoPutResultSetImpl.java (working copy) @@ -388,7 +388,8 @@ { if (startKeyLonger) checkNullCols[cncLen++] = position + 1; - if (startPosition.getColumn(position + 1).isNull()) + DataValueDescriptor d = startPosition.getColumn(position +1); + if (d == null || d.isNull()) { returnValue = true; if (! startKeyLonger) Thanks for the help. I am certainly out of my element with this code, so appreciate any help or pointers anyone can offer. Kathey Hi Kathey,
My guess (and it's just a guess) is that, while startPosition itself may be null, it probably is not correct for a non-null startPosition to have startPosition[i] be null for any i. For what it's worth: startPosition in the above code comes from TableScanResultSet.startPosition, which is in turn loaded form TableScanResultSet.startKeyGetter. The startKeyGetter is, I believe, generated from PredicateList.generateStartKey(). So if you are able to trace through that method to see which predicate is causing the "null" to be generated, that might be a good starting point. Perhaps the short-circuit logic added for In BaseActivation I see
protected final DataValueDescriptor getColumnFromRow(int rsNumber, int colId) throws StandardException { if( row[rsNumber] == null) { /* This actually happens. NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values * into a cache. This fails in some deeply nested joins. See Beetle 4736 and 4880. */ return null; } return row[rsNumber].getColumn(colId); } I am not sure why prefetching invariant values would cause them to be null, but looks like a good lead. FYI: Here is the case from bug 4736 that was giving an NPE in getColumnFromRow before the fix to check for a null row was added. I tend to think that fix was wrong, that there is some root cause for the null row which needs to be addressed instead, and hopefully that is the same root cause for the
Kathey create table a (a1 int not null primary key, a2 int, a3 int, a4 int, a5 int, a6 int); create table b (b1 int not null primary key, b2 int, b3 int, b4 int, b5 int, b6 int); create table c (c1 int not null, c2 int, c3 int not null, c4 int, c5 int, c6 int); create table d (d1 int not null, d2 int, d3 int not null, d4 int, d5 int, d6 int); alter table c add primary key (c1,c3); alter table d add primary key (d1,d3); insert into a values (1,1,3,6,NULL,2),(2,3,2,4,2,2),(3,4,2,NULL,NULL,NULL), (4,NULL,4,2,5,2),(5,2,3,5,7,4),(7,1,4,2,3,4), (8,8,8,8,8,8),(6,7,3,2,3,4); insert into b values (6,7,2,3,NULL,1),(4,5,9,6,3,2),(1,4,2,NULL,NULL,NULL), (5,NULL,2,2,5,2),(3,2,3,3,1,4),(7,3,3,3,3,3),(9,3,3,3,3,3); insert into c values (3,7,7,3,NULL,1),(8,3,9,1,3,2),(1,4,1,NULL,NULL,NULL), (3,NULL,1,2,4,2),(2,2,5,3,2,4),(1,7,2,3,1,1),(3,8,4,2,4,6); insert into d values (1,7,2,3,NULL,3),(2,3,9,1,1,2),(2,2,2,NULL,3,2), (1,NULL,3,2,2,1),(2,2,5,3,2,3),(2,5,6,3,7,2); select a1,b1,c1,c3,d1,d3 from D join (A left outer join (B join C on b2=c2) on a1=b1) on d3=b3 and d1=a2; A1 |B1 |C1 |C3 |D1 |D3 ---------------------------------------------------------------- ------------------------------- 1 |1 |4 |2 |1 |2 ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression. ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression. at com.ibm.db2j.protocol.Database.Errors.StandardException.newExcep tion(StandardException.java:386) at com.ibm.db2j.protocol.Database.Errors.StandardException.unexpect edUserException(StandardException.java:567) at com.ibm.db2j.impl.BasicServices.ClassManager.Reflect.DirectCall. invoke(ReflectGeneratedClass.java:118) at com.ibm.db2j.impl.Database.Language.Execution.Generic.GenericQua lifier.getOrderable(GenericQualifier.java:87) at com.ibm.db2j.impl.Database.Language.Execution.Generic.HashScanRe sultSet.getNextRowCore(HashScanResultSet.java:477) at com.ibm.db2j.impl.Database.Language.Execution.Generic.NestedLoop JoinResultSet.getNextRowCore(NestedLoopJoinResultSet.java:116) at com.ibm.db2j.impl.Database.Language.Execution.Generic.ProjectRes trictResultSet.getNextRowCore(ProjectRestrictResultSet.java:249) at com.ibm.db2j.impl.Database.Language.Execution.Generic.BasicNoPut ResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:479) at com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.movePos ition(LocalResultSet.java:297) at com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.next(Lo calResultSet.java:251) at com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp layUtil.java:335) at com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp layUtil.java:272) at com.ibm.db2j.util.JDBCDisplayUtil.DisplayResults(JDBCDisplayUtil .java:261) at com.ibm.db2j.tools.ijImpl.utilMain.displayResult(utilMain.java:3 66) at com.ibm.db2j.tools.ijImpl.utilMain.doCatch(utilMain.java:419) at com.ibm.db2j.tools.ijImpl.utilMain.go(utilMain.java:295) at com.ibm.db2j.tools.ijImpl.Main.go(Main.java:196) at com.ibm.db2j.tools.ijImpl.Main.mainCore(Main.java:163) at com.ibm.db2j.tools.ijImpl40.Main.main(Main.java:49) at com.ibm.db2j.tools.ij.main(ij.java:57) ERROR XJ001: Java exception: ': java.lang.NullPointerException'. java.lang.NullPointerException at com.ibm.db2j.impl.Database.Language.Execution.Generic.BaseActiva tion.getColumnFromRow(BaseActivation.java:1234) at db2j.exe.ac601a400fx00f0xe017x7cc0x007f0000010012c.e8(Unknown Source) at com.ibm.db2j.impl.BasicServices.ClassManager.Reflect.DirectCall. invoke(ReflectGeneratedClass.java:109) at com.ibm.db2j.impl.Database.Language.Execution.Generic.GenericQua lifier.getOrderable(GenericQualifier.java:87) at com.ibm.db2j.impl.Database.Language.Execution.Generic.HashScanRe sultSet.getNextRowCore(HashScanResultSet.java:477) at com.ibm.db2j.impl.Database.Language.Execution.Generic.NestedLoop JoinResultSet.getNextRowCore(NestedLoopJoinResultSet.java:116) at com.ibm.db2j.impl.Database.Language.Execution.Generic.ProjectRes trictResultSet.getNextRowCore(ProjectRestrictResultSet.java:249) at com.ibm.db2j.impl.Database.Language.Execution.Generic.BasicNoPut ResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:479) at com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.movePos ition(LocalResultSet.java:297) at com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.next(Lo calResultSet.java:251) at com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp layUtil.java:335) at com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp layUtil.java:272) at com.ibm.db2j.util.JDBCDisplayUtil.DisplayResults(JDBCDisplayUtil .java:261) at com.ibm.db2j.tools.ijImpl.utilMain.displayResult(utilMain.java:3 66) at com.ibm.db2j.tools.ijImpl.utilMain.doCatch(utilMain.java:419) at com.ibm.db2j.tools.ijImpl.utilMain.go(utilMain.java:295) at com.ibm.db2j.tools.ijImpl.Main.go(Main.java:196) at com.ibm.db2j.tools.ijImpl.Main.mainCore(Main.java:163) at com.ibm.db2j.tools.ijImpl40.Main.main(Main.java:49) at com.ibm.db2j.tools.ij.main(ij.java:57) ij> Hi Kathey, thanks for the info about bug 4736.
I took your pasted code and added 4 drop table statements to the front of it so I could run it repeatedly; it's attached as bug4736.sql. I ran that code repeatedly with my debugger set with a breakpoint in BaseActivation.getColumnFromRow() with a breakpoint set when row[rsNumber] was null (i.e., inside the "if" that you noted). Unfortunately, I never hit the breakpoint. I also tried removing that "if" statement from getColumnFromRow(), to see if the bug4736.sql would then provoke a NullPointerException, and that didn't happen either. So, whatever was going on with that "if" test, the bug4736.sql script no longer drives the conditions which cause row[rsNumber] to be null. I think you've got a good theory, though, and I'll continue to play with it, to try to understand that "if" statement in getColumnFromRow() better. I removed the mysterious "if" statement from BaseActivation.getColumnFromRow(), and ran
all the regression tests. There were two failures, but I'm not sure if they were due to this change or not. I'll investigate them further. I did *not* get any NullPointerException messages, which is what I was sort of expecting to see. So if removing the "if" statement from BaseActivation.java fixes this problem, it apparently isn't going to cause a whole lot of other problems. I think the next step is to investigate whether removing the "if" statement makes the repro script pass. I played around with the skipScan repro case and confirmed Kathey's suspicion about
BaseActivation.getColumnFromRow, as follows: - with the current trunk, the repro case gives me the expected NPE in skipScan. - with the "if" statement removed from BaseActivation.getColumnFromRow, the repro case gives me a NPE which matches the NPE that Kathey posted from the old bug 4736. So this reproduction case definitely takes the code through a similar code path as that of bug 4736. Here's the stack trace I get with the current trunk with the "if" removed from getColumnFromRow: 2007-09-18 00:18:51.793 GMT Thread[DRDAConnThread_2,5,main] (XID = 352997), (SESSIONID = 0), (DATABASE = viewer), (DRDAID = NF000001.B54B-810083792898537651{1}), Failed Statement is: select study_id, number_of_images from (select distinct st.study_id, st.number_of_images, dsr.priority, st.creation_datetime from dicom_send_requests dsr, studies st where dsr.send_date is null and dsr.workstation_id = ? and dsr.study_id = st.study_id and not exists ( select 1 from dispatcher_locks where dispatcher_locks.study_id = st.study_id and dispatcher_locks.workstation_id = ? and dispatcher_locks.dispatcher_id = ? ) ) temp with 3 parameters begin parameter #1: 4 :end parameter begin parameter #2: 4 :end parameter begin parameter #3: 100 :end parameter ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression. at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:294) at org.apache.derby.iapi.error.StandardException.unexpectedUserException(StandardException.java:554) at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:164) at org.apache.derby.impl.sql.execute.TableScanResultSet.openCore(TableScanResultSet.java:258) at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.openCore(IndexRowToBaseRowResultSet.java:225) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168) at org.apache.derby.impl.sql.execute.JoinResultSet.openRight(JoinResultSet.java:272) at org.apache.derby.impl.sql.execute.JoinResultSet.openCore(JoinResultSet.java:151) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168) at org.apache.derby.impl.sql.execute.SortResultSet.openCore(SortResultSet.java:248) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168) at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.open(BasicNoPutResultSetImpl.java:248) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:370) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1225) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1649) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1304) at org.apache.derby.impl.drda.DRDAStatement.execute(DRDAStatement.java:666) at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:824) at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:275) Caused by: java.lang.NullPointerException at org.apache.derby.impl.sql.execute.BaseActivation.getColumnFromRow(BaseActivation.java:1317) at org.apache.derby.exe.ac601a400fx0115x15fbx3163xffff9ef86e6e1.e7(Unknown Source) at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:153) ... 16 more ============= begin nested exception, level (1) =========== java.lang.NullPointerException at org.apache.derby.impl.sql.execute.BaseActivation.getColumnFromRow(BaseActivation.java:1317) at org.apache.derby.exe.ac601a400fx0115x15fbx3163xffff9ef86e6e1.e7(Unknown Source) at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:153) at org.apache.derby.impl.sql.execute.TableScanResultSet.openCore(TableScanResultSet.java:258) at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.openCore(IndexRowToBaseRowResultSet.java:225) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168) at org.apache.derby.impl.sql.execute.JoinResultSet.openRight(JoinResultSet.java:272) at org.apache.derby.impl.sql.execute.JoinResultSet.openCore(JoinResultSet.java:151) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168) at org.apache.derby.impl.sql.execute.SortResultSet.openCore(SortResultSet.java:248) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168) at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.open(BasicNoPutResultSetImpl.java:248) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:370) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1225) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1649) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1304) at org.apache.derby.impl.drda.DRDAStatement.execute(DRDAStatement.java:666) at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:824) at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:275) > I played around with the skipScan repro case and confirmed Kathey's suspicion
> about BaseActivation.getColumnFromRow Thank you for your continued investigation of this issue, Bryan. It's great to have an idea of the codepath in question here. Have you had the time/inclination to look into the cause of the null value in row[rsNumber]? I realize that's easy to say and quite hard to do, but I was just wondering if you've looked at it, or if you have the time and plans to do so? In particular, I'm referring to the code comment that Kathey mentioned: "NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values into a cache. This fails in some deeply nested joins." I think the core question here is *why* does "this fail in some deeply nested joins"? That sounds awfully suspicious. It seems that either a) the "failure" is acceptable and the comments need to be updated, in which case Kathey's "quick fix" of checking for the null value would be appropriate; or b) the "failure" is a bug and in order to solve this issue, that failure should (ideally) be fixed... Would you agree? Hi Army, thanks for the feedback! I marked the issue as assigned to me, because I'm trying to understand it, but progress is slow.
I believe that the null value in row[rsNumber] is coming from a request to fetch column 1 of result set 5. Here's the original query for context: select study_id, number_of_images from (select distinct st.study_id, st.number_of_images, dsr.priority, st.creation_datetime from dicom_send_requests dsr, studies st where dsr.send_date is null and dsr.workstation_id = ? and dsr.study_id = st.study_id and not exists ( select 1 from dispatcher_locks where dispatcher_locks.study_id = st.study_id and dispatcher_locks.workstation_id = ? and dispatcher_locks.dispatcher_id = ? )) temp Result set 5, I believe, is the "select 1 from dispatcher_locks" nested query, and column 1 of that result set is dispatcher_locks.study_id. The null value occurs at the very beginning of the first time that we are opening a TableScanResultSet. We are setting up a NestedLoopJoinResultSet, and the right side of that join is a index-to-base probe, and so we are trying to set the start position for a particular "probe" into the index for a particular study_id value. sameStartStopPosition is true, as the join on study_id is a simple equality join. The code in question, near the top of TableScanResultSet.openCore(), is: if (startKeyGetter != null) { startPosition = (ExecIndexRow) startKeyGetter.invoke(activation); if (sameStartStopPosition) { stopPosition = startPosition; } } if (stopKeyGetter != null) { stopPosition = (ExecIndexRow) stopKeyGetter.invoke(activation); } The TableScanResultSet instance that we are processing is for an index-to-base probe of the dicom_send_requests table via the sendreq_study_id_fk secondary index. I'm thinking that, although the external form of the SQL SELECT statement is: - dicom_send_requests joins to studies on study_id, and - studies joins to dispatcher_locks on study_id the optimizer may have re-arranged this and selected a different join order so that the actual query is more like: - dispatcher_locks joins to dicom_send_requests on study_id, and - dicom_send_requests joins to studies on study_id And then, perhaps because the dispatcher_locks query is a "select 1" query, the execution of the query against dispatcher_locks hasn't fetched any actual data, and so the column is coming back null. So far, however, I haven't seen any code path that is going through NoPutResultSetImpl.clearOrderableCache, so I don't any information to share regarding the code comment that Kathey observed. > I haven't seen any code path that is going through
> NoPutResultSetImpl.clearOrderableCache, so I don't > have any information to share regarding the code > comment [...] It's possible that the comment is just a red herring--if clearOrderableCache() is not in play, then there's probably no point going further down that road. The information that you provided above certainly has potential--great investigation. > the optimizer may have re-arranged this To add to this: I think that for the optimizer to re-arrange the join order between dispatcher_locks, dicom_send_requests, and studies, all of those result sets must (I *think*?) appear in the same compile-time FromList. That means that the NOT EXISTS subquery will have to be "flattened" into the "select distinct st.study_id, ..." query. So it may be useful to look at SubqueryNode.preprocess() to see a) if the NOT EXISTS subquery is in fact being flattened, and b) if the column references that result from that flattening are set up correctly. If the NOT EXISTS subquery is in fact being flattened, it might be good to stop the debugger just after "modification of access paths" has occured on the relevant FromList (the one containing dispatcher_locks, dicom_send_requests, and studies) to see what join order (and join strategies) the optimizer has chosen. Perhaps that'll provide some insight as to whether or not the optimizer re-arrangement is in play here...? Of course I'm just offering top-of-my-head suggestions; feel free to ignore and continue with your current path of investigation, as that certainly seems to be yielding some very useful information... Thanks for the pointer to SubqueryNode.preprocess; we're definitely flattening this query.
The code goes through the flattenToExistsJoin() path on line 742 of SubqueryNode.java. And, to substantiate that more, it's clear from close examination of the data structures in the debugger that the overall join tree at execution time is: (Studies NestedLoop Dispatcher_locks) NestedLoop dicom_send_requests And at the point of the crash we've: - fetched the first row from Studies, - joined it against Dispatcher_locks (discovering that no matching row exists, so NOT EXISTS is true) - taken the resulting row and are ready to join it against dicom_send_requests. I'm wondering if the result set book-keeping gets messed up during flattening. At the point that we call getColumnFromRow() to fetch a value from result set 5, it looks to me like the information in row[3] is the correct information that we should be accessing. That is, when I back up the stack to the NestedLoopJoinResultSet frame, and look at the value of "leftRow", which is the row that (I think) should be providing the values that we are about to use for the start/stop position computations in the index scan positioning, the leftRow instance from that join matches up with row[3] in the BaseActivation generated instance. So I'll look at how the result_set numbering is handled during flattening. I've been trying to understand the structure and content of the generated code for
the activation, by using derby.debug.true=DumpClassFile and decompiling the resulting class files. I can see the generated startKeyGetter method for the TableScanResultSet which accesses dicom_send_requests via the sendreq_study_id_fk index: public Object e7() throws StandardException { e8.setColumn(1, (NumberDataValue)getColumnFromRow(5, 1)); return e8; } And I can see the generated code in fillResultSet which is setting up the TableScanResultSet to make this call: getResultSetFactory().getTableScanResultSet( this, 1457, 21, getMethod("e6"), 9, getMethod("e7"), 1, null, -1, true, e9, "DICOM_SEND_REQUESTS", null, "SENDREQ_STUDY_ID_FK", true, false, -1, -1, 6, false, 0, false, 1.2960000000000003D, 265.16399630200306D), All of the generated code looks reasonable and valid, though I don't have much experience reading the generated code. I'm now wondering if this is an order-of-operations thing. During the nested loop join, the following event should occur: - the current row from the outer side of the join should be fetched - the probe for the inner side of the join should be set up - the values from the outer side of the join should be passed to the index positioning logic on the inner side. It is the "startKeyGetter" method that does that last step, but I wonder if it's trying to access the start key values before the project/restrict processing has pushed the values of the current row up from the outer side of the join. \ I'm sorry, Army, I misled you. The code path *does* go through
NoPutResultSetImpl.clearOrderableCache, my mistake. I just wasn't stepping enough times in the debugger :) The comment in getColumnFromRow() is starting to make a little bit more sense now. It still seems like it's wrong for the positioned table scan to be opened without the position value having been set, though, so I'll keep plugging away reading code... :) Hmmm... Now I'm back to thinking that the result number management
has gone astray during the query flattening. Attached is generatedActivation.txt, a de-compilation of the generated activation for the query that fails. Also attached is queryPlanRS.txt, which is a poor-man's-graphical view of the query plan, focusing specifically on the result sets and their result set number (in parentheses next to the result set) I've also specifically called out the two startKeyGetter() methods, "e2" and "e7". E2, which is used for the index probe into DISPATCHER_LOCKS via DISPATCHER_LOCKS_SID_FK, accesses the start key value using result set #4 (see the E2 code in generatedActivation.txt), which seems correct to me. E7, however, which is used for the index probe into DICOM_SEND_REQUESTS via SENDREQ_STUDY_ID_FK, accesses the start key value using result set #5. But this seems wrong. I believe that result set #3 should have been used in the code which is generated into startKeyGetter method E7. I do believe I'm closer, but I'm sure not there yet. I wouldn't mind somebody checking this work, if anybody is interested. I'm thinking about the interaction between NOT EXISTS query flattening and equijoin transitive closure,
and I'm getting myself all confused. Here's what I've been thinking: 1) The join predicate which causes startKeyGetter method E7 to be generated is an equijoin between DISPATCHER_LOCKS.STUDY_ID and DICOM_SEND_REQUESTS.STUDY_ID. 2) This predicate was NOT expressed in the original query as an explicit condition in the WHERE clause; rather, it was generated by Derby during PredicateList.joinClauseTransitiveClosure() 3) Since the result set for DISPATCHER_LOCKS is RS #5, it is completely valid for the E7 method to be fetching the start key value from result set #5. 4) But there is no row in result set number 5, because the query wants to know when the DISPATCHER_LOCKS row does *NOT* exist! 5) That is, we validly flattened the NOT EXISTS subquery into the parent's two-way query between STUDIES and DICOM_SEND_REQUESTS, and we remembered enough information about the relationship between STUDIES and the correlated NOT EXISTS subquery against DISPATCHER_LOCKS, so that we know that the NestedLoopJoin from STUDIES to DISPATCHER_LOCKS should return rows from STUDIES where there is *NOT* a matching row in DISPATCHER_LOCKS, but then we tried to take the value of the (non-existing) row from DISPATCHER_LOCKS and use it to do an index-probe into DICOM_SEND_REQUESTS. So now I'm thinking that when we flattened the NOT EXISTS query into its parent, we needed to do so in a way that reflected that the correlated join predicate was NOT suitable for use in transitive closure. I think it's definitely time for somebody to start checking my work here, because I'm venturing into areas that are quite unfamiliar to me. Can somebody who is comfortable with the concepts of - query flattening - correlated NOT EXISTS subqueries, and - transitive closure of equijoins have a look at this query and offer an opinion about whether it was valid for Derby to flatten and then transitively close the join in this fashion? Hi Bryan,
I did some tracing through the relevant areas of code and it sounds to me like you are very much on track. I was able verify all of the observations (#1 thru #5) from your previous comment, and I confirmed that the NOT EXISTS predicate gives us a non-existent row with which we then try to join to DICOM_SEND_REQUESTS. So I think your investigation here has been great. Based on what you have posted so far, I tend to agree that the join transitive closure should not have created the new join predicate. If you'd like to consider that approach as a solution to this issue, I think it might be a good one. One other potential consideration is that maybe the NOT EXISTS subquery, which is correlated to the outer query, should not have been flattened in the first place? I'm not sure what the implications of correlation and subquery flattening is, but that might be something worth looking up in more detail. In any event, thank you very much for all of your time and effort on this one. I do think you are making great progress here, and it seems like a solution is starting to take shape... Army, thanks very much for the timely feedback! I tried a simple
change to PredicateList.joinClauseTransitiveClosure to omit NOT EXISTS join predicates from consideration for closure, and the initial results were promising: no crash, and I think I am getting the right results. Attached is noClosureOfNotExists.diff, which is not for commit, just to share the work I've done so far. I'll continue testing this patch, to see if I can improve my confidence that it delivers the right results and doesn't break other processing. noClosureOfNotExists patch achieves a 100% pass rate on "ant junitreport".
So that's a good 1st-order check that it doesn't break anything obvious. I no longer think this bug is related to DERBY-2916 or DERBY-2775, so I removed those links.
Attached is 'repro2.java'.
I spent a bunch of time trying to write a "from scratch" reproduction program to demonstrate the bug, but with no success. In the debugger, I can see that the compiler erroneously inserts the transitive join closure predicate, but I haven't been able to force the optimizer to end up with a query plan where the bad predicate matters. I was thinking I might try to use the optimizer overrides feature to force the join plan that I want, but I need to go study that feature first. In the meantime, I attached the first version of the repro program in case somebody else can see how to trick it into reproducing the bug. I've succeeded in using the optimizer overrides to force nested loop joins,
and to force the use of the secondary index access paths, but I haven't yet figured out how to force the join order that I want using the optimizer overrides. It's tricky because the optimizer overrides are specified on the query before it's been flattened, but I want to force a particular join order after flattening. > It's tricky because the optimizer overrides are specified on the query
> before it's been flattened, but I want to force a particular join order > after flattening. Hmm...not sure if that's possible with current override support or not? One thing that influences join order is the relative size of the tables in question. I noticed that for the original database, the row counts are: DSR: 3240 ST: 20 dispatcher_locks: 0 while in the database for "repro2" they are: DSR: 1005 ST: 10003 dispatcher_locs: 0 Notice how DSR is 15x as large as ST in the first case, while ST is 10x as large as DSR in the second case. Perhaps that has something to do with why the optimizer chooses different join orders for the two repros...? Thanks for the suggestion, Army. I've been tinkering with the table sizes in the
repro. However, it doesn't seem to affect the optimizer's behavior. I don't think the optimizer is being told the correct information about the table sizes. Perhaps there is something I need to do to communicate the table sizes to the optimizer, other than simply inserting more rows into the table(s) prior to running the test? I fiddled around with repro2.java, and have attached an updated version.
Manipulating the table sizes and table row statistics in various ways didn't succeed in convincing the optimizer to choose the join order I wanted. I re-wrote the query a different way, and then I seemed to get the join order I wanted, but my query rewrite resulted in re-casting the query as 2 select nodes with 2 tables each, rather than a single node with 3 tables, and hence the compiler never peformed transitive closure of the predicates, so that query didn't tickle the bug either. Drat. I'll sleep on it, and maybe some new ideas will occur. I switched Bryan's repro to use primary and foreign key constraints instead of explicit indexes. That in turn meant that I had to update the INSERT statements to satisfy the foreign key constraints.
Once I did that, I ran the query in "repro2.java" with _no_ optimizer overrides and while it did not fail, it *did* show the desired join order, which is a step in the right direction. Then I (accidentally) swapped the order of the FROM tables in the top-level query (i.e. I put "DSR" first, then "ST", instead of the other way around), and when I did that the query failed with an NPE. Odd? The bad news is that the stack trace for *this* NPE is different from the one reported in this issue. Nonetheless, it follows the same pattern: i.e. it does not occur in 10.1.3.1, and if I back out the changes for Attaching as "repro3.java"... This new failure is very interesting! My initial reaction is that it is the same problem,
with different symptoms. With the new (swapped tables) query: - we still go through exists flattening - we still go through transitive closure, and add a (erroneous) equijoin predicate - just before the crash, we trip the mysterious "if" statement in getColumnFromRow() and return a null column value - applying the noClosureOfNotExists.diff patch removes the NPE and makes the query run correctly. I'll keep stepping through the new NPE, as well as your modified query with the previous query plan "shape", to continue to understand it better. I've persuaded myself that Army's repro3 is reproducing the same basic problem,
and so I'm going to move forward with building a complete patch using repro3 as the core regression test. I'm considering adding the new regression test to ExistsWithSetOpsTest.java, which isn't a perfect location, but I was looking for an existing JUnit test that covered much of the same areas of the code, and that one seemed not too implausible. Is there a better location for the new test? Specifically, are there existing JUnit tests that specifically cover subquery flattening? I'm also intending to open a separate JIRA to continue the investigation of the mystery "if" statement in BaseActivation.getColumnFromRow. It is my belief that that "if" statement can (and should) be removed. Thank you for persistence with this issue, Bryan. I'm glad repro3 appears to have been helpful.
> Specifically, are there existing JUnit tests that specifically cover subquery flattening? I don't see any existing JUnit tests, no. The "subqueryFlattening.sql" test seems like the most relevant place, but as that is not JUnit, I can understand why you would choose to avoid that. If adding a new JUnit test is undesirable and you do decide to extend ExistsWithSetOpsTest, do you also plan to rename that test to make it more appropriate? It would be nice to choose a name that more generally describes the types of tests in that file. I think the common ground here is EXISTS with subqueries...though not necessarily "flattening" since the queries currently in ExistsWithSetOpsTest do not perform flattening per se, they just "transform" the EXISTS subquery into something else... Attached is 'patchWithRenamedTest.diff', a patch proposal.
The 'svn stat' for this patch is: M java/engine/org/apache/derby/impl/sql/compile/FromList.java M java/engine/org/apache/derby/impl/sql/compile/PredicateList.java M java/testing/org/apache/derbyTesting/functionTests/tests/lang/_Suite.java A java/testing/org/apache/derbyTesting/functionTests/tests/lang/ExistsWithSubqueriesTest.java D java/testing/org/apache/derbyTesting/functionTests/tests/lang/ExistsWithSetOpsTest.java The patch contains a regression test for this issue, modeled on the repro3.java test case and inserted into the (renamed) ExistsWithSubqueriesTest.java. The core of the patch is the same, which is to notice that a join predicate mentions a table which has the NOT EXISTS usage, and to exclude that predicate from transitive join consideration. I think this patch is worthy of review; please have a look when you have a chance. Thank you for the patch, Bryan. The diff is small and contained and looks functionally correct to me. I ran the new regression test without your patch and it failed, and with your patch it succeeds.
I do think it might be nice to add explanatory comments to the relevant engine code, so that future readers know *why* it is that we are checking for a "NOT EXISTS" query. If you'd rather not add comments to the transitive closure code itself, you could perhaps add comments to the FromList.tableNumberIsNotExists() method explaining the scenario in which it is used. I think there are ample comments in the new ExistsWithSubqueriesTest file, so maybe you could just copy the important bits into the engine code comments, as well? Aside from that, and assuming derbyall and suites.All ran cleanly, I'm +1 to the commit of this patch. Many thanks for picking this one up, and for putting so much time into tracking down the cause/solution! I agree, comments are definitely important for this change.
"I meant to add them before but forgot" :) Thanks for catching this! Attached is a revised patch; the only change is improved comments in FromList and PredicateList. I'm intending to run a complete set of tests, verify that javadoc builds cleanly, and then submit this patch. Continued feedback is very welcome of course. derbyall and junit suites.All were 100% clean.
generated javadoc looks ok, now that I have my jdk 1.5 build envt set up.
Committed the patchWithImprovedComments to the trunk as revision 581943.
I intend to merge this patch back to the 10.3 line in the next week. I merged the trunk patch back to the 10.3 branch without issues, and my build and test runs were clean, so I committed the patch to the 10.3 branch as revision 582439.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
If not, can you prove any details (table, query) so we can try to reproduce your issue?