Issue Details (XML | Word | Printable)

Key: DERBY-3033
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Bryan Pendleton
Reporter: Haolan Qin
Votes: 1
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

select query results in nullpointer exception in skipScan()

Created: 29/Aug/07 08:35 PM   Updated: 30/Jun/09 03:55 PM
Return to search
Component/s: SQL
Affects Version/s: 10.2.2.0
Fix Version/s: 10.3.2.1, 10.4.1.3

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works bug4736.sql 2007-09-14 11:29 PM Bryan Pendleton 1 kB
Text File Licensed for inclusion in ASF works d3033-sane-ij-session-10.3.1.5.txt 2007-08-31 01:45 AM Dag H. Wanvik 8 kB
Text File Licensed for inclusion in ASF works generatedActivation.txt 2007-09-19 10:13 PM Bryan Pendleton 11 kB
File Licensed for inclusion in ASF works noClosureOfNotExists.diff 2007-09-21 01:48 AM Bryan Pendleton 2 kB
File Licensed for inclusion in ASF works patchWithImprovedComments.diff 2007-10-02 02:59 AM Bryan Pendleton 43 kB
File Licensed for inclusion in ASF works patchWithRenamedTest.diff 2007-09-29 03:33 PM Bryan Pendleton 41 kB
File query_plan.new 2007-09-12 12:10 AM Kathey Marsden 10 kB
File query_plan.old 2007-09-12 12:10 AM Kathey Marsden 7 kB
Text File Licensed for inclusion in ASF works queryPlanRS.txt 2007-09-19 10:13 PM Bryan Pendleton 1 kB
Java Source File Licensed for inclusion in ASF works repro2.java 2007-09-25 04:02 AM Bryan Pendleton 9 kB
Java Source File Licensed for inclusion in ASF works repro2.java 2007-09-23 03:57 PM Bryan Pendleton 7 kB
Java Source File Licensed for inclusion in ASF works repro3.java 2007-09-25 04:52 PM A B 9 kB
File test.rar 2007-08-30 03:57 PM Haolan Qin 946 kB
Zip Archive test.zip 2007-09-03 03:11 PM Dag H. Wanvik 1.16 MB
Zip Archive viewer_10_1.zip 2007-09-04 01:58 PM Kathey Marsden 803 kB
Environment: Windows XP, Java 5.0, JDBC, Derby 10.2.2.0
Issue Links:
Reference

Bug behavior facts: Regression
Resolution Date: 06/Oct/07 02:58 AM


 Description  « Hide
The following error was repeatedly thrown when we tried to run a select query via JDBC. Strangely, the exact same select query did not trigger any error when run from the command line console. After we added an index, the error went away completely.

java.lang.NullPointerException
 at org.apache.derby.impl.sql.execute.NoPutResultSetImpl.skipScan(Unknown Source)
 at org.apache.derby.impl.sql.execute.TableScanResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.JoinResultSet.openRight(Unknown Source)
 at org.apache.derby.impl.sql.execute.JoinResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.SortResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(Unknown Source)
 at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.open(Unknown Source)
 at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(Unknown Source)
 at org.apache.derby.impl.drda.DRDAStatement.execute(Unknown Source)
 at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
 at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Dag H. Wanvik added a comment - 29/Aug/07 09:55 PM
Are you able to provide a program that reproduces the error?
If not, can you prove any details (table, query) so we can try to reproduce your issue?

Haolan Qin added a comment - 30/Aug/07 03:57 PM
1. copy the attached "viewer" folder to your derby folder
2. launch derby on port 9002
3. java -cp .;./derbyclient-10.2.2.0.jar SkipScan

You will see the exception in the console.

The exception will NOT be thrown if you modify the java file to hard-code those numbers (4 and 100) in the query or if you run the query from a sql console window. Also, if you add an index on dicom_send_requests.send_date and run the same program, the exception will not be thrown.

If you turn on execution plan logging, you will see that the exception scenario has a different execution plan from all other scenarios.

Dag H. Wanvik added a comment - 30/Aug/07 09:10 PM
Thank you, Haolan! I was able to reproduce the error with your repro!


Dag H. Wanvik added a comment - 30/Aug/07 10:16 PM
More data points:
- happens with embedded driver as well
- happens when database is running in soft upgrade to 10.3
- happens in 10.3 after hard upgrade
- happens when using ij/embedded with prepared statement thus:

  ij> connect 'jdbc:derby:viewer;user=view;password=view';
  ij> set schema viewer;
  ij> prepare foo as 'select study_id, number_of_images from (select distinct st.study_id,
       st.number_of_images, dsr.priority, st.creation_datetime from dicom_send_requests dsr,
       studies st where dsr.send_date is null and dsr.workstation_id = ? and
       dsr.study_id = st.study_id and not exists ( select 1 from dispatcher_locks where
       dispatcher_locks.study_id = st.study_id and dispatcher_locks.workstation_id = ? and
       dispatcher_locks.dispatcher_id = ? )) temp';
  ij> execute foo using 'values (4,4,100)';
  ERROR XJ001: Java exception: ': java.lang.NullPointerException'

Without prepared statement (direct execution) there is no error.

I will leave this one the the execution experts ;-)

Bryan Pendleton added a comment - 31/Aug/07 12:38 AM
Hi Dag, since you have it reproduced, can you re-post the stack trace with the actual line numbers, and maybe also post the line of source code where the NPE actually occurs? Thanks!

Dag H. Wanvik added a comment - 31/Aug/07 01:45 AM
Sure, Bryan!

As far as I can the offending line is
is NoPutResultSetImpl.java:392 at head of 10.3 branch,
(svn 571351):

392: if (startPosition.getColumn(position + 1).isNull())

Uploading stack trace "d3033-sane-ij-session-10.3.1.5.txt" with
sysinfo and ij session showing the repro with trace w/line numbers.

Bryan Pendleton added a comment - 31/Aug/07 02:06 AM
Thanks Dag. This is an interesting line of code, and an interesting stack trace, particularly given the observation that when the original reporter added an index, the problem went away. I'm wondering if this problem is related to DERBY-2916 and DERBY-2775.

Does the query in the bug involve an OUTER JOIN? or contain an IS [NOT] NULL condition? And is this condition related to the index that was added which causes the problem to go away?

At the point where TableScanResultSet.openCore calls NoPutResultSetImpl.skipScan, there is the following comment:

// Check whether there are any comparisons with unordered nulls
// on either the start or stop position. If there are, we can
// (and must) skip the scan, because no rows can qualify

It seems like DERBY-2775 may have caused us to take some new paths through the code involving "ordered null semantics", and the skipScan method is at the core of that code (see the DERBY-2916 comments for more discussion of this), so I have an intuition that this problem is related to those changes.

For now, I'll link this issue to those issues, in the hope that my intuition is well-founded.

Dag H. Wanvik added a comment - 31/Aug/07 02:26 AM
Yes, you are correct. The query contains this predicate
"where dsr.send_date is null and .." and column "send_date" is
the column for which the index was added to make the problem go away.

Kathey Marsden added a comment - 31/Aug/07 05:12 PM - edited
I don't have a tool to open the rar file for this issue. I was wondering if someone could repackage it as a zip or jar file. I'd like to try it on 10.1 to see if it is a regression.

Dag H. Wanvik added a comment - 03/Sep/07 03:09 PM
Repacked to test.zip.
I think this is a 10.2 database, so you may need to recreate the database
with dblook and export import to get a repro for 10.1, not sure.

Dag H. Wanvik added a comment - 03/Sep/07 03:11 PM
Deleted first attchement. Reattaching without ASF license since original was without.

Kathey Marsden added a comment - 04/Sep/07 01:56 PM
Looks like this is a regression in 10.2. The results with 10.1.2.1 are
> java SkipScan
dfdfddd

I will attach the 10.1 database which I migrated back using ddlutils

Kathey Marsden added a comment - 04/Sep/07 01:58 PM
v 10.1 database. Issue does not reproduce with 10.1.2.1

Kathey Marsden added a comment - 11/Sep/07 03:40 AM - edited
This regression occurred with revision 423754 from DERBY-1357. It could very well be that the optimizer changes made in that issue are fine but pushed this case into a different plan which exposed the issue.


Kathey Marsden added a comment - 12/Sep/07 12:10 AM
A little more historical information on this issue.
1) Army suggested I try backing all the way back to revision 81858 before DERBY-805 and apply the DERBY-1357 to make sure there was not a problem in DERBY-805 causing the issue. I did that and found that the NPE still reproduced, so it looks like DERBY-805 had no effect.

2) I printed the plans before/after DERBY-1357. query_plan.old is before 1357 (without the NPE). query_plan.new is after DERBY-1357 with the NPE.



Kathey Marsden added a comment - 12/Sep/07 03:44 PM
I was wondering in the skipScan code, should the column values of startPosition ever be null? In this case column[0] is null, but I was wondering if that is ever expected.

startPosition IndexRow (id=136)
column DataValueDescriptor[1] (id=149)
[0] null
ncols 1
orderedNulls boolean[1] (id=150)
[0] false

Just to try to understand the impact I tried this small change, which interestingly caused the test case and suiltes.All to pass, but I don't really understand what it means for the column value to be null.
I am guessing this is not the correct solution but rather there is some sort of problem in generating the column value for startPosition. Just want to check though in case there is a circumstance where the column value should be null and this change is ok.

Index: java/engine/org/apache/derby/impl/sql/execute/NoPutResultSetImpl.java
===================================================================
--- java/engine/org/apache/derby/impl/sql/execute/NoPutResultSetImpl.java (revision 574932)
+++ java/engine/org/apache/derby/impl/sql/execute/NoPutResultSetImpl.java (working copy)
@@ -388,7 +388,8 @@
                        {
                                if (startKeyLonger)
                                        checkNullCols[cncLen++] = position + 1;
- if (startPosition.getColumn(position + 1).isNull())
+ DataValueDescriptor d = startPosition.getColumn(position +1);
+ if (d == null || d.isNull())
                                {
                                        returnValue = true;
                                        if (! startKeyLonger)


Thanks for the help. I am certainly out of my element with this code, so appreciate any help or pointers anyone can offer.

Kathey


A B added a comment - 12/Sep/07 04:32 PM
Hi Kathey,

My guess (and it's just a guess) is that, while startPosition itself may be null, it probably is not correct for a non-null startPosition to have startPosition[i] be null for any i.

For what it's worth: startPosition in the above code comes from TableScanResultSet.startPosition, which is in turn loaded form TableScanResultSet.startKeyGetter. The startKeyGetter is, I believe, generated from PredicateList.generateStartKey(). So if you are able to trace through that method to see which predicate is causing the "null" to be generated, that might be a good starting point. Perhaps the short-circuit logic added for DERBY-1357 is somehow leaving one or more predicates in a state that causes incorrect generation of a null (instead of a DataValueDescriptor that is "isNull()")? Again, I'm just guessing--maybe that's helpful, maybe not...

Kathey Marsden added a comment - 13/Sep/07 09:17 PM
In BaseActivation I see

protected final DataValueDescriptor getColumnFromRow(int rsNumber, int colId)
throws StandardException {

        if( row[rsNumber] == null)
        {
            /* This actually happens. NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values
             * into a cache. This fails in some deeply nested joins. See Beetle 4736 and 4880.
             */
            return null;
        }
return row[rsNumber].getColumn(colId);
}

I am not sure why prefetching invariant values would cause them to be null, but looks like a good lead.



Kathey Marsden added a comment - 14/Sep/07 06:49 PM
FYI: Here is the case from bug 4736 that was giving an NPE in getColumnFromRow before the fix to check for a null row was added. I tend to think that fix was wrong, that there is some root cause for the null row which needs to be addressed instead, and hopefully that is the same root cause for the DERBY-3033 test case. I am off of playing with DERBY-3033 for a while. Hope someone else can pick this up given what Haolan, Army and I have found so far.

Kathey


create table a (a1 int not null primary key, a2 int, a3 int, a4
int, a5 int, a6 int);
create table b (b1 int not null primary key, b2 int, b3 int, b4
int, b5 int, b6 int);
create table c (c1 int not null, c2 int, c3 int not null, c4
int, c5 int, c6 int);
create table d (d1 int not null, d2 int, d3 int not null, d4
int, d5 int, d6 int);

alter table c add primary key (c1,c3);
alter table d add primary key (d1,d3);

insert into a values
(1,1,3,6,NULL,2),(2,3,2,4,2,2),(3,4,2,NULL,NULL,NULL),
                     
(4,NULL,4,2,5,2),(5,2,3,5,7,4),(7,1,4,2,3,4),
                     (8,8,8,8,8,8),(6,7,3,2,3,4);

insert into b values
(6,7,2,3,NULL,1),(4,5,9,6,3,2),(1,4,2,NULL,NULL,NULL),
                     
(5,NULL,2,2,5,2),(3,2,3,3,1,4),(7,3,3,3,3,3),(9,3,3,3,3,3);

insert into c values
(3,7,7,3,NULL,1),(8,3,9,1,3,2),(1,4,1,NULL,NULL,NULL),
                     
(3,NULL,1,2,4,2),(2,2,5,3,2,4),(1,7,2,3,1,1),(3,8,4,2,4,6);

insert into d values
(1,7,2,3,NULL,3),(2,3,9,1,1,2),(2,2,2,NULL,3,2),
                     
(1,NULL,3,2,2,1),(2,2,5,3,2,3),(2,5,6,3,7,2);



select a1,b1,c1,c3,d1,d3
  from D join (A left outer join (B join C on b2=c2) on a1=b1)
on d3=b3 and d1=a2;

A1 |B1 |C1 |C3
|D1 |D3
----------------------------------------------------------------
-------------------------------
1 |1 |4 |2
|1 |2
ERROR 38000: The exception 'java.lang.NullPointerException' was
thrown while evaluating an expression.
ERROR 38000: The exception 'java.lang.NullPointerException' was
thrown while evaluating an expression.
 at
com.ibm.db2j.protocol.Database.Errors.StandardException.newExcep
tion(StandardException.java:386)
 at
com.ibm.db2j.protocol.Database.Errors.StandardException.unexpect
edUserException(StandardException.java:567)
 at
com.ibm.db2j.impl.BasicServices.ClassManager.Reflect.DirectCall.
invoke(ReflectGeneratedClass.java:118)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.GenericQua
lifier.getOrderable(GenericQualifier.java:87)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.HashScanRe
sultSet.getNextRowCore(HashScanResultSet.java:477)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.NestedLoop
JoinResultSet.getNextRowCore(NestedLoopJoinResultSet.java:116)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.ProjectRes
trictResultSet.getNextRowCore(ProjectRestrictResultSet.java:249)

 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.BasicNoPut
ResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:479)
 at
com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.movePos
ition(LocalResultSet.java:297)
 at
com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.next(Lo
calResultSet.java:251)
 at
com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp
layUtil.java:335)
 at
com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp
layUtil.java:272)
 at
com.ibm.db2j.util.JDBCDisplayUtil.DisplayResults(JDBCDisplayUtil
.java:261)
 at
com.ibm.db2j.tools.ijImpl.utilMain.displayResult(utilMain.java:3
66)
 at
com.ibm.db2j.tools.ijImpl.utilMain.doCatch(utilMain.java:419)
 at com.ibm.db2j.tools.ijImpl.utilMain.go(utilMain.java:295)
 at com.ibm.db2j.tools.ijImpl.Main.go(Main.java:196)
 at com.ibm.db2j.tools.ijImpl.Main.mainCore(Main.java:163)
 at com.ibm.db2j.tools.ijImpl40.Main.main(Main.java:49)
 at com.ibm.db2j.tools.ij.main(ij.java:57)
ERROR XJ001: Java exception: ':
java.lang.NullPointerException'.
java.lang.NullPointerException
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.BaseActiva
tion.getColumnFromRow(BaseActivation.java:1234)
 at
db2j.exe.ac601a400fx00f0xe017x7cc0x007f0000010012c.e8(Unknown
Source)
 at
com.ibm.db2j.impl.BasicServices.ClassManager.Reflect.DirectCall.
invoke(ReflectGeneratedClass.java:109)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.GenericQua
lifier.getOrderable(GenericQualifier.java:87)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.HashScanRe
sultSet.getNextRowCore(HashScanResultSet.java:477)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.NestedLoop
JoinResultSet.getNextRowCore(NestedLoopJoinResultSet.java:116)
 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.ProjectRes
trictResultSet.getNextRowCore(ProjectRestrictResultSet.java:249)

 at
com.ibm.db2j.impl.Database.Language.Execution.Generic.BasicNoPut
ResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:479)
 at
com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.movePos
ition(LocalResultSet.java:297)
 at
com.ibm.db2j.impl.Connectivity.JDBC.Local.LocalResultSet.next(Lo
calResultSet.java:251)
 at
com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp
layUtil.java:335)
 at
com.ibm.db2j.util.JDBCDisplayUtil.indent_DisplayResults(JDBCDisp
layUtil.java:272)
 at
com.ibm.db2j.util.JDBCDisplayUtil.DisplayResults(JDBCDisplayUtil
.java:261)
 at
com.ibm.db2j.tools.ijImpl.utilMain.displayResult(utilMain.java:3
66)
 at
com.ibm.db2j.tools.ijImpl.utilMain.doCatch(utilMain.java:419)
 at com.ibm.db2j.tools.ijImpl.utilMain.go(utilMain.java:295)
 at com.ibm.db2j.tools.ijImpl.Main.go(Main.java:196)
 at com.ibm.db2j.tools.ijImpl.Main.mainCore(Main.java:163)
 at com.ibm.db2j.tools.ijImpl40.Main.main(Main.java:49)
 at com.ibm.db2j.tools.ij.main(ij.java:57)
ij>

Bryan Pendleton added a comment - 14/Sep/07 11:29 PM
Hi Kathey, thanks for the info about bug 4736.

I took your pasted code and added 4 drop table statements to the front of it so I could run it repeatedly; it's attached as bug4736.sql.

I ran that code repeatedly with my debugger set with a breakpoint in BaseActivation.getColumnFromRow() with a breakpoint set when
row[rsNumber] was null (i.e., inside the "if" that you noted).

Unfortunately, I never hit the breakpoint.

I also tried removing that "if" statement from getColumnFromRow(), to see
if the bug4736.sql would then provoke a NullPointerException, and that
didn't happen either.

So, whatever was going on with that "if" test, the bug4736.sql script no longer drives the conditions which cause row[rsNumber] to be null.

I think you've got a good theory, though, and I'll continue to play with it, to try to understand that "if" statement in getColumnFromRow() better.

Bryan Pendleton added a comment - 17/Sep/07 10:41 PM
I removed the mysterious "if" statement from BaseActivation.getColumnFromRow(), and ran
all the regression tests. There were two failures, but I'm not sure if they were due to this
change or not. I'll investigate them further. I did *not* get any NullPointerException messages,
which is what I was sort of expecting to see.

So if removing the "if" statement from BaseActivation.java fixes this problem, it apparently
isn't going to cause a whole lot of other problems.

I think the next step is to investigate whether removing the "if" statement makes the repro script
pass.

Bryan Pendleton added a comment - 18/Sep/07 12:28 AM
I played around with the skipScan repro case and confirmed Kathey's suspicion about
BaseActivation.getColumnFromRow, as follows:
 - with the current trunk, the repro case gives me the expected NPE in skipScan.
 - with the "if" statement removed from BaseActivation.getColumnFromRow, the
   repro case gives me a NPE which matches the NPE that Kathey posted from the
   old bug 4736. So this reproduction case definitely takes the code through a
  similar code path as that of bug 4736.

Here's the stack trace I get with the current trunk with the "if" removed from getColumnFromRow:

2007-09-18 00:18:51.793 GMT Thread[DRDAConnThread_2,5,main] (XID = 352997), (SESSIONID = 0), (DATABASE = viewer), (DRDAID = NF000001.B54B-810083792898537651{1}), Failed Statement is: select study_id, number_of_images from (select distinct st.study_id, st.number_of_images, dsr.priority, st.creation_datetime from dicom_send_requests dsr, studies st where dsr.send_date is null and dsr.workstation_id = ? and dsr.study_id = st.study_id and not exists ( select 1 from dispatcher_locks where dispatcher_locks.study_id = st.study_id and dispatcher_locks.workstation_id = ? and dispatcher_locks.dispatcher_id = ? ) ) temp with 3 parameters begin parameter #1: 4 :end parameter begin parameter #2: 4 :end parameter begin parameter #3: 100 :end parameter
ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression.
        at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:294)
        at org.apache.derby.iapi.error.StandardException.unexpectedUserException(StandardException.java:554)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:164)
        at org.apache.derby.impl.sql.execute.TableScanResultSet.openCore(TableScanResultSet.java:258)
        at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.openCore(IndexRowToBaseRowResultSet.java:225)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168)
        at org.apache.derby.impl.sql.execute.JoinResultSet.openRight(JoinResultSet.java:272)
        at org.apache.derby.impl.sql.execute.JoinResultSet.openCore(JoinResultSet.java:151)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168)
        at org.apache.derby.impl.sql.execute.SortResultSet.openCore(SortResultSet.java:248)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168)
        at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.open(BasicNoPutResultSetImpl.java:248)
        at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:370)
        at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1225)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1649)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1304)
        at org.apache.derby.impl.drda.DRDAStatement.execute(DRDAStatement.java:666)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:824)
        at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:275)
Caused by: java.lang.NullPointerException
        at org.apache.derby.impl.sql.execute.BaseActivation.getColumnFromRow(BaseActivation.java:1317)
        at org.apache.derby.exe.ac601a400fx0115x15fbx3163xffff9ef86e6e1.e7(Unknown Source)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:153)
        ... 16 more
============= begin nested exception, level (1) ===========
java.lang.NullPointerException
        at org.apache.derby.impl.sql.execute.BaseActivation.getColumnFromRow(BaseActivation.java:1317)
        at org.apache.derby.exe.ac601a400fx0115x15fbx3163xffff9ef86e6e1.e7(Unknown Source)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:153)
        at org.apache.derby.impl.sql.execute.TableScanResultSet.openCore(TableScanResultSet.java:258)
        at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.openCore(IndexRowToBaseRowResultSet.java:225)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168)
        at org.apache.derby.impl.sql.execute.JoinResultSet.openRight(JoinResultSet.java:272)
        at org.apache.derby.impl.sql.execute.JoinResultSet.openCore(JoinResultSet.java:151)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168)
        at org.apache.derby.impl.sql.execute.SortResultSet.openCore(SortResultSet.java:248)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.openCore(ProjectRestrictResultSet.java:168)
        at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.open(BasicNoPutResultSetImpl.java:248)
        at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:370)
        at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1225)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1649)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1304)
        at org.apache.derby.impl.drda.DRDAStatement.execute(DRDAStatement.java:666)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:824)
        at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:275)

A B added a comment - 18/Sep/07 03:31 PM
> I played around with the skipScan repro case and confirmed Kathey's suspicion
> about BaseActivation.getColumnFromRow

Thank you for your continued investigation of this issue, Bryan. It's great to have an idea of the codepath in question here. Have you had the time/inclination to look into the cause of the null value in row[rsNumber]? I realize that's easy to say and quite hard to do, but I was just wondering if you've looked at it, or if you have the time and plans to do so?

In particular, I'm referring to the code comment that Kathey mentioned:

  "NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values
   into a cache. This fails in some deeply nested joins."

I think the core question here is *why* does "this fail in some deeply nested joins"? That sounds awfully suspicious. It seems that either a) the "failure" is acceptable and the comments need to be updated, in which case Kathey's "quick fix" of checking for the null value would be appropriate; or b) the "failure" is a bug and in order to solve this issue, that failure should (ideally) be fixed...

Would you agree?

Bryan Pendleton added a comment - 18/Sep/07 06:33 PM
Hi Army, thanks for the feedback! I marked the issue as assigned to me, because I'm trying to understand it, but progress is slow.

I believe that the null value in row[rsNumber] is coming from a request to fetch column 1 of result set 5. Here's the original query for context:

    select study_id, number_of_images from (select distinct st.study_id,
       st.number_of_images, dsr.priority, st.creation_datetime from dicom_send_requests dsr,
       studies st where dsr.send_date is null and dsr.workstation_id = ? and
       dsr.study_id = st.study_id and not exists ( select 1 from dispatcher_locks where
       dispatcher_locks.study_id = st.study_id and dispatcher_locks.workstation_id = ? and
       dispatcher_locks.dispatcher_id = ? )) temp

Result set 5, I believe, is the "select 1 from dispatcher_locks" nested query, and column 1
of that result set is dispatcher_locks.study_id.

The null value occurs at the very beginning of the first time that we are opening a TableScanResultSet.
We are setting up a NestedLoopJoinResultSet, and the right side of that join is a index-to-base probe,
and so we are trying to set the start position for a particular "probe" into the index for a particular
study_id value. sameStartStopPosition is true, as the join on study_id is a simple equality join.

The code in question, near the top of TableScanResultSet.openCore(), is:

if (startKeyGetter != null)
{
startPosition = (ExecIndexRow) startKeyGetter.invoke(activation);
if (sameStartStopPosition)
{
stopPosition = startPosition;
}
}
if (stopKeyGetter != null)
{
stopPosition = (ExecIndexRow) stopKeyGetter.invoke(activation);
}

The TableScanResultSet instance that we are processing is for an index-to-base probe of
the dicom_send_requests table via the sendreq_study_id_fk secondary index.

I'm thinking that, although the external form of the SQL SELECT statement is:
 - dicom_send_requests joins to studies on study_id, and
 - studies joins to dispatcher_locks on study_id
the optimizer may have re-arranged this and selected a different join order so that the actual
query is more like:
 - dispatcher_locks joins to dicom_send_requests on study_id, and
 - dicom_send_requests joins to studies on study_id

And then, perhaps because the dispatcher_locks query is a "select 1" query, the
execution of the query against dispatcher_locks hasn't fetched any actual data, and so
the column is coming back null.

So far, however, I haven't seen any code path that is going
through NoPutResultSetImpl.clearOrderableCache, so I don't any information
to share regarding the code comment that Kathey observed.

A B added a comment - 18/Sep/07 07:10 PM
> I haven't seen any code path that is going through
> NoPutResultSetImpl.clearOrderableCache, so I don't
> have any information to share regarding the code
> comment [...]

It's possible that the comment is just a red herring--if clearOrderableCache() is not in play, then there's probably no point going further down that road. The information that you provided above certainly has potential--great investigation.

> the optimizer may have re-arranged this

To add to this: I think that for the optimizer to re-arrange the join order between dispatcher_locks, dicom_send_requests, and studies, all of those result sets must (I *think*?) appear in the same compile-time FromList. That means that the NOT EXISTS subquery will have to be "flattened" into the "select distinct st.study_id, ..." query. So it may be useful to look at SubqueryNode.preprocess() to see a) if the NOT EXISTS subquery is in fact being flattened, and b) if the column references that result from that flattening are set up correctly.

If the NOT EXISTS subquery is in fact being flattened, it might be good to stop the debugger just after "modification of access paths" has occured on the relevant FromList (the one containing dispatcher_locks, dicom_send_requests, and studies) to see what join order (and join strategies) the optimizer has chosen. Perhaps that'll provide some insight as to whether or not the optimizer re-arrangement is in play here...?

Of course I'm just offering top-of-my-head suggestions; feel free to ignore and continue with your current path of investigation, as that certainly seems to be yielding some very useful information...

Bryan Pendleton added a comment - 18/Sep/07 09:01 PM
Thanks for the pointer to SubqueryNode.preprocess; we're definitely flattening this query.
The code goes through the flattenToExistsJoin() path on line 742 of SubqueryNode.java.

And, to substantiate that more, it's clear from close examination of the data structures in
the debugger that the overall join tree at execution time is:

  (Studies NestedLoop Dispatcher_locks) NestedLoop dicom_send_requests

And at the point of the crash we've:
 - fetched the first row from Studies,
 - joined it against Dispatcher_locks (discovering that no matching row exists, so NOT EXISTS is true)
 - taken the resulting row and are ready to join it against dicom_send_requests.

I'm wondering if the result set book-keeping gets messed up during flattening. At the point
that we call getColumnFromRow() to fetch a value from result set 5, it looks to me like
the information in row[3] is the correct information that we should be accessing. That is,
when I back up the stack to the NestedLoopJoinResultSet frame, and look at the value
of "leftRow", which is the row that (I think) should be providing the values that we are about
to use for the start/stop position computations in the index scan positioning, the leftRow
instance from that join matches up with row[3] in the BaseActivation generated instance.

So I'll look at how the result_set numbering is handled during flattening.

Bryan Pendleton added a comment - 19/Sep/07 05:47 PM
I've been trying to understand the structure and content of the generated code for
the activation, by using derby.debug.true=DumpClassFile and decompiling the
resulting class files.

I can see the generated startKeyGetter method for the TableScanResultSet
which accesses dicom_send_requests via the sendreq_study_id_fk index:

    public Object e7()
        throws StandardException
    {
        e8.setColumn(1, (NumberDataValue)getColumnFromRow(5, 1));
        return e8;
    }

And I can see the generated code in fillResultSet which is setting up the
TableScanResultSet to make this call:

getResultSetFactory().getTableScanResultSet(
                                this, 1457, 21, getMethod("e6"), 9, getMethod("e7"),
                                1, null, -1, true, e9, "DICOM_SEND_REQUESTS", null,
                                "SENDREQ_STUDY_ID_FK", true, false, -1, -1, 6,
                                false, 0, false, 1.2960000000000003D, 265.16399630200306D),

All of the generated code looks reasonable and valid, though I don't have
much experience reading the generated code.

I'm now wondering if this is an order-of-operations thing. During the nested loop
join, the following event should occur:
 - the current row from the outer side of the join should be fetched
 - the probe for the inner side of the join should be set up
 - the values from the outer side of the join should be passed to the
   index positioning logic on the inner side.

It is the "startKeyGetter" method that does that last step, but I wonder if it's
trying to access the start key values before the project/restrict processing has
pushed the values of the current row up from the outer side of the join.
\


Bryan Pendleton added a comment - 19/Sep/07 08:21 PM
I'm sorry, Army, I misled you. The code path *does* go through
NoPutResultSetImpl.clearOrderableCache, my mistake. I just
wasn't stepping enough times in the debugger :)

The comment in getColumnFromRow() is starting to make a little bit more sense now.

It still seems like it's wrong for the positioned table scan to be
opened without the position value having been set, though, so
I'll keep plugging away reading code... :)

Bryan Pendleton added a comment - 19/Sep/07 10:13 PM
Hmmm... Now I'm back to thinking that the result number management
has gone astray during the query flattening.

Attached is generatedActivation.txt, a de-compilation of the
generated activation for the query that fails.

Also attached is queryPlanRS.txt, which is a poor-man's-graphical
view of the query plan, focusing specifically on the result sets and their
result set number (in parentheses next to the result set)

I've also specifically called out the two startKeyGetter()
methods, "e2" and "e7".

E2, which is used for the index probe into DISPATCHER_LOCKS
via DISPATCHER_LOCKS_SID_FK, accesses the start key value
using result set #4 (see the E2 code in generatedActivation.txt),
which seems correct to me.

E7, however, which is used for the index probe into DICOM_SEND_REQUESTS
via SENDREQ_STUDY_ID_FK, accesses the start key value using
result set #5. But this seems wrong. I believe that result set #3
should have been used in the code which is generated into
startKeyGetter method E7.

I do believe I'm closer, but I'm sure not there yet.

I wouldn't mind somebody checking this work, if anybody is interested.

Bryan Pendleton added a comment - 20/Sep/07 09:27 PM
I'm thinking about the interaction between NOT EXISTS query flattening and equijoin transitive closure,
and I'm getting myself all confused. Here's what I've been thinking:

1) The join predicate which causes startKeyGetter method E7 to be
generated is an equijoin between DISPATCHER_LOCKS.STUDY_ID
and DICOM_SEND_REQUESTS.STUDY_ID.

2) This predicate was NOT expressed in the original query as
an explicit condition in the WHERE clause; rather, it was
generated by Derby during PredicateList.joinClauseTransitiveClosure()

3) Since the result set for DISPATCHER_LOCKS is RS #5, it is
completely valid for the E7 method to be fetching the start key value
from result set #5.

4) But there is no row in result set number 5, because the query wants
to know when the DISPATCHER_LOCKS row does *NOT* exist!

5) That is, we validly flattened the NOT EXISTS subquery into the
parent's two-way query between STUDIES and DICOM_SEND_REQUESTS,
and we remembered enough information about the relationship between
STUDIES and the correlated NOT EXISTS subquery against
DISPATCHER_LOCKS, so that we know that the NestedLoopJoin
from STUDIES to DISPATCHER_LOCKS should return rows from STUDIES
where there is *NOT* a matching row in DISPATCHER_LOCKS, but
then we tried to take the value of the (non-existing) row from
DISPATCHER_LOCKS and use it to do an index-probe into DICOM_SEND_REQUESTS.

So now I'm thinking that when we flattened the NOT EXISTS query
into its parent, we needed to do so in a way that reflected that the
correlated join predicate was NOT suitable for use in transitive closure.

I think it's definitely time for somebody to start checking my work here,
because I'm venturing into areas that are quite unfamiliar to me. Can
somebody who is comfortable with the concepts of
 - query flattening
 - correlated NOT EXISTS subqueries, and
 - transitive closure of equijoins
have a look at this query and offer an opinion about whether it was valid
for Derby to flatten and then transitively close the join in this fashion?

A B added a comment - 21/Sep/07 12:06 AM
Hi Bryan,

I did some tracing through the relevant areas of code and it sounds to me like you are very much on track. I was able verify all of the observations (#1 thru #5) from your previous comment, and I confirmed that the NOT EXISTS predicate gives us a non-existent row with which we then try to join to DICOM_SEND_REQUESTS.

So I think your investigation here has been great. Based on what you have posted so far, I tend to agree that the join transitive closure should not have created the new join predicate. If you'd like to consider that approach as a solution to this issue, I think it might be a good one. One other potential consideration is that maybe the NOT EXISTS subquery, which is correlated to the outer query, should not have been flattened in the first place? I'm not sure what the implications of correlation and subquery flattening is, but that might be something worth looking up in more detail.

In any event, thank you very much for all of your time and effort on this one. I do think you are making great progress here, and it seems like a solution is starting to take shape...

Bryan Pendleton added a comment - 21/Sep/07 01:48 AM
Army, thanks very much for the timely feedback! I tried a simple
change to PredicateList.joinClauseTransitiveClosure to omit
NOT EXISTS join predicates from consideration for closure,
and the initial results were promising: no crash, and I think I
am getting the right results.

Attached is noClosureOfNotExists.diff, which is not for commit,
just to share the work I've done so far.

I'll continue testing this patch, to see if I can improve my
confidence that it delivers the right results and doesn't break other processing.

Bryan Pendleton added a comment - 21/Sep/07 04:38 AM
noClosureOfNotExists patch achieves a 100% pass rate on "ant junitreport".
So that's a good 1st-order check that it doesn't break anything obvious.

Bryan Pendleton added a comment - 21/Sep/07 05:15 PM
I no longer think this bug is related to DERBY-2916 or DERBY-2775, so I removed those links.

Bryan Pendleton added a comment - 23/Sep/07 03:57 PM
Attached is 'repro2.java'.

I spent a bunch of time trying to write a "from scratch" reproduction
program to demonstrate the bug, but with no success. In the
debugger, I can see that the compiler erroneously inserts the
transitive join closure predicate, but I haven't been able to force
the optimizer to end up with a query plan where the bad predicate matters.

I was thinking I might try to use the optimizer overrides feature to
force the join plan that I want, but I need to go study that feature first.

In the meantime, I attached the first version of the repro program
in case somebody else can see how to trick it into reproducing the bug.

Bryan Pendleton added a comment - 24/Sep/07 04:13 PM
I've succeeded in using the optimizer overrides to force nested loop joins,
and to force the use of the secondary index access paths, but I haven't
yet figured out how to force the join order that I want using the optimizer overrides.

It's tricky because the optimizer overrides are specified on the query
before it's been flattened, but I want to force a particular join order after flattening.

A B added a comment - 24/Sep/07 05:12 PM
> It's tricky because the optimizer overrides are specified on the query
> before it's been flattened, but I want to force a particular join order
> after flattening.

Hmm...not sure if that's possible with current override support or not?

One thing that influences join order is the relative size of the tables in question. I noticed that for the original database, the row counts are:

  DSR: 3240
  ST: 20
  dispatcher_locks: 0

while in the database for "repro2" they are:

  DSR: 1005
  ST: 10003
  dispatcher_locs: 0

Notice how DSR is 15x as large as ST in the first case, while ST is 10x as large as DSR in the second case. Perhaps that has something to do with why the optimizer chooses different join orders for the two repros...?

Bryan Pendleton added a comment - 25/Sep/07 12:56 AM
Thanks for the suggestion, Army. I've been tinkering with the table sizes in the
repro. However, it doesn't seem to affect the optimizer's behavior. I don't
think the optimizer is being told the correct information about the table sizes.

Perhaps there is something I need to do to communicate the table sizes to
the optimizer, other than simply inserting more rows into the table(s) prior to
running the test?


Bryan Pendleton added a comment - 25/Sep/07 04:02 AM
I fiddled around with repro2.java, and have attached an updated version.
Manipulating the table sizes and table row statistics in various ways
didn't succeed in convincing the optimizer to choose the join order I wanted.

I re-wrote the query a different way, and then I seemed to get the join
order I wanted, but my query rewrite resulted in re-casting the query
as 2 select nodes with 2 tables each, rather than a single node with 3
tables, and hence the compiler never peformed transitive closure of
the predicates, so that query didn't tickle the bug either.

Drat. I'll sleep on it, and maybe some new ideas will occur.

A B added a comment - 25/Sep/07 04:52 PM
I switched Bryan's repro to use primary and foreign key constraints instead of explicit indexes. That in turn meant that I had to update the INSERT statements to satisfy the foreign key constraints.

Once I did that, I ran the query in "repro2.java" with _no_ optimizer overrides and while it did not fail, it *did* show the desired join order, which is a step in the right direction.

Then I (accidentally) swapped the order of the FROM tables in the top-level query (i.e. I put "DSR" first, then "ST", instead of the other way around), and when I did that the query failed with an NPE. Odd?

The bad news is that the stack trace for *this* NPE is different from the one reported in this issue. Nonetheless, it follows the same pattern: i.e. it does not occur in 10.1.3.1, and if I back out the changes for DERBY-1357 it goes away. So I'm not sure if this is the same issue or not?

Attaching as "repro3.java"...

Bryan Pendleton added a comment - 26/Sep/07 02:12 AM
This new failure is very interesting! My initial reaction is that it is the same problem,
with different symptoms. With the new (swapped tables) query:
 - we still go through exists flattening
 - we still go through transitive closure, and add a (erroneous) equijoin predicate
 - just before the crash, we trip the mysterious "if" statement in getColumnFromRow()
   and return a null column value
 - applying the noClosureOfNotExists.diff patch removes the NPE and makes the
   query run correctly.
I'll keep stepping through the new NPE, as well as your modified query with the
previous query plan "shape", to continue to understand it better.

Bryan Pendleton added a comment - 28/Sep/07 04:03 PM
I've persuaded myself that Army's repro3 is reproducing the same basic problem,
and so I'm going to move forward with building a complete patch using repro3
as the core regression test.

I'm considering adding the new regression test to ExistsWithSetOpsTest.java,
which isn't a perfect location, but I was looking for an existing JUnit test that
covered much of the same areas of the code, and that one seemed not too
implausible. Is there a better location for the new test? Specifically, are there
existing JUnit tests that specifically cover subquery flattening?

I'm also intending to open a separate JIRA to continue the investigation of
the mystery "if" statement in BaseActivation.getColumnFromRow. It is my
belief that that "if" statement can (and should) be removed.

A B added a comment - 28/Sep/07 04:45 PM
Thank you for persistence with this issue, Bryan. I'm glad repro3 appears to have been helpful.

> Specifically, are there existing JUnit tests that specifically cover subquery flattening?

I don't see any existing JUnit tests, no. The "subqueryFlattening.sql" test seems like the most relevant place, but as that is not JUnit, I can understand why you would choose to avoid that.

If adding a new JUnit test is undesirable and you do decide to extend ExistsWithSetOpsTest, do you also plan to rename that test to make it more appropriate? It would be nice to choose a name that more generally describes the types of tests in that file. I think the common ground here is EXISTS with subqueries...though not necessarily "flattening" since the queries currently in ExistsWithSetOpsTest do not perform flattening per se, they just "transform" the EXISTS subquery into something else...

Bryan Pendleton added a comment - 29/Sep/07 03:33 PM
Attached is 'patchWithRenamedTest.diff', a patch proposal.

The 'svn stat' for this patch is:
M java/engine/org/apache/derby/impl/sql/compile/FromList.java
M java/engine/org/apache/derby/impl/sql/compile/PredicateList.java
M java/testing/org/apache/derbyTesting/functionTests/tests/lang/_Suite.java
A java/testing/org/apache/derbyTesting/functionTests/tests/lang/ExistsWithSubqueriesTest.java
D java/testing/org/apache/derbyTesting/functionTests/tests/lang/ExistsWithSetOpsTest.java

The patch contains a regression test for this issue, modeled on
the repro3.java test case and inserted into the (renamed)
ExistsWithSubqueriesTest.java.

The core of the patch is the same, which is to notice that a join
predicate mentions a table which has the NOT EXISTS usage,
and to exclude that predicate from transitive join consideration.

I think this patch is worthy of review; please have a look when you have a chance.

A B added a comment - 01/Oct/07 04:02 PM
Thank you for the patch, Bryan. The diff is small and contained and looks functionally correct to me. I ran the new regression test without your patch and it failed, and with your patch it succeeds.

I do think it might be nice to add explanatory comments to the relevant engine code, so that future readers know *why* it is that we are checking for a "NOT EXISTS" query. If you'd rather not add comments to the transitive closure code itself, you could perhaps add comments to the FromList.tableNumberIsNotExists() method explaining the scenario in which it is used. I think there are ample comments in the new ExistsWithSubqueriesTest file, so maybe you could just copy the important bits into the engine code comments, as well?

Aside from that, and assuming derbyall and suites.All ran cleanly, I'm +1 to the commit of this patch. Many thanks for picking this one up, and for putting so much time into tracking down the cause/solution!

Bryan Pendleton added a comment - 02/Oct/07 02:59 AM
I agree, comments are definitely important for this change.
"I meant to add them before but forgot" :) Thanks for catching this!

Attached is a revised patch; the only change is improved comments
in FromList and PredicateList.

I'm intending to run a complete set of tests, verify that javadoc
builds cleanly, and then submit this patch. Continued feedback
is very welcome of course.

Bryan Pendleton added a comment - 02/Oct/07 02:55 PM
derbyall and junit suites.All were 100% clean.

Bryan Pendleton added a comment - 04/Oct/07 05:17 AM
generated javadoc looks ok, now that I have my jdk 1.5 build envt set up.

Bryan Pendleton added a comment - 04/Oct/07 03:14 PM
Committed the patchWithImprovedComments to the trunk as revision 581943.

I intend to merge this patch back to the 10.3 line in the next week.

A B added a comment - 04/Oct/07 05:34 PM
> Committed the patchWithImprovedComments to the trunk as revision 581943.

Thanks Bryan!

Bryan Pendleton added a comment - 06/Oct/07 02:58 AM
I merged the trunk patch back to the 10.3 branch without issues, and my build and test runs were clean, so I committed the patch to the 10.3 branch as revision 582439.