Issue Details (XML | Word | Printable)

Key: DERBY-3094
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Critical Critical
Assignee: Bryan Pendleton
Reporter: Peter Balon
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Grouping of expressions causes NullPointerException

Created: 02/Oct/07 08:12 AM   Updated: 02/Mar/08 04:16 PM
Return to search
Component/s: SQL
Affects Version/s: 10.3.1.4
Fix Version/s: 10.3.3.0, 10.4.1.3

Time Tracking:
Not Specified

File Attachments:
  Size
HTML File Licensed for inclusion in ASF works d3094_followup.htm 2008-02-22 06:33 PM A B 43 kB
File Licensed for inclusion in ASF works modifyVisitorDoesntWork.diff 2008-02-12 03:33 AM Bryan Pendleton 1 kB
HTML File Licensed for inclusion in ASF works notes.html 2008-02-22 03:39 PM Bryan Pendleton 20 kB
File Licensed for inclusion in ASF works sortExpressions.diff 2008-02-27 03:33 AM Bryan Pendleton 10 kB
File Licensed for inclusion in ASF works sortExpressionsFinal.diff 2008-02-29 04:22 AM Bryan Pendleton 11 kB
File Licensed for inclusion in ASF works twoPass.diff 2008-02-18 10:59 PM Bryan Pendleton 3 kB
File Licensed for inclusion in ASF works TwoPassForHavingClauseAlso.diff 2008-02-20 08:19 PM Bryan Pendleton 7 kB
File Licensed for inclusion in ASF works TwoPassVisitor.diff 2008-02-19 06:31 PM Bryan Pendleton 3 kB
File Licensed for inclusion in ASF works TwoPassVisitorWithCommentsAndTests.diff 2008-02-20 05:18 AM Bryan Pendleton 7 kB
Environment: Windows XP, Eclipse 3.2.2, java 1.5.0.11
Issue Links:
Blocker
 
Reference

Resolution Date: 02/Mar/08 04:16 PM


 Description  « Hide
Following steps to reproduce the bug:

create table xx (a double, b double);
insert into xx values (2, 3);

select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000);

Starting run

select a, a*(b/100.000000), count(*) from xx
group by a, a*(b/100.000000)

Run successful

SQL State = 38000 SQL Code = 20000 SQL Message = Bei der Auswertung eines Ausdrucks wurde die Ausnahme 'java.lang.NullPointerException' ausgelöst. Exception message = java.sql.SQLException: Bei der Auswertung eines Ausdrucks wurde die Ausnahme 'java.lang.NullPointerException' ausgelöst.

Work around:

select a, a*(b/100.000000), count(*) from xx group by a, b, a*(b/100.000000)

Stack trace from application:

java.sql.SQLException: Bei der Auswertung eines Ausdrucks wurde die Ausnahme 'java.lang.NullPointerException' ausgelöst.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedResultSet.closeOnTransactionError(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedResultSet.next(Unknown Source)
at de.arcor.billy.report.views.designer.ReportViewerView.setInput(ReportViewerView.java:255)
at de.arcor.billy.report.views.designer.ReportViewerView.createPartControl(ReportViewerView.java:113)
at org.eclipse.ui.internal.ViewReference.createPartHelper(ViewReference.java:332)
at org.eclipse.ui.internal.ViewReference.createPart(ViewReference.java:197)
at org.eclipse.ui.internal.WorkbenchPartReference.getPart(WorkbenchPartReference.java:566)
at org.eclipse.ui.internal.Perspective.showView(Perspective.java:1675)
at org.eclipse.ui.internal.WorkbenchPage.busyShowView(WorkbenchPage.java:987)
at org.eclipse.ui.internal.WorkbenchPage.access$13(WorkbenchPage.java:968)
at org.eclipse.ui.internal.WorkbenchPage$13.run(WorkbenchPage.java:3514)
at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:67)
at org.eclipse.ui.internal.WorkbenchPage.showView(WorkbenchPage.java:3511)
at de.arcor.billy.report.data.ReportDataAdvisor$2.perspectiveChanged(ReportDataAdvisor.java:268)
at de.arcor.billy.system.actions.AbstractOpenPerspectiveActionDelegate$1.run(AbstractOpenPerspectiveActionDelegate.java:66)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:123)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:3325)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:2971)
at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:1930)
at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:1894)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:422)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
at de.arcor.billy.product.Billy.run(Billy.java:15)
at org.eclipse.core.internal.runtime.PlatformActivator$1.run(PlatformActivator.java:78)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:92)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:68)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:400)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.core.launcher.Main.invokeFramework(Main.java:336)
at org.eclipse.core.launcher.Main.basicRun(Main.java:280)
at org.eclipse.core.launcher.Main.run(Main.java:977)
at org.eclipse.core.launcher.Main.main(Main.java:952)
Caused by: java.sql.SQLException: Java-Ausnahme: ': java.lang.NullPointerException'.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.javaException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
... 42 more
Caused by: java.lang.NullPointerException
at org.apache.derby.exe.ac9b638174x0115x5f93x1332x0000046fd8a01b.e10(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.derby.impl.services.reflect.ReflectMethod.invoke(Unknown Source)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.doProjection(Unknown Source)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown Source)
at org.apache.derby.impl.sql.execute.ScrollInsensitiveResultSet.getNextRowFromSource(Unknown Source)
at org.apache.derby.impl.sql.execute.ScrollInsensitiveResultSet.getNextRowCore(Unknown Source)
at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(Unknown Source)
... 37 more

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Bryan Pendleton added a comment - 02/Oct/07 02:11 PM
Possibly the same as DERBY-2352, which may have a different error message
because it's in a SANE build and this one is in an INSANE build.

Thomas Nielsen added a comment - 20/Nov/07 11:52 AM
I reproduced this in my sandbox even with the ready-to-commit patch for DERBY-2352 applied, so it's not a dup of 2352.

Good news is it throws the NPE with both sane and insane builds, and here's the callstack with line numbers
---
2007-11-20 11:43:05.545 GMT Thread[main,5,main] (XID = 173), (SESSIONID = 0), (DATABASE = d3094), (DRDAID = null), Failed Statement is: select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000)
ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression.
        at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:294)
        at org.apache.derby.iapi.error.StandardException.unexpectedUserException(StandardException.java:554)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:164)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.doProjection(ProjectRestrictResultSet.java:488)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:291)
        at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:463)
        at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:424)
        at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:368)
        at org.apache.derby.tools.JDBCDisplayUtil.indent_DisplayResults(JDBCDisplayUtil.java:382)
        at org.apache.derby.tools.JDBCDisplayUtil.indent_DisplayResults(JDBCDisplayUtil.java:338)
        at org.apache.derby.tools.JDBCDisplayUtil.indent_DisplayResults(JDBCDisplayUtil.java:241)
        at org.apache.derby.tools.JDBCDisplayUtil.DisplayResults(JDBCDisplayUtil.java:229)
        at org.apache.derby.impl.tools.ij.utilMain.displayResult(utilMain.java:435)
        at org.apache.derby.impl.tools.ij.utilMain.doCatch(utilMain.java:509)
        at org.apache.derby.impl.tools.ij.utilMain.runScriptGuts(utilMain.java:350)
        at org.apache.derby.impl.tools.ij.utilMain.go(utilMain.java:248)
        at org.apache.derby.impl.tools.ij.Main.go(Main.java:215)
        at org.apache.derby.impl.tools.ij.Main.mainCore(Main.java:181)
        at org.apache.derby.impl.tools.ij.Main.main(Main.java:73)
        at org.apache.derby.tools.ij.main(ij.java:59)
Caused by: java.lang.NullPointerException
        at org.apache.derby.exe.ac601a400fx0116x5cddx8367x000000107e400.e3(Unknown Source)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:145)
        ... 17 more
============= begin nested exception, level (1) ===========
java.lang.NullPointerException
        at org.apache.derby.exe.ac601a400fx0116x5cddx8367x000000107e400.e3(Unknown Source)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:145)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.doProjection(ProjectRestrictResultSet.java:488)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:291)
        at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:463)
        at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:424)
        at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:368)
        at org.apache.derby.tools.JDBCDisplayUtil.indent_DisplayResults(JDBCDisplayUtil.java:382)
        at org.apache.derby.tools.JDBCDisplayUtil.indent_DisplayResults(JDBCDisplayUtil.java:338)
        at org.apache.derby.tools.JDBCDisplayUtil.indent_DisplayResults(JDBCDisplayUtil.java:241)
        at org.apache.derby.tools.JDBCDisplayUtil.DisplayResults(JDBCDisplayUtil.java:229)
        at org.apache.derby.impl.tools.ij.utilMain.displayResult(utilMain.java:435)
        at org.apache.derby.impl.tools.ij.utilMain.doCatch(utilMain.java:509)
        at org.apache.derby.impl.tools.ij.utilMain.runScriptGuts(utilMain.java:350)
        at org.apache.derby.impl.tools.ij.utilMain.go(utilMain.java:248)
        at org.apache.derby.impl.tools.ij.Main.go(Main.java:215)
        at org.apache.derby.impl.tools.ij.Main.mainCore(Main.java:181)
        at org.apache.derby.impl.tools.ij.Main.main(Main.java:73)
        at org.apache.derby.tools.ij.main(ij.java:59)
--

ProjectRestrictResultSet seems to consider the obvious case where projection == null, but it still fails
   // Use reflection to do as much of projection as required
   if (projection != null)
   {
      result = (ExecRow) projection.invoke(activation); <== ln 488
   }
   else
   {
      result = mappedResultRow;
   }

Thomas Nielsen added a comment - 20/Nov/07 12:19 PM
Dropping the count(*) in the debugger actually works:

ij> select a, a*(b/100.000000) from xx group by a, a*(b/100.000000);
A |2
---------------------------------------------
2.0 |0.06

1 row selected

I traced through the ProjectRestrictResultSet code and its source GroupedAggregateResultSet, but could not immediately see where the NPE originated.

Thomas Nielsen added a comment - 20/Nov/07 12:55 PM
I wasn't really going to hunt this one down, but tracing a little further I see BaseActivation.getColumnFromRow() returns null @ 1359 with a comment that "this happens".

 if( row[rsNumber] == null)
        {
            /* This actually happens. NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values
             * into a cache. This fails in some deeply nested joins. See Beetle 4736 and 4880.
             */
            return null; <==ln 1359
        }

This in turn makes BaseActivation.getDataValueFactory() return null, and in the end DataValueFactoryImpl.getDecimalDataValue(NumberValue) returning null. And the NPE surfaces.



Bryan Pendleton added a comment - 20/Nov/07 05:42 PM
Hi Thomas, thanks for investigating this. Your analysis is very interesting. You might
want to have a look at DERBY-3097, which refers to precisely the "if" statement that
you are studying.

Thomas Nielsen added a comment - 21/Nov/07 08:10 AM
The patch for DERBY-3097 removes the if() handling row[rsNumber] == null.
Not surprisingly we run into another NPE in BaseActivation on the return line following the original if() after patching.

   return row[rsNumber].getColumn(colId);

as row[rsNumber] still is null in our case.

It would initially seem both DERBY-3097 and this jira have the same, or a similar, root cause? Could it be the proposed patch for DERBY-3097 only remove the symptoms seen with the if(), and not the actual underlying problem?

If the patch for DERBY-3097 is applied (and fixes other problems), we will still see this one.

Thomas Nielsen added a comment - 21/Nov/07 08:21 AM
It also seems that this bug only manifests itself if there is an aggregation function *and* a decimal operation.

No aggregation works:
ij> select a, a*(b/100.000000) from xx group by a, a*(b/100.000000);
A |2
---------------------------------------------
2.0 |0.06

1 row selected

No decimal operation works:
ij> select a, b, count(*) from xx group by a, b;
A |B |3
---------------------------------------------------------
2.0 |3.0 |1

1 row selected

Seemingly independent of type of decimal operation:
ij> select a, a/b from xx group by a, a/b;
A |2
---------------------------------------------
2.0 |0.6666666666666666

1 row selected

But add the aggregate *and* the decimal operation, and we see the NPE.
ij> select a, a*b, count(*) from xx group by a, a*b;
A |2 |3
---------------------------------------------------------
ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression.
ERROR XJ001: Java exception: ': java.lang.NullPointerException'.
ij>

Bryan Pendleton added a comment - 21/Nov/07 02:53 PM
Hi Thomas, thanks for studying this problem. I think you should mark the issue as assigned
to you to indicate that you are actively studying it.

> row[rsNumber] still is null in our case

Sorry, Thomas, I think I wasn't very clear in my other comment.
DERBY-3097 is definitely not a fix for this problem;
it's related only in the sense that it discusses the getColumnFromRow() code.

My belief is that the "if" statement in getColumnFromRow is unnecessary and incorrect,
because there should be no situations in which Derby ever needs to fetch a column's
value from a row prior to the initialization/population of that row with valid values.

The various cases that I looked at before filing DERBY-3097 all involved situations in
which, due to other problems, Derby was accessing the internal row/column data structures
in an inconsistent manner.

For example, with this particular problem, I think that you could investigate how the
ResultColumn instances for the aggregate function and the arithmetic expression column
are built up, and how their row and column number values are manipulated during bind
processing.

Here are some things to pay particular attention to:
 - when the invalid call to getColumnFromRow is made, which actual table and column
   is being dereferenced?
 - what does the generated "e3" method for your case look like? You can view this code
   using the techniques described at http://wiki.apache.org/db-derby/DumpClassFile


Bryan Pendleton added a comment - 09/Feb/08 04:55 PM
Generated method "e3" is the "projection" method for the generated ProjectRestrictResultSet.

Here is a snippet of the generated code:

    private ResultSet fillResultSet()
        throws StandardException
    {
        getLanguageConnectionContext().getAuthorizer().authorize(this, 1);
        return getResultSetFactory().getScrollInsensitiveResultSet(
            getResultSetFactory().getProjectRestrictResultSet(
                getResultSetFactory().getGroupedAggregateResultSet(
                    getResultSetFactory().getProjectRestrictResultSet(
                        getResultSetFactory().getBulkTableScanResultSet(
                            this, 1024, 5, getMethod("e0"), 2, null, -1,
                            null, -1, true, e1, "XX", null, null, false,
                            false, -1, -1, 7, false, 0, 16, false,
                            6D, 100.404D),
                        null, getMethod("e1"), 3, null, 4, false, true,
                        6D, 100.404D),
                    false, 3, 2, getMethod("e2"), 224, 1, 6D, 100.404D),
                null, getMethod("e3"), 4, null, 1, false, true, 1D, 100.404D),
            this, 0, 3, getScrollable(), 1D, 100.404D);
    }

Note that getMethod("e3") is the 3rd argument to getProjectRestrictResultset()

Bryan Pendleton added a comment - 09/Feb/08 05:56 PM
Note that there are *two* ProjectRestrictResultSet instances in the tree:
 - the inner PRN operates on the bulk table scan, and projects columns
   "a" and "b" from table "xx" out of the table scan using projection "e1".
 - then the GroupedAggregateResultSet processes the rows, and
   computes the COUNT(*) into the 3rd column of the template row
 - then the outer PRN takes the grouped results and projects columns
   "a", "(expression)", and "COUNT(*)" out of the grouped result row.

At the instant of the crash, the outer PRN is performing that final
projection, and it is looking for a row from result set 2.

But this is wrong; it ought to be looking for a row from result set 1.

In the debugger, I can see that result set 1's current row is correct:
it has the right columns, with the right values.

I can also see that, *later* in projection method "e3", there is another
call to getColumnFromRow() which passes result set number 1.

So that seems like additional evidence that the core of the problem is
that projection method "e3" generated invalid code which is
trying to access result set 2 when it should have been accessing
result set 1.

Since the column in question is column 2, which is the expression
a*(b/100.000000), I'm wondering whether the inner PRN and the
outer PRN are somehow accidentally sharing the expression
information for this result column expression, when one of them should
have renumbered the result set for the expression to point to the
different result set.

This feels really familiar, like we just worked on some problems with
result set number handling and result column expressions recently;
I'll spend a bit of time searching JIRA to see if I can refresh that memory.

Bryan Pendleton added a comment - 10/Feb/08 12:49 AM
At generate time, as the code analyzes the expression a*(b/100.000000),
for the outer PRN, the sourceColumn's resultSetNumber for the
ColumnReference "a" is '1', while for "b" it is '2'.

So the overall expression is inconsistent; some of the column
reference point to the correct result set, others do not.

This makes me wonder whether there is some code which
traverses ResultColumn expression trees, updating the resultSetNumber,
that mistakenly visits the leftOperand of the expression tree but does
not visit the rightOperand.

"a" is seen while visiting the leftOperand, but to visit "b" one must visit
the rightOperand.

I thought this was a theory worth recording.

Bryan Pendleton added a comment - 10/Feb/08 02:02 AM
As Thomas noted, it's interesting to see the differences between what provokes
this problem, and what does not. For example, this query works fine:

   select a*(b/1.0), count(*) from xx group by a*(b/1.0);

but this one gets the NPE:

   select a*(b/1.0), count(*) from xx group by a,a*(b/1.0);

Bryan Pendleton added a comment - 10/Feb/08 04:59 PM
I've pretty much convinced myself that the problem happens during GroupByNode.init(),
where we take the original PRNode for the core SELECT statement, and re-write
that section of the tree to insert a new GroupByNode above that PRNode, and a new
PRNode above the GroupByNode, and manipulate the various ResultColumn lists
in various ways.

But I'm not sure precisely what's going wrong:
 - is ResultColumnList.cloneMe failing to process sub-expression column references?
 - are the bindResultColumnToExpression calls in addUnAggColumns supposed
   to examine the sub-expressions?
 - or perhaps somewhere else entirely in GroupByNode.init?

Bryan Pendleton added a comment - 10/Feb/08 08:19 PM
OK, I think I'm getting close. Note the following:
 - select a*(b/1.0), count(*) from xx group by a*(b/1.0); -- WORKS
 - select a*(b/1.0), count(*) from xx group by a,a*(b/1.0); -- FAILS
 - select a*(b/1.0), count(*) from xx group by a,b,a*(b/1.0); -- WORKS

I think the problem is related to these lines in GroupByNode.java
(near line 388 or so):

SubstituteExpressionVisitor se =
new SubstituteExpressionVisitor(
gbc.getColumnExpression(),
vc,
AggregateNode.class);
parent.getResultColumns().accept(se);

This code is processing the ResultColumn instances in the
"parent" (outer) PRNode to update their ColumnReference values.

But (I think) this code only updates the ColumnReference values
for those columns which are directly referenced as column references
in the grouping list. That's why the references to column "a" in the
expression are correctly fixed up, but ther references to column "b"
are not, because "b" wasn't directly in the group by list.

If true, this suggests a possible workaround: include all columns
that are mentioned by any expressions in the group by list
redundantly, as simple column references.

I'll continue to study GroupByNode.addUnAggColumns, to see if I
can construct a possible alternate way to process the columns.

Bryan Pendleton added a comment - 11/Feb/08 03:34 PM
Reversing the order of the items in the group by list also avoids the NPE:

 - select a*(b/1.0), count(*) from xx group by a*(b/1.0), a; -- WORKS

This possibly provides a simpler workaround.

Bryan Pendleton added a comment - 12/Feb/08 03:33 AM
My first attempt to fix this was a failure.

I made the repro script pass, but broke various other
GROUP BY constructions.

I'm attaching the diff anyway (modifyVisitorDoesntWork.diff),
because I feel like it's at least going in the right direction.

I'm convinced that the problem involves the various
manipulations of the result column lists performed by
GroupByNode.init, but I'm not yet sure what change to make.

I think it's instructive to note that, with this patch applied, the
following statement gets "b" and "a" backward:
ij> select b, a, count(*) from xx group by b, a;
B |A |3
---------------------------------------------------------
2.0 |3.0 |1

The proper answer, of course, would be (3.0, 2.0, 1)

Bryan Pendleton added a comment - 17/Feb/08 08:07 PM
I've been contemplating the following short script, trying to figure out what the
correct behavior of Derby ought to be. Note that I haven't yet thrown in any ORDER BY
or HAVING examples; I'm just trying to understand the interaction between the
select list and the group by list.

Some of these queries work, some are rejected, and one is the DERBY-3094 query.
Some of the queries which are rejected seem like they ought to work, and some of
the queries which work seem like they ought to be rejected.

create table test (c1 int, c2 int, c3 int, c4 int);
insert into test values (1, 10, 100, 1000);
insert into test values (1, 11, 100, 1001);
insert into test values (2, 10, 100, 1000);
insert into test values (2, 11, 101, 1001);
insert into test values (2, 11, 101, 1000);
select * from test;
select c1+c2, sum(c3) from test group by c1,c2;
select c1+c2, sum(c3) from test group by c1+c2;
select c1+c2, sum(c3) from test group by c1+c2,c1;
select c1,c2, sum(c3) from test group by c1,c2,c1+c2;
-- Next query is refused with 42Y30 (select list contains invalid expression):
select c1+c2, sum(c3) from test group by c1;
-- Next query gets a runtime NPE (DERBY-3094):
select c1+c2, sum(c3) from test group by c1,c1+c2;
-- Next query is refused with 42Y30 (select list contains invalid expression):
select c1,c2, sum(c3) from test group by c1+c2,c1;
-- Next query is refused with 42Y30 (select list contains invalid expression):
select c1+c2, sum(c3) from test group by 1;
-- Next query is refused with 42X04 (Column 'EXPR' not in any table)
select c1+c2 as expr, sum(c3) from test group by expr;
-- Next query is refused with 42X04 (Column 'C1A' not in any table)
select c1 as c1a, c2, sum(c3) from test group by c1a,c2;
select c1 as c2, sum(c3) from test group by c1,c2;
-- Next query is refused with 42Y30 (select list contains invalid expression):
select c1 as c2, sum(c3) from test group by c2;
select c1 as c2, sum(c3) from test group by c1;

The challenge is
a) to ensure that we understand what the intended behavior of Derby should be
b) to construct a modification to the current head of trunk that resolves the NPE
but still preserves all the other desired behaviors.

I'm hoping that collecting various test cases like this will help me understand this.

Bryan Pendleton added a comment - 18/Feb/08 10:37 PM
GroupByExpressionTest.java contains the following statement:

   select (c1+c2)+1, sum(c3) from test group by c1+c2

With the current Derby code, this statement is accepted as legal and produces results.

However, I'm not sure it ought to be accepted, as the select list contains
a non-aggregate expression which does not appear in the group by list.

Bryan Pendleton added a comment - 18/Feb/08 10:45 PM
GroupByExpressionTest.java also contains this very intriguing statement:

  select (c1+c2), sum(c3)+(c1+c2) from test group by c1+c2

With the current Derby code, this statement is accepted as legal and produces results.

Bryan Pendleton added a comment - 18/Feb/08 10:59 PM
Attached is 'twoPass.diff', a patch proposal.

The idea of this patch is that, during the GroupBy tree rewriting,
we replace ResultColumn expressions with VirtualColumnNode
references in two passes:
 - we first handle all expression replacements
 - then once all the expressions have been replaced,
   we replace all column references.

This patch fixes the repro script, and makes *almost all*
of the regression tests pass.

However, two of the test cases in GroupByExpressionTest.java
fail; these are the two test cases that I listed in my previous
two comments.

It's not clear to me whether these two test cases are valid or not,
they are certainly very unusual expressions and I think an argument
can be made that they should be rejected by the compiler as invalid.

Please consider this patch as "for discussion", intended to
record a possible alternate implementation, and to stimulate
some feedback regarding what the correct rules ought to be
for correlating elements in the GROUP BY list against elements
in the SELECT list.

A B added a comment - 19/Feb/08 12:01 AM
> It's not clear to me whether [two of the test cases in
> GroupByExpressionTest.java] are valid or not,

I have not yet had a chance to look at the patch, but I ran those two queries--along with all of the queries posted in your Feb 17 comment--against DB2 v8. From what I can tell, the (clean) Derby trunk and DB2 v8 agree in all of these cases--i.e. they either both throw an error or they return the same results (with the notable exception of the NPE that this issue is trying to fix). Not sure if that's helpful or not...

Bryan Pendleton added a comment - 19/Feb/08 01:28 AM
Here's an interesting essay on the sort of subtle problems that can arise
from not being crystal-clear about the semantics of column references and
expressions in GROUP BY clauses:
http://odetocode.com/Blogs/scott/archive/2005/04/17/1214.aspx

Bryan Pendleton added a comment - 19/Feb/08 01:53 AM
The Postgres docs (http://www.postgresql.org/docs/8.0/interactive/sql-select.html#SQL-GROUPBY)
have this to say about the column-name-versus-column-alias subtlety referred
to in the previous comment:

       In case of ambiguity, a GROUP BY name will be interpreted as an input-column name
       rather than an output column name.

It seems to me preferable to reject the ambiguous query rather than producing possibly
surprising results, but I can see why ambiguity-resolving behaviors are also worthwhile.

Bryan Pendleton added a comment - 19/Feb/08 06:31 PM
Attached is 'TwoPassVisitor.diff', the first working code proposal.
With this patch, the repro script passes, and all regression tests pass.

I still need to:
 - add a bunch of new test cases
 - add comments to the code
 - add a writeup explaining the ideas behind this patch

But at least I now have code that appears to add the desired
new behavior while not breaking any (known) existing behaviors.

Bryan Pendleton added a comment - 20/Feb/08 03:59 AM
Thanks Army for checking the queries against DB2, that's quite helpful.

Bryan Pendleton added a comment - 20/Feb/08 04:23 AM
It seems like there might be a similar problem with the HAVING clause. Consider:

ij> select c1+c2, sum(c3) from test group by c1+c2;
1 |2
-----------------------
11 |100
12 |200
13 |202

3 rows selected
ij> select c1+c2, sum(c3) from test group by c1,c1+c2;
1 |2
-----------------------
11 |100
12 |100
12 |100
13 |202

4 rows selected
ij> select c1+c2, sum(c3) from test group by c1+c2 having c1+c2 > 11;
1 |2
-----------------------
12 |200
13 |202

2 rows selected
ij> select c1+c2, sum(c3) from test group by c1, c1+c2 having c1+c2 > 11;
ERROR 42X24: Column C2 is referenced in the HAVING clause but is not in the GROUP BY list.

I think that the proper results for that last query should have been:
1 |2
-----------------------
12 |100
12 |100
13 |202

3 rows selected

We could either investigate HAVING behaviors as part of this JIRA, or we could open a new one.

Bryan Pendleton added a comment - 20/Feb/08 05:18 AM
Attached TwoPassVisitorWithCommentsAndTests.diff,
which adds a brief comment to the GroupByNode code
explaining the reasoning behind the two pass ResultColumn
rewriting algorithm, and adds a number of new tests to
GroupByExpressionTest.java.

A B added a comment - 20/Feb/08 06:13 PM
Thank you for the updated patch, Bryan. Is there a more detailed writeup forthcoming for this change, or are the code comments the extent of it?

The reason I ask is that, after reading the code comments and the comments in this issue, I can't quite see the correlation between the patch and the NPE that it fixes. That is, it's clear that the patch fixes the issue, but I think I'm missing the details on _why_. The code comments say:

 // then we don't want the replacement of the
 // simple column reference C1 to affect the
 // compound expression C1 * (C2 / 100).

but it's not clear to me how replacement of the simple column reference can negatively affect the compound expressions. Is it an issue of VCN's pointing to the wrong place? If so, any idea as to why that happens? Similarly, in an earlier comment you noted:

> I think it's instructive to note that, with this patch applied, the
> following statement gets "b" and "a" backward:
> ij> select b, a, count(*) from xx group by b, a;
> B |A |3
> ---------------------------------------------------------
> 2.0 |3.0 |1

Is it possible to say what it is about processing simple column references _after_ compound expressions that fixes this problem?

Apologies if I'm missing something obvious. I'm not by any means opposed to the patch, I'm just hoping to gain an understanding about why it resolves the discussion/issues raised thus far for this issue...

And thanks for your continued diligence with this one, Bryan.

Bryan Pendleton added a comment - 20/Feb/08 08:19 PM
Attached is TwoPassForHavingClauseAlso.diff, which
addresses the problem I noted with the HAVING clause
in an earlier comment.

Army, thanks for having a look at the patch. I am still
intending to complete a more thorough description
of the "why" behind the changes that should hopefully
make it more clear.

But for now, two quick responses:
1) substituting the column references with VCN's during
group-by rewriting changes which result set is pointed to,
but when we process the C1 portion of the compound
expression but not the compound expression as a whole,
we never correct the C2 column reference (since C2 by
itself is not a grouping column), leaving the compound
expression with C1 pointing to the correct result set, but C2
pointing to a non-existing result set, causing an NPE at runtime.
2) the earlier comment about getting "b" and "a" backward
was simply a red herring. That particular patch attempt
(modifyVisitorDoesntWork.diff) was simply flat-out wrong. :)

I'll try to get a more thorough writeup completed soon.

Bryan Pendleton added a comment - 22/Feb/08 03:39 PM
Attached an attempt at a descriptive writeup. Please let me know if it
makes the patch more comprehensible.

A B added a comment - 22/Feb/08 06:33 PM
Thanks a ton for the detailed write-up, Bryan, that definitely makes it clear as to what is going on with this issue.

In reading through the document I gained an understanding of what the problem is, and then I began wondering if this is only an issue when a ColumnReference appears in a compound expression--or is it also an issue when one expression appears within another expression. I came up with the following query:

  select a*((a+b)/1.0), count(*) from xx group by a+b, a*((a+b)/1.0)

which fails with an NPE even after TwoPassForHavingClauseAlso.diff is applied. I'm attaching d3094_followup.htm, which (in theory) walks through this query in the same way that your "notes.html" walked through another example. I did this mostly for my own understanding, but thought it might be useful to post it as a writeup to this issue.

To quote the conclusion:

"[T]he TwoPassVisitorWithCommentsAndTests.diff solves the specific query that was reported (and described in "notes.html"), and so it makes things better, which is great. Thus I think it would be fine to commit the patch as it is and either deal with queries like the one above in a follow-up patch, or file a separate Jira altogether."

Bryan Pendleton added a comment - 23/Feb/08 04:10 PM
Thanks Army! That's a keen insight about the underlying issue, and
a great writeup and test case.

Here's yet another variant that's hard to explain: the first statement
works, the second is refused with error 42Y30:

select (a+b)+c, count(*) from t group by c, a+b;
select a+(b+c), count(*) from t group by c, a+b;

I'm not quite sure how to push this problem further. The only idea
I had was as follows: perhaps we could automatically and silently
pull up *every* column reference that is present in any GROUP BY
expression, and include that column as an extra column reference
at the end of the group by list. So we could rewrite the above queries,
for example, as:

select (a+b)+c, count(*) from t group by c, a+b, a, b;
select a+(b+c), count(*) from t group by c, a+b, a, b;

That might mean we'd always have "enough" column references
so that we could perform substitution on all the result columns
in the select list successfully.

Yet I'm nervous about this strategy; we wouldn't want these pulled
up columns to be used for the actual grouping (because it could
subdivide the actual groups further than the user intended), so we'd
have to mark these columns as "special columns to be used for
resolving column references, but not to be used for grouping."

I'll continue to think about it, but I'm increasingly tempted to
take your suggestion to be satisfied with the patch as is, and
file some of these other issues separately to be studied more
in the future. This is similar to the way that we marked DERBY-1624
as resolved, while shifting the unresolved portion out to DERBY-2457.

Bryan Pendleton added a comment - 24/Feb/08 12:30 AM
Here's one other idea that might work:

 - Define the "level" of a GROUP BY expression to be
   the number of column references in that expression.

   For example, in this clause:

        GROUP BY a, b, a+b, a+(b+c), a*(b/100.0), a*((a+b)/1.0)

   the level of the various elements in the list are:

   - a: 1
   - b: 1
   - a+b: 2
   - a+(b+c): 3
   - a*(b/100.0): 2
   - a*((a+b)/1.0): 3

 - Then, when performing VCN substitution into the
    parent PRNode's ResultColumnList entries from
   the GroupByNode's entries, process the entires
   in descending order by level, and left-to-right within
   the same level. So the entries from the GROUP BY above
   would be processed in the order:

   a+(b+c), a*((a+b)/1.0), a+b, a*(b/100.0), a, b

Does this algorithm seem like it would be worth investigating?

A B added a comment - 25/Feb/08 05:44 PM
> Here's yet another variant that's hard to explain: the first statement
> works, the second is refused with error 42Y30:
>
> select (a+b)+c, count(*) from t group by c, a+b;
> select a+(b+c), count(*) from t group by c, a+b;

The tree for "(a+b)+c" contains the subtree "a+b" as the left-hand side of the top-most binary operator and "c" as the right-hand side. Since each of those matches a stand-alone expression in the GROUP BY, the equivalence checking for GROUP BY processing passes. But the tree for "a+(b+c)" has "a" on one side and "b+c" on the other, neither of which shows up as a stand-alone GROUP BY expression, so the second query is rejected. I agree it's a tad non-inuititive, but it appears to be "explainable", for what that's worth...

> Here's one other idea that might work:

[ snip ]

> Does this algorithm seem like it would be worth investigating?

Yes, I definitely think this is worth investigating. From what I can tell it solves the problem of sub-expression substitution, and it seems a tad safer than pulling unneeded columns into the GROUP BY expression.

There's a CollectNodesVisitor class which finds all instances of a specified class within the subtree beneath a given ValueNode, so you might be able to use that to count the number of ColumnReferences in the expression. Though ColumnReferences tend to appear in lots of places, so it'll be interesting to see if a simple CollectNodesVisitor is good enough--or will it end up counting ColumnReferences that shouldn't be counted...?

In any event, +1 to further pursuit of the suggested approach.

Bryan Pendleton added a comment - 27/Feb/08 03:33 AM
Attached is sortExpressions.diff, which proposes to sort
the expressions based on the number of ColumnReferences
that each contains, using the technique suggested by Army
of counting those refs using the CollectNodesVisitor.

It seems to handle all the existing test cases, as well
as a few new ones that I added to GroupByExpressionTest.java
as part of this patch.

Please have a look and tell me what you think.

A B added a comment - 27/Feb/08 08:05 PM
sortExpressions.diff looks good to me. Thanks for continuing with that approach, Bryan. Assuming derbyall and suites.All run cleanly, I say +1 to commit since it fixes all of the known queries and seems algorithmically correct as far as I can tell. If additional issues/queries are discovered later, I think they can be addressed as they are found.

One nit: it seems the creation of a new ArrayList for havingRefsToSubstitute could be conditional upon the presence of a non-null havingClause, and that a single ExpressionSorter() could be used instead of two. But in both cases the code may be easier to read as it is, so I'm not sure it's really worth changing...

Bryan Pendleton added a comment - 28/Feb/08 02:50 PM
Thanks Army, those are good suggestions.

I'm intending to move ahead with final testing, cleanup, and commit.

Bryan Pendleton added a comment - 29/Feb/08 04:22 AM
Attached is sortExpressionsFinal.diff, which incorporates
the last few suggestions from reviews, as well as some
additional comments.

derbyall and junit-all were cleaning, so I'm proceeding
to commit this patch.

Bryan Pendleton added a comment - 29/Feb/08 04:25 AM
Committed sortExpressionsFinal.diff to the trunk as revision 632221.

I'll investigate merging this change back to the 10.3 branch. Until then,
I'll leave this issue open.

Bryan Pendleton added a comment - 02/Mar/08 04:16 PM
I merged revision 632221 to the 10.3 branch without incident.
My build and test runs were clean, so I committed the
change to the 10.3 branch as revision 632779.

Marking the issue as resolved.