This note attempts to elaborate on some of the behaviors observed while studying DERBY-3094.
While running the following script, a NullPointerException occurs inside the generated bytecode.
create table xx (a double, b double);
insert into xx values (2, 3);
select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000);
Here's a snip of the stack trace from the NPE:
java.lang.NullPointerException
at org.apache.derby.exe.ac601a400fx0116x5cddx8367x000000107e400.e3(Unknown Source)
at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:145)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.doProjection(ProjectRestrictResultSet.java:488)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:291)
at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:463)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:424)
at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:368)
The problem occurs while executing the generated projection method.
Interestingly, the following slight variations of the query do not exhibit the NPE:
select a, a*(b/100.000000), count(*) from xx group by a, b, a*(b/100.000000);
select a, a*(b/100.000000), count(*) from xx group by a*(b/100.000000), a;
a*(b/100.000000) is the compound expression,
and note that the problem occurs when we GROUP BY column a,
but not column b, and a appears in the GROUP BY
clause prior to a*(b/100.000000).
When debugging a problem which occurs in the execution of generated code, some extra steps are required. I used the techniques described at http://wiki.apache.org/db-derby/DumpClassFile to capture and examine the generated code.
It's not the easiest to read, but here is a decompilation of the
generated projection method e3, which is the method
which encounters the NPE:
public Object e3()
throws StandardException, Exception
{
e6;
2;
NumberDataValue numberdatavalue = (NumberDataValue)getColumnFromRow(2, 2);
numberdatavalue;
numberdatavalue;
new BigDecimal("100.000000");
getDataValueFactory();
JVM INSTR swap ;
null;
getDecimalDataValue();
e7;
divide();
JVM INSTR dup ;
this;
JVM INSTR swap ;
e7;
JVM INSTR dup ;
(NumberDataValue)getColumnFromRow(1, 1);
JVM INSTR swap ;
e8;
times();
JVM INSTR dup ;
this;
JVM INSTR swap ;
e8;
(DataValueDescriptor);
setColumn();
return e6;
}
The NullPointerException occurs on the first call to getColumnFromRow().
If you read through this decompiled code "gently", you can see that this
code is computing a*(b/100.000000):
This method is very simple. Here it is, in its entirety:
protected final DataValueDescriptor getColumnFromRow(int rsNumber, int colId)
throws StandardException {
if( row[rsNumber] == null)
{
/* This actually happens. NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values
* into a cache. This fails in some deeply nested joins. See Beetle 4736 and 4880.
*/
return null;
}
return row[rsNumber].getColumn(colId);
}
Note that this call has been generated to fetch the value of column "b":
(NumberDataValue)getColumnFromRow(2, 2);
while this call has been generated to fetch the value of column "a":
(NumberDataValue)getColumnFromRow(1, 1);
It seems reasonable that column 1 is column "a", and column 2 is column "b", but why are we looking in result set number 2 for column b, but looking in result set number 1 for column a?
In fact, when I stopped at this point in the debugger, the activation has a completely legitimate row in result set 1, but result set 2 is empty, so I believe that the essence of the bug is:
The generated code for the compound expression contains errors. Some of the column references are generated against the correct result set, while other column references are generated against the wrong result set.
The bulk of the processing for compiling and generating code for a GROUP BY clause occurs in GroupByNode.java, in the init() method. The overall processing is sketched out by these comments in the init() method:
/*
** The first thing we do is put ourselves on
** top of the SELECT. The select becomes the
** childResult. So our RCL becomes its RCL (so
** nodes above it now point to us). Map our
** RCL to its columns.
*/
/*
** We have aggregates, so we need to add
** an extra PRNode and we also have to muck around
** with our trees a might.
*/
In releases of Derby up through 10.2, GROUP BY processing was implemented by re-writing the query at parse time as a sub-select. This algorithm was changed for Derby 10.3 and a completely different implementation is now used. For background, see these JIRA issues: https://issues.apache.org/jira/browse/DERBY-681 and https://issues.apache.org/jira/browse/DERBY-883
The tree manipulation code for GROUP BY processing occurs primarily
in the methods addNewPRNode(), addUnAggColumns(),
and addAggregateColumns.
The crucial bit of the processing is well described by the javadoc:
http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/compile/GroupByNode.html#addNewColumnsForAggregation()
Query execution processing involves a tree or pipeline of intermediate query results. Virtual Column Nodes (VCNs) are a mechanism for manipulating columns in intermediate results sets (as opposed to columns in actual user tables, which are described using BaseColumn instances.
For example, if we fetch some data, then sort it, then aggregate the results, then apply a HAVING clause to the results, this is represented internally by:
GroupByNode.addUnAggColumns is the crucial method with respect to DERBY-3094. This method has the following pseudo-code outline:
for each entry in the GROUP BY list
Allocate a new ResultColumn into the child PRNode
Allocate a new RC into the GroupBy RCL
Allocate a new VirtualColumnNode referring to the GroupBy RC
Replace each matching expression in the parent PRN's RCL
with a reference to the VCN instead. This is how the final
ProjectRestrictNode's ResultColumn instances are set to
refer to the results of the GROUP BY clause.
Replace each machine expression in the parent PRN's HAVING
clause in the same fashion, since the HAVING clause, too,
refers to the results of the GROUP BY clause.
end for
The code which modifies the parent ResultColumn expressions to reset them to contain VCN pointers is as follows:
// we replace each group by expression
// in the projection list with a virtual column node
// that effectively points to a result column
// in the result set doing the group by
SubstituteExpressionVisitor se =
new SubstituteExpressionVisitor(
gbc.getColumnExpression(),
vc,
AggregateNode.class);
parent.getResultColumns().accept(se);
The Visitor pattern is well known in programming contexts, but this visitor pattern is slightly different than the typical ones:
public Visitable accept(Visitor v)
throws StandardException
{
Visitable returnNode = v.visit(this);
if (v.skipChildren(this))
{
return returnNode;
}
if (expression != null && !v.stopTraversal())
{
expression = (ValueNode)expression.accept(v);
}
return returnNode;
}
Note that the value returned from the visitor is used to reset the
ResultColumn's expression. This means that a call to accept() can
alter the expression values in ResultColumn instances that it visits
SubstituteExpressionVisitor, in fact, does exactly this alteration. When it detects that the value being visited is equivalent to the source expression it is looking for, it returns the target expression instead of returning the node that it's visiting:
public Visitable visit(Visitable node) throws StandardException
{
if (!(node instanceof ValueNode))
{
return node;
}
ValueNode nd = (ValueNode)node;
if (nd.isEquivalent(source))
{
return target;
}
else
{
return node;
}
}
This is a rather subtle aspect of SubstituteExpressionVisitor.
We now have enough context, I hope, to discuss the actual bug. Recall the select statement of interest:
select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000);
When GroupByNode.init calls addUnAggColumns(), the situation is as follows:
The essence of the fix that I propose is to treat expressions as expressions, and individual column references as individual column references, when subtituting VCN values up into the parent's RCL. To do this, I propose to modify the addUnAggColumns() loop so that it uses a two-pass technique, subtituting expressions inline, but deferring substitutions of individual column references until after all expressions have been processed.
Here's the proposed diff:
- SubstituteExpressionVisitor se =
- new SubstituteExpressionVisitor(
- gbc.getColumnExpression(),
- vc,
+ //
+ // Note that we perform the VCN replacements
+ // for compound GROUP BY expressions inline, while
+ // we defer the VCN replacements for simple
+ // GROUP BY column references until after all
+ // the expressions have been processed. This is
+ // because a compound expression may contain a
+ // reference to a simple grouped column, but in
+ // such a case we want to process the expression
+ // as an expression, not as individual column
+ // references. E.g., if the statement was:
+ // SELECT ... GROUP BY C1, C1 * (C2 / 100), C3
+ // then we don't want the replacement of the
+ // simple column reference C1 to affect the
+ // compound expression C1 * (C2 / 100). DERBY-3094.
+ //
+ ValueNode vn = gbc.getColumnExpression();
+ SubstituteExpressionVisitor vis =
+ new SubstituteExpressionVisitor(vn, vc,
AggregateNode.class);
- parent.getResultColumns().accept(se);
+ if (vn instanceof ColumnReference)
+ referencesToSubstitute.add(vis);
+ else
+ parent.getResultColumns().accept(vis);
// Since we always need a PR node on top of the GB
// node to perform projection we can use it to perform
@@ -414,15 +435,23 @@
// GBN (RCL) -> (C1, SUM(C2), <input>, <aggregator>, MAX(C3), <input>, <aggregator>
// |
// FBT (C1, C2)
- if (havingClause != null) {
+ if (havingClause != null)
+ {
SubstituteExpressionVisitor havingSE =
- new SubstituteExpressionVisitor(
- gbc.getColumnExpression(),
- vc, null);
- havingClause.accept(havingSE);
+ new SubstituteExpressionVisitor(vn,vc,null);
+ if (vn instanceof ColumnReference)
+ havingRefsToSubstitute.add(havingSE);
+ else
+ havingClause.accept(havingSE);
}
gbc.setColumnPosition(bottomRCL.size());
}
+ for (int r = 0; r < referencesToSubstitute.size(); r++)
+ parent.getResultColumns().accept(
+ (SubstituteExpressionVisitor)referencesToSubstitute.get(r));
+ for (int r = 0; r < havingRefsToSubstitute.size(); r++)
+ havingClause.accept(
+ (SubstituteExpressionVisitor)havingRefsToSubstitute.get(r));
Note that, before the fix, we were trying to calculate the expression "a*(b/100.000000)" twice, once before the group by, and once after. When the new code runs, it replaces all the elements in the parent's RCL with VCN references, which means that no actual projection method is needed, and the generated code simply emits the GROUP BY results directly, without any further processing (this is hard to see without decompiling the generated code and inspecting it in detail, which I'm skipping for the purposes of this note.
The following statement also fails:
select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000)
having a*(b/100.000000) < 1;
The reason is the same (the substitute expression visitor damages the parent's having clause expression in the first pass, so it isn't properly recognized in the second pass), and the fix is the same.