Some notes on DERBY-3094

This note attempts to elaborate on some of the behaviors observed while studying DERBY-3094.

The problem

While running the following script, a NullPointerException occurs inside the generated bytecode.

create table xx (a double, b double);
insert into xx values (2, 3);
select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000); 

Here's a snip of the stack trace from the NPE:

java.lang.NullPointerException
        at org.apache.derby.exe.ac601a400fx0116x5cddx8367x000000107e400.e3(Unknown Source)
        at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:145)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.doProjection(ProjectRestrictResultSet.java:488)
        at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:291)
        at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:463)
        at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:424)
        at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:368)
The problem occurs while executing the generated projection method.

Interestingly, the following slight variations of the query do not exhibit the NPE:

select a, a*(b/100.000000), count(*) from xx group by a, b, a*(b/100.000000); 
select a, a*(b/100.000000), count(*) from xx group by a*(b/100.000000), a; 

The problem, abstractly

I believe that the preconditions for the problem are as follows: In this example, a*(b/100.000000) is the compound expression, and note that the problem occurs when we GROUP BY column a, but not column b, and a appears in the GROUP BY clause prior to a*(b/100.000000).

Examining the generated code

When debugging a problem which occurs in the execution of generated code, some extra steps are required. I used the techniques described at http://wiki.apache.org/db-derby/DumpClassFile to capture and examine the generated code.

Generated projection method

It's not the easiest to read, but here is a decompilation of the generated projection method e3, which is the method which encounters the NPE:

    public Object e3()
        throws StandardException, Exception
    {
        e6;
        2;
        NumberDataValue numberdatavalue = (NumberDataValue)getColumnFromRow(2, 2);
        numberdatavalue;
        numberdatavalue;
        new BigDecimal("100.000000");
        getDataValueFactory();
        JVM INSTR swap ;
        null;
        getDecimalDataValue();
        e7;
        divide();
        JVM INSTR dup ;
        this;
        JVM INSTR swap ;
        e7;
        JVM INSTR dup ;
        (NumberDataValue)getColumnFromRow(1, 1);
        JVM INSTR swap ;
        e8;
        times();
        JVM INSTR dup ;
        this;
        JVM INSTR swap ;
        e8;
        (DataValueDescriptor);
        setColumn();
        return e6;
    }
The NullPointerException occurs on the first call to getColumnFromRow().

If you read through this decompiled code "gently", you can see that this code is computing a*(b/100.000000):

BaseActivation.getColumnFromRow

This method is very simple. Here it is, in its entirety:

    protected final DataValueDescriptor getColumnFromRow(int rsNumber, int colId)
        throws StandardException {

        if( row[rsNumber] == null)
        {
            /* This actually happens. NoPutResultSetImpl.clearOrderableCache attempts to prefetch invariant values
             * into a cache. This fails in some deeply nested joins. See Beetle 4736 and 4880.
             */
            return null;
        }
        return row[rsNumber].getColumn(colId);
    }

An observation about the arguments to getColumnFromRow

Note that this call has been generated to fetch the value of column "b":

        (NumberDataValue)getColumnFromRow(2, 2);
while this call has been generated to fetch the value of column "a":
        (NumberDataValue)getColumnFromRow(1, 1);

It seems reasonable that column 1 is column "a", and column 2 is column "b", but why are we looking in result set number 2 for column b, but looking in result set number 1 for column a?

In fact, when I stopped at this point in the debugger, the activation has a completely legitimate row in result set 1, but result set 2 is empty, so I believe that the essence of the bug is:

The generated code for the compound expression contains errors. Some of the column references are generated against the correct result set, while other column references are generated against the wrong result set.

GROUP BY data structures and code generation

The bulk of the processing for compiling and generating code for a GROUP BY clause occurs in GroupByNode.java, in the init() method. The overall processing is sketched out by these comments in the init() method:

/*
** The first thing we do is put ourselves on
** top of the SELECT.  The select becomes the
** childResult.  So our RCL becomes its RCL (so
** nodes above it now point to us).  Map our
** RCL to its columns.
*/

/*
** We have aggregates, so we need to add
** an extra PRNode and we also have to muck around
** with our trees a might.
*/

An aside: GROUP BY processing history in Derby

In releases of Derby up through 10.2, GROUP BY processing was implemented by re-writing the query at parse time as a sub-select. This algorithm was changed for Derby 10.3 and a completely different implementation is now used. For background, see these JIRA issues: https://issues.apache.org/jira/browse/DERBY-681 and https://issues.apache.org/jira/browse/DERBY-883

Tree processing in GroupByNode.init()

The tree manipulation code for GROUP BY processing occurs primarily in the methods addNewPRNode(), addUnAggColumns(), and addAggregateColumns. The crucial bit of the processing is well described by the javadoc: http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/compile/GroupByNode.html#addNewColumnsForAggregation()

VirtualColumnNodes

Query execution processing involves a tree or pipeline of intermediate query results. Virtual Column Nodes (VCNs) are a mechanism for manipulating columns in intermediate results sets (as opposed to columns in actual user tables, which are described using BaseColumn instances.

For example, if we fetch some data, then sort it, then aggregate the results, then apply a HAVING clause to the results, this is represented internally by:

A VCN in the final PRN would tend to refer to a column in the underlying GBN; this gives the code generator enough information to emit code for the HAVING clause to refer to the results of the GROUP BY clause.

GroupByNode.addUnAggColumns()

GroupByNode.addUnAggColumns is the crucial method with respect to DERBY-3094. This method has the following pseudo-code outline:

for each entry in the GROUP BY list
    Allocate a new ResultColumn into the child PRNode
    Allocate a new RC into the GroupBy RCL
    Allocate a new VirtualColumnNode referring to the GroupBy RC
	Replace each matching expression in the parent PRN's RCL
        with a reference to the VCN instead. This is how the final
        ProjectRestrictNode's ResultColumn instances are set to
        refer to the results of the GROUP BY clause.
    Replace each machine expression in the parent PRN's HAVING
        clause in the same fashion, since the HAVING clause, too,
        refers to the results of the GROUP BY clause.
end for

SubstituteExpressionVisitor

The code which modifies the parent ResultColumn expressions to reset them to contain VCN pointers is as follows:

            // we replace each group by expression
            // in the projection list with a virtual column node
            // that effectively points to a result column
            // in the result set doing the group by
            SubstituteExpressionVisitor se =
                new SubstituteExpressionVisitor(
                        gbc.getColumnExpression(),
                        vc,
                        AggregateNode.class);
            parent.getResultColumns().accept(se);

The Visitor pattern is well known in programming contexts, but this visitor pattern is slightly different than the typical ones:

Here's the code for ResultColumn.accept:
    public Visitable accept(Visitor v)
        throws StandardException
    {
        Visitable returnNode = v.visit(this);

        if (v.skipChildren(this))
        {
            return returnNode;
        }

        if (expression != null && !v.stopTraversal())
        {
            expression = (ValueNode)expression.accept(v);
        }
        return returnNode;
    }
Note that the value returned from the visitor is used to reset the ResultColumn's expression. This means that a call to accept() can alter the expression values in ResultColumn instances that it visits

SubstituteExpressionVisitor, in fact, does exactly this alteration. When it detects that the value being visited is equivalent to the source expression it is looking for, it returns the target expression instead of returning the node that it's visiting:

    public Visitable visit(Visitable node) throws StandardException
    {
        if (!(node instanceof ValueNode))
        {
            return node;
        }

        ValueNode nd = (ValueNode)node;
        if (nd.isEquivalent(source))
        {
            return target;
        }
        else
        {
            return node;
        }
    }
This is a rather subtle aspect of SubstituteExpressionVisitor.

All the elements are in place, show me the bug!

We now have enough context, I hope, to discuss the actual bug. Recall the select statement of interest:

select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000); 
When GroupByNode.init calls addUnAggColumns(), the situation is as follows: We then start into the "for each entry in the GROUP BY list" loop. The first entry is column "a", so we process that column: Now we process the second entry in the GROUP BY list, "a*(b/100.000000)": The damage has now been done. The parent's RCL is left with a messed-up ResultColumn which contains an expression which has a VCN pointing to result set 1 for column "a", and a ColumnReference pointing to result set 2 for column "b". At code generation time, this then generates the malformed projection method "e3" that we saw above.

What is the fix?

The essence of the fix that I propose is to treat expressions as expressions, and individual column references as individual column references, when subtituting VCN values up into the parent's RCL. To do this, I propose to modify the addUnAggColumns() loop so that it uses a two-pass technique, subtituting expressions inline, but deferring substitutions of individual column references until after all expressions have been processed.

Here's the proposed diff:

-                       SubstituteExpressionVisitor se =
-                               new SubstituteExpressionVisitor(
-                                               gbc.getColumnExpression(),
-                                               vc,
+                       //
+                       // Note that we perform the VCN replacements
+                       // for compound GROUP BY expressions inline, while
+                       // we defer the VCN replacements for simple
+                       // GROUP BY column references until after all
+                       // the expressions have been processed. This is
+                       // because a compound expression may contain a
+                       // reference to a simple grouped column, but in
+                       // such a case we want to process the expression
+                       // as an expression, not as individual column
+                       // references. E.g., if the statement was:
+                       //   SELECT ... GROUP BY C1, C1 * (C2 / 100), C3
+                       // then we don't want the replacement of the
+                       // simple column reference C1 to affect the
+                       // compound expression C1 * (C2 / 100). DERBY-3094.
+                       //
+                       ValueNode vn = gbc.getColumnExpression();
+                       SubstituteExpressionVisitor vis =
+                               new SubstituteExpressionVisitor(vn, vc,
                                                AggregateNode.class);
-                       parent.getResultColumns().accept(se);
+                       if (vn instanceof ColumnReference)
+                               referencesToSubstitute.add(vis);
+                       else
+                               parent.getResultColumns().accept(vis);

                        // Since we always need a PR node on top of the GB
                        // node to perform projection we can use it to perform
@@ -414,15 +435,23 @@
                        // GBN (RCL) -> (C1, SUM(C2), <input>, <aggregator>, MAX(C3), <input>, <aggregator>
                        //              |
                        //       FBT (C1, C2)
-                       if (havingClause != null) {
+                       if (havingClause != null)
+                       {
                                SubstituteExpressionVisitor havingSE =
-                                       new SubstituteExpressionVisitor(
-                                                       gbc.getColumnExpression(),
-                                                       vc, null);
-                               havingClause.accept(havingSE);
+                                       new SubstituteExpressionVisitor(vn,vc,null);
+                               if (vn instanceof ColumnReference)
+                                       havingRefsToSubstitute.add(havingSE);
+                               else
+                                       havingClause.accept(havingSE);
                        }
                        gbc.setColumnPosition(bottomRCL.size());
                }
+               for (int r = 0; r < referencesToSubstitute.size(); r++)
+                       parent.getResultColumns().accept(
+                               (SubstituteExpressionVisitor)referencesToSubstitute.get(r));
+               for (int r = 0; r < havingRefsToSubstitute.size(); r++)
+                       havingClause.accept(
+                               (SubstituteExpressionVisitor)havingRefsToSubstitute.get(r));

A side effect of the fix

Note that, before the fix, we were trying to calculate the expression "a*(b/100.000000)" twice, once before the group by, and once after. When the new code runs, it replaces all the elements in the parent's RCL with VCN references, which means that no actual projection method is needed, and the generated code simply emits the GROUP BY results directly, without any further processing (this is hard to see without decompiling the generated code and inspecting it in detail, which I'm skipping for the purposes of this note.

The bug occurs in the HAVING clause too

The following statement also fails:

select a, a*(b/100.000000), count(*) from xx group by a, a*(b/100.000000)
     having a*(b/100.000000) < 1; 

The reason is the same (the substitute expression visitor damages the parent's having clause expression in the first pass, so it isn't properly recognized in the second pass), and the fix is the same.