Your code change to RexProgram doesn't look quite right. If the input has collations (a, b), (x, p), (x, y, z) and we're looking for (x, y) then we shouldn't stop at (x, p), should we?
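To make the concern concrete, here is a minimal sketch of the check I have in mind. It is not the RexProgram code: the class name, the method name and the use of plain column-name lists are assumptions for illustration only. The point is that a partial match such as (x, p) must not end the search, because a later collation may still have the required columns as a prefix.

```java
import java.util.List;

// Hypothetical helper, not part of Calcite; collations are modelled as lists of column names.
class CollationPrefixSketch {
  /** Returns true if any input collation starts with the required columns. */
  static boolean satisfies(List<List<String>> inputCollations, List<String> required) {
    for (List<String> collation : inputCollations) {
      if (collation.size() >= required.size()
          && collation.subList(0, required.size()).equals(required)) {
        return true;
      }
      // A partial match such as (x, p) is not a reason to stop: keep scanning.
    }
    return false;
  }
}
```

With the inputs (a, b), (x, p), (x, y, z) and the requirement (x, y), the loop skips the first two collations and succeeds on the third.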
There was a specific reason that I introduced PRESERVE (ordering by non-projected columns). I couldn't tell from the patch – is that case still handled?
I am working on CALCITE-526 in branch https://github.com/julianhyde/incubator-calcite/tree/calcite-526. A lot of the work is to do with traits and collation; for instance, I allow a RelNode (but not a RelSubset) to have multiple traits of the same type, if the traitDef supports it. I think we need to solve both issues at the same time.
So I propose that we split the patch in two – the column renames can be checked in first, and I'll fold the collation work into my branch. What do you think?
I'm uncomfortable with the statement "LogicalProject is always created with empty collation". The LogicalXxx nodes are of logical convention but I'm not sure we should disallow them from having other traits. I'm going to put the logic to deduce the collations for core types (project, filter, sort, aggregate, join, union) into a new class RelMdCollation. By default, each RelNode subclass will have the traits you'd expect - e.g. LogicalProject(x, y) on LogicalSort(y) will indeed be sorted on y - but any code that creates a RelNode subclass can override.
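As a rough illustration of the default deduction for a project, here is a sketch under simplifying assumptions. It is not the RelMdCollation code: the class and method names are hypothetical, columns are plain names rather than field ordinals, and the projection only passes columns through without expressions. It keeps the longest prefix of the input collation whose columns all survive the projection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Calcite; columns are modelled as names.
class ProjectCollationSketch {
  /** Returns the longest prefix of the input collation made up of projected columns. */
  static List<String> deduce(List<String> inputCollation, List<String> projected) {
    List<String> result = new ArrayList<>();
    for (String column : inputCollation) {
      if (!projected.contains(column)) {
        break; // ordering beyond a dropped column is no longer visible in the output
      }
      result.add(column);
    }
    return result;
  }
}
```

For an input sorted on (y) and a projection of (x, y), this yields (y), which matches the LogicalProject(x, y) on LogicalSort(y) case. Ordering by non-projected columns (the PRESERVE case) is deliberately not covered by this sketch.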
I am well aware that not every implementation of Aggregate(x, y, sum(z)) produces output sorted on (x, y), but to ban collations on logical RelNodes would be going too far in the other direction. I don't want to have to wait until we get into the physical domain (e.g. EnumerableXxx) before collation comes into play. That would make it more difficult to share rules.