This is a correctness bug when reusing a set of project expressions in the DataFrame API.
Use case: a user was migrating a table to a new version with an additional column ("data" in the repro case). To migrate, the user unions the old table ("t2") with the new table ("t1") and applies a common set of projections to both sides so that the union doesn't hit the column-ordering issue (SPARK-22335). In some cases, this produces an incorrect query plan:
The problem happens because "outputCols" contains an alias. The ID for that alias is assigned when the projection Seq is created, so the same ID is reused on both sides of the union.
When FoldablePropagation runs, it identifies that "data" in the t2 side of the union is a foldable expression and replaces all references to it, including the references in the t1 side of the union.
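The mechanism can be sketched with a small toy model in plain Python. This is not Spark code: make_alias, project, and propagate_foldables are hypothetical helpers, and using a None literal for "data" on the t2 side is an assumption made only for illustration. The sketch shows that when one alias ID is shared by both sides, a propagation pass keyed on that ID also rewrites the reference on the t1 side.

```python
import itertools

_next_id = itertools.count(1)

def make_alias(name):
    """Toy stand-in for an alias: its ID is allocated at construction time."""
    return {"name": name, "expr_id": next(_next_id)}

def project(source, alias):
    """One union side: the alias definition plus a reference to it by ID,
    as in a projection that sits above a join."""
    return {"define": {alias["expr_id"]: source},
            "reference": ("attr", alias["expr_id"])}

def propagate_foldables(sides):
    """Naive propagation: treat every literal-backed ID as foldable and
    rewrite all references to that ID, on every side of the union."""
    foldable = {}
    for side in sides:
        for expr_id, source in side["define"].items():
            if source[0] == "lit":
                foldable[expr_id] = source
    return [foldable.get(side["reference"][1], side["reference"])
            for side in sides]

# Shared projection list: the alias is created once and reused on both sides.
shared = make_alias("data")
t1_side = project(("col", "data"), shared)  # new table: a real column
t2_side = project(("lit", None), shared)    # old table: assumed constant placeholder
shared_result = propagate_foldables([t1_side, t2_side])

# For contrast: a fresh alias (and so a fresh ID) for each side.
fresh_result = propagate_foldables([
    project(("col", "data"), make_alias("data")),
    project(("lit", None), make_alias("data")),
])
```

In this model, shared_result[0] ends up as the literal from the t2 side, which mirrors the incorrect plan described above, while fresh_result[0] remains an attribute reference because the two sides no longer share an ID.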
The join to a dimension table is necessary to reproduce the problem because it forces a Projection on top of the join that refers to "data" through an AttributeReference (data#237). Without the join, the projections are collapsed, and the resulting projection contains an Alias, which FoldablePropagation does not rewrite.