Details
-
Bug
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
3.3.2, 3.4.0
-
None
-
None
Description
The `AddMetadataColumns` analyzer rule is intended to resolve available metadata columns, even if the plan already contains projections that did not explicitly mention the metadata column.
The `SubqueryAlias` plan node intentionally does not propagate metadata columns automatically from a non-leaf/non-subquery child node, because the following should not work:
spark.read.table("t").select("a", "b").as("s").select("_metadata")
However, today it is too strict: it breaks the metadata chain even when the child node's output already includes the metadata column:
// expected to work (and does)
spark.read.table("t")
  .select("a", "b").select("_metadata")

// by extension, should also work (but does not)
spark.read.table("t").select("a", "b", "_metadata").as("s")
  .select("a", "b").select("_metadata")
The solution is for `SubqueryAlias` to always propagate metadata columns that are already in the child's output, thus preserving the `metadataOutput` chain for that column.
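To make the intended rule concrete, here is a minimal toy model of the metadata chain. It is only a sketch: `Col`, `ToyPlan`, `ToyRelation`, `ToyProject`, `ToySubqueryAlias` and the `proposed` flag are hypothetical names invented for illustration, not Spark's Catalyst classes, and the propagation logic is simplified from the description above rather than taken from the actual code.

```scala
// Toy model only: hypothetical types, not Spark's Catalyst classes.
final case class Col(name: String, isMetadata: Boolean = false)

sealed trait ToyPlan {
  def output: Seq[Col]
  def metadataOutput: Seq[Col]
  // A name resolves if it is in the regular output or reachable via the metadata chain.
  def canResolve(name: String): Boolean =
    (output ++ metadataOutput).exists(_.name == name)
}

// Leaf node: exposes data columns and, separately, its metadata columns.
final case class ToyRelation(data: Seq[Col], metadata: Seq[Col]) extends ToyPlan {
  def output: Seq[Col] = data
  def metadataOutput: Seq[Col] = metadata
}

// Projection: assumed here to pass the child's metadata chain through unchanged.
final case class ToyProject(columns: Seq[String], child: ToyPlan) extends ToyPlan {
  def output: Seq[Col] = columns.map { n =>
    (child.output ++ child.metadataOutput)
      .find(_.name == n)
      .getOrElse(sys.error(s"cannot resolve column $n"))
  }
  def metadataOutput: Seq[Col] = child.metadataOutput
}

// Alias: `proposed = false` models the strict behavior described above,
// `proposed = true` models the suggested fix.
final case class ToySubqueryAlias(alias: String, child: ToyPlan, proposed: Boolean)
    extends ToyPlan {
  def output: Seq[Col] = child.output
  def metadataOutput: Seq[Col] = {
    // Strict rule: only leaf or subquery children propagate their metadata chain.
    val fromChain = child match {
      case _: ToyRelation | _: ToySubqueryAlias => child.metadataOutput
      case _                                    => Seq.empty[Col]
    }
    // Proposed rule: also keep metadata columns already present in the child's output.
    val fromOutput =
      if (proposed) child.output.filter(_.isMetadata) else Seq.empty[Col]
    (fromChain ++ fromOutput).distinct
  }
}

object MetadataChainSketch {
  val t: ToyPlan =
    ToyRelation(Seq(Col("a"), Col("b")), Seq(Col("_metadata", isMetadata = true)))

  // Models table("t").select("a", "b", "_metadata").as("s").select("a", "b")
  def chain(proposed: Boolean): ToyPlan =
    ToyProject(Seq("a", "b"),
      ToySubqueryAlias("s", ToyProject(Seq("a", "b", "_metadata"), t), proposed))

  // Models table("t").select("a", "b").as("s")
  def hidden(proposed: Boolean): ToyPlan =
    ToySubqueryAlias("s", ToyProject(Seq("a", "b"), t), proposed)
}
```

In this model, `MetadataChainSketch.chain(proposed = true).canResolve("_metadata")` holds while the strict variant does not, and `hidden(...)` cannot resolve `_metadata` in either variant, which matches the case that should keep failing.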