Spark / SPARK-42704

SubqueryAlias should propagate metadata columns its child already selects

Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.2, 3.4.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      The `AddMetadataColumns` analyzer rule is intended to make metadata columns available for resolution, even if the plan already contains projections that did not explicitly mention the metadata column.

      The `SubqueryAlias` plan node intentionally does not propagate metadata columns automatically from a non-leaf/non-subquery child node, because the following should not work:

       

      spark.read.table("t").select("a", "b").as("s").select("_metadata")

      However, today it is too strict: it breaks the metadata chain even when the child node's output already includes the metadata column:

       

      // expected to work (and does)
      spark.read.table("t")
        .select("a", "b").select("_metadata")
      
      // by extension, should also work (but does not)
      spark.read.table("t").select("a", "b", "_metadata").as("s")
        .select("a", "b").select("_metadata")

      The solution is for `SubqueryAlias` to always propagate metadata columns that are already in the child's output, thus preserving the `metadataOutput` chain for that column.
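
The proposed rule can be illustrated with a small standalone model. This is only a sketch and not Spark's Catalyst code: `Attr`, `Plan`, `Relation`, `Project`, and this `SubqueryAlias` are hypothetical toy classes, qualifiers are ignored, and the `propagateSelected` flag exists only so the current and proposed behaviors can be compared side by side.

```scala
// Toy model only: these are NOT Spark's Catalyst classes. All names here are
// hypothetical and exist purely to illustrate the propagation rule.
object MetadataPropagationSketch {
  final case class Attr(name: String)

  sealed trait Plan {
    def output: Seq[Attr]          // columns the node exposes to its parent
    def metadataOutput: Seq[Attr]  // metadata columns still resolvable above this node
  }

  // Leaf relation: exposes data columns and offers metadata columns on request.
  final case class Relation(output: Seq[Attr], metadataOutput: Seq[Attr]) extends Plan

  // Simplifying assumption: a projection passes the metadata chain through unchanged.
  final case class Project(output: Seq[Attr], child: Plan) extends Plan {
    def metadataOutput: Seq[Attr] = child.metadataOutput
  }

  final case class SubqueryAlias(alias: String, child: Plan, propagateSelected: Boolean)
      extends Plan {
    def output: Seq[Attr] = child.output

    def metadataOutput: Seq[Attr] = child match {
      // Leaf or nested alias: propagate everything (behavior described in the report).
      case _: Relation | _: SubqueryAlias => child.metadataOutput
      // Proposed: keep only metadata columns the child already selects.
      case _ if propagateSelected =>
        child.metadataOutput.filter(attr => child.output.contains(attr))
      // Current: any other child breaks the chain.
      case _ => Nil
    }
  }

  val a: Attr = Attr("a")
  val b: Attr = Attr("b")
  val meta: Attr = Attr("_metadata")
  val t: Relation = Relation(Seq(a, b), Seq(meta))

  // select("a", "b").as("s"): the metadata column was never selected.
  val hidden: SubqueryAlias =
    SubqueryAlias("s", Project(Seq(a, b), t), propagateSelected = true)

  // select("a", "b", "_metadata").as("s"): the child already outputs the column.
  val selected: SubqueryAlias =
    SubqueryAlias("s", Project(Seq(a, b, meta), t), propagateSelected = true)

  def main(args: Array[String]): Unit = {
    assert(hidden.metadataOutput.isEmpty)          // must stay unresolvable
    assert(selected.metadataOutput == Seq(meta))   // chain preserved under the proposal
    assert(selected.copy(propagateSelected = false).metadataOutput.isEmpty) // current behavior
  }
}
```

In this model the first example from the description still fails to resolve `_metadata` through the alias, while the third example keeps the column resolvable once the proposed filter is applied.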

          People

            Assignee: Unassigned
            Reporter: Ryan Johnson (ryan.johnson@databricks.com)
            Votes: 0
            Watchers: 3
