SPARK-22018

Catalyst Optimizer does not preserve top-level metadata while collapsing projects

      Description

      Consider a plan with two stacked Project operators, as follows:

      Project [a_with_metadata#27 AS b#26]
      +- Project [a#0 AS a_with_metadata#27]
         +- LocalRelation <empty>, [a#0, b#1]
      

      The child Project has an output column that carries metadata, and the parent Project has an alias that implicitly forwards that metadata, so the metadata is visible to higher operators. After the CollapseProject optimizer rule is applied, the metadata is not preserved:

      Project [a#0 AS b#26]
      +- LocalRelation <empty>, [a#0, b#1]
      

      This is incorrect: downstream operators that rely on metadata to identify particular fields (e.g. the watermark in Structured Streaming) can no longer find them.
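
      The two plans above can be rebuilt directly from Catalyst classes to check whether a given Spark build is affected. The following is only a minimal sketch, not the test attached to this issue: the object name, the metadata key "key", and the helper hasKey are hypothetical, and it relies on internal Catalyst APIs (Alias, Project, LocalRelation, CollapseProject) whose signatures may differ between Spark versions. It assumes spark-catalyst is on the classpath.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
import org.apache.spark.sql.catalyst.optimizer.CollapseProject
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Project}
import org.apache.spark.sql.types.{IntegerType, MetadataBuilder}

// Hypothetical helper object, for illustration only.
object CollapseProjectMetadataCheck {

  // True if the first output column of the plan carries the illustrative key.
  def hasKey(plan: LogicalPlan): Boolean =
    plan.output.head.metadata.contains("key")

  // Returns (metadata visible before the rule, metadata visible after the rule).
  def metadataBeforeAndAfter(): (Boolean, Boolean) = {
    // Illustrative metadata; the key name is an assumption, not taken from the issue.
    val metadata = new MetadataBuilder().putLong("key", 1L).build()

    // LocalRelation <empty>, [a, b]
    val a = AttributeReference("a", IntegerType)()
    val b = AttributeReference("b", IntegerType)()
    val relation = LocalRelation(a, b)

    // Child: Project [a AS a_with_metadata], with metadata set explicitly on the alias.
    val inner = Alias(a, "a_with_metadata")(explicitMetadata = Some(metadata))
    val child = Project(Seq(inner), relation)

    // Parent: Project [a_with_metadata AS b]; no explicit metadata, so the alias
    // forwards whatever metadata its child attribute carries.
    val outer = Alias(inner.toAttribute, "b")()
    val parent = Project(Seq(outer), child)

    // Apply only the CollapseProject rule, so no other optimizer rule interferes.
    val collapsed = CollapseProject(parent)

    (hasKey(parent), hasKey(collapsed))
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = metadataBeforeAndAfter()
    println(s"metadata on b before CollapseProject: $before")
    println(s"metadata on b after CollapseProject:  $after")
  }
}
```

      The first value should be true because the parent alias forwards its child's metadata. Per this report, an affected build would print false for the second value, while a build containing a fix would be expected to print true.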

            People

            • Assignee:
              tdas Tathagata Das
              Reporter:
              tdas Tathagata Das
            • Votes:
              0
              Watchers:
              3