SPARK-22018

Catalyst Optimizer does not preserve top-level metadata while collapsing projects

Details

    Description

      Consider two nested Project operators, as follows:

      Project [a_with_metadata#27 AS b#26]
      +- Project [a#0 AS a_with_metadata#27]
         +- LocalRelation <empty>, [a#0, b#1]
      

      The child Project has an output column that carries metadata, and the parent Project has an alias that implicitly forwards that metadata, so the metadata is visible to operators higher in the plan. After the CollapseProject optimizer rule is applied, the metadata is not preserved:

      Project [a#0 AS b#26]
      +- LocalRelation <empty>, [a#0, b#1]
      

      This is incorrect: downstream operators that rely on certain metadata to identify fields (e.g. the watermark in Structured Streaming) will fail to do so.
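
      The behavior can be illustrated with a small, self-contained toy model. This is a sketch only, not Catalyst's code: the classes `Relation`, `Alias` and `Project`, the helpers `output_metadata` and `collapse`, and the metadata key `example.watermark` are all hypothetical names introduced for illustration. The model assumes an alias either carries explicit metadata or inherits the metadata of the column it refers to. The metadata-preserving variant shows one possible way to keep the metadata and is not necessarily how the rule was fixed in Spark.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

Metadata = Dict[str, str]


@dataclass
class Relation:
    """Leaf of the toy plan: maps column name to that column's metadata."""
    columns: Dict[str, Metadata]


@dataclass
class Alias:
    """`source AS name`; explicit_metadata=None means "inherit from source"."""
    source: str
    name: str
    explicit_metadata: Optional[Metadata] = None


@dataclass
class Project:
    aliases: List[Alias]
    child: Union["Project", Relation]


def output_metadata(plan: Union[Project, Relation]) -> Dict[str, Metadata]:
    """Metadata visible on each output column of the plan."""
    if isinstance(plan, Relation):
        return dict(plan.columns)
    below = output_metadata(plan.child)
    return {
        a.name: a.explicit_metadata if a.explicit_metadata is not None else below[a.source]
        for a in plan.aliases
    }


def collapse(parent: Project, preserve_metadata: bool) -> Project:
    """Merge a Project with its child Project into a single Project."""
    child = parent.child
    if not isinstance(child, Project):
        raise TypeError("collapse expects a Project directly above another Project")
    inner_by_name = {a.name: a for a in child.aliases}
    visible = output_metadata(parent)  # metadata seen before collapsing
    merged = []
    for outer in parent.aliases:
        inner = inner_by_name[outer.source]
        # Naive variant: keep only what the outer alias states explicitly,
        # so metadata it merely forwarded from the inner alias is lost.
        # Preserving variant: pin the metadata that was visible before.
        meta = visible[outer.name] if preserve_metadata else outer.explicit_metadata
        merged.append(Alias(inner.source, outer.name, meta))
    return Project(merged, child.child)


# Toy version of the plan in this issue (hypothetical metadata key and value).
relation = Relation({"a": {}, "b": {}})
child_project = Project(
    [Alias("a", "a_with_metadata", {"example.watermark": "10 seconds"})], relation
)
parent_project = Project([Alias("a_with_metadata", "b")], child_project)

naive = collapse(parent_project, preserve_metadata=False)
preserving = collapse(parent_project, preserve_metadata=True)
```

      In this model, `output_metadata(parent_project)` reports the metadata on `b`, the naive collapse yields a plan in which `b` has only the (empty) metadata of `a`, and the preserving collapse keeps it.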

          People

            tdas Tathagata Das