Spark / SPARK-50492

Fix java.util.NoSuchElementException when event time column is dropped after dropDuplicatesWithinWatermark


Details

    Description

      Consider the following query:

      ```
      val result = inputData.toDF()
        .select("_1", "_2")
        .withColumn("timestamp", to_timestamp($"_2", "yyyy-MM-dd HH:mm:ss"))
        .withWatermark("timestamp", "24 hours")
        .dropDuplicatesWithinWatermark("timestamp")
        // The final projection drops the event time column `timestamp`.
        .select("_1")
      ```
       
      Currently, the `ColumnPruning` optimizer rule prunes the `timestamp` column because the final Project does not select it. `DeduplicateWithinWatermarkExec` then throws a `java.util.NoSuchElementException` when it tries to look up the event time column.
       
      We need to update the references of the `DeduplicateWithinWatermark` logical plan node to include the event time column, so that pruning keeps it in the node's input.
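
      The interaction between pruning and a node's declared references can be sketched with a toy model. This is not Spark's code: `Plan`, `Source`, `Project`, `Dedup`, `PruningSketch`, and the `declareEventTime` flag are all hypothetical names, the pruning rule is heavily simplified, and the dedup key is assumed to be `_1` purely for illustration. The sketch only shows that a column missing from a node's references can be pruned away beneath it, and that declaring it keeps it.

      ```scala
      // Toy model of column pruning. These are NOT Spark's classes; all names are hypothetical.
      sealed trait Plan {
        def output: Set[String]     // columns this node produces
        def references: Set[String] // columns this node declares it reads from its child
      }

      case class Source(output: Set[String]) extends Plan {
        def references: Set[String] = Set.empty
      }

      case class Project(cols: Set[String], child: Plan) extends Plan {
        def output: Set[String] = cols
        def references: Set[String] = cols
      }

      case class Dedup(keys: Set[String], eventTime: String,
                       declareEventTime: Boolean, child: Plan) extends Plan {
        def output: Set[String] = child.output
        // The flag models "before the fix" (false) and "after the fix" (true).
        def references: Set[String] =
          if (declareEventTime) keys + eventTime else keys
      }

      object PruningSketch {
        // Simplified pruning: under Project-over-Dedup, keep only the columns
        // that the Project selects or that the Dedup declares as references.
        def prune(plan: Plan): Plan = plan match {
          case Project(cols, d: Dedup) =>
            val required = (cols ++ d.references).intersect(d.child.output)
            Project(cols, d.copy(child = Project(required, d.child)))
          case other => other
        }

        // Stand-in for the physical operator looking up the event time column.
        // Option.get on None throws java.util.NoSuchElementException.
        def eventTimeColumn(plan: Plan): String = plan match {
          case Project(_, d: Dedup) => d.child.output.find(_ == d.eventTime).get
          case _ => throw new IllegalArgumentException("expected Project over Dedup")
        }

        // Assumed shape of the query: select "_1" on top of a dedup keyed on "_1".
        def query(declareEventTime: Boolean): Plan =
          Project(Set("_1"),
            Dedup(Set("_1"), "timestamp", declareEventTime,
              Source(Set("_1", "_2", "timestamp"))))
      }
      ```

      In this sketch, `PruningSketch.eventTimeColumn(PruningSketch.prune(PruningSketch.query(declareEventTime = false)))` fails with a `java.util.NoSuchElementException`, while the same call with `declareEventTime = true` finds `timestamp`, because the column then survives pruning.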

          People

            liviazhu-db Livia Zhu
