Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-34545

PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

    XMLWordPrintableJSON

    Details

      Description

      Python UDF returns inconsistent results between evaluating 2 columns together and evaluating one by one.

      The issue occurs after we upgrading to spark3, so seems it doesn't exist in spark2.

      How to reproduce it?

      df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")])], ['c1', 'c2'])
      
      from pyspark.sql.functions import udf
      from pyspark.sql.types import *
      
      def getLastElementWithTimeMaster(data_type):
          def getLastElementWithTime(list_elm):
              # x should be a list of (val, time)
              y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
              return y[-1][0]
          return udf(getLastElementWithTime, data_type)
      
      # Add 2 columns whcih apply Python UDF
      df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
      df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
      
      # Show the results
      df.select("c3").show()
      df.select("c4").show()
      df.select("c3", "c4").show()
      

      Results:

      >>> df.select("c3").show()
      +---+                                                                           
      | c3|
      +---+
      |1.0|
      |2.0|
      |3.1|
      +---+
      >>> df.select("c4").show()
      +---+
      | c4|
      +---+
      |  1|
      |  2|
      |  3|
      +---+
      >>> df.select("c3", "c4").show()
      +---+----+
      | c3|  c4|
      +---+----+
      |1.0|null|
      |2.0|null|
      |3.1|   3|
      +---+----+
      

      The test was done in branch-3.1 local mode.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                petertoth Peter Toth
                Reporter:
                Baohe Zhang Baohe Zhang
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: