Spark / SPARK-40835

to_utc_timestamp creates null column


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      We are in the process of migrating our PySpark applications from Spark version 3.1.2 to Spark version 3.2.0. 

      This bug is present in version 3.2.0. We do not see this issue in version 3.1.2.

       

      Minimal example to reproduce bug

Below is a minimal example of applying to_utc_timestamp() to a string column containing timestamp data:

# Source data: timestamp_field is a plain string column
columns = ["id", "timestamp_field"]
data = [("1", "2022-10-17T00:00:00+0000"), ("2", "2022-10-17T00:00:00+0000")]
source_df = spark.createDataFrame(data).toDF(*columns)
source_df.createOrReplaceTempView("source")
print("Source:")
source_df.show()  # show() prints the table itself and returns None

# Execute query
query = """
SELECT
    id,
    timestamp_field AS original,
    to_utc_timestamp(timestamp_field, 'UTC') AS received_timestamp
FROM source
"""
df = spark.sql(query)
print("Transformed:")
df.show()
print(df.count())

      Post Execution

The source data has a column called timestamp_field, which is a string type.

      Source:
      +---+--------------------+                                                      
      | id|     timestamp_field|
      +---+--------------------+
      |  1|2022-10-17T00:00:...|
      |  2|2022-10-17T00:00:...|
      +---+--------------------+
      

The query applies to_utc_timestamp() to timestamp_field to create a new column; every value in the new column is null.

      Transformed:
      +---+--------------------+------------------+
      | id|            original|received_timestamp|
      +---+--------------------+------------------+
      |  1|2022-10-16T00:00:...|              null|
      |  2|2022-10-16T00:00:...|              null|
      +---+--------------------+------------------+ 
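For what it's worth, the input strings themselves appear well formed: they carry an explicit numeric UTC offset and parse cleanly with a stdlib offset pattern, which suggests the nulls come from a parsing behavior change in Spark rather than from bad data. A quick plain-Python check (not part of the original report):

```python
from datetime import datetime

# "%z" accepts numeric offsets such as "+0000"
dt = datetime.strptime("2022-10-17T00:00:00+0000", "%Y-%m-%dT%H:%M:%S%z")

# The string resolves to midnight UTC on 2022-10-17
assert dt.utcoffset().total_seconds() == 0
print(dt.isoformat())
```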

       

      Questions

• Did the to_utc_timestamp function change in Spark 3.2.0? We do not see this issue in Spark 3.1.2.
• Can we apply any Spark settings to resolve this?
• Is there a new preferred function in Spark 3.2.0 that replaces to_utc_timestamp?

       

People

    Assignee: Unassigned
    Reporter: Rohan Barman (rbarman)
