Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 3.2.0
- Fix Version/s: None
- Component/s: None
Description
We are migrating our PySpark applications from Spark 3.1.2 to Spark 3.2.0.
This bug is present in 3.2.0; we do not see it in 3.1.2.
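Based on the duplicate linked under Issue Links (SPARK-37067, "DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon"), our working assumption is that the regression is in the implicit string-to-timestamp cast when the zone offset is written without a colon (+0000 versus +00:00). Before the full repro below, here is a smaller sketch that isolates just that cast; the offset-without-colon hypothesis is an assumption taken from the linked issue, not something we have verified in Spark's source:

# Sketch isolating the implicit string-to-timestamp cast (assumes an active
# SparkSession named `spark`, as in the repro below).
from pyspark.sql.functions import col

cast_df = spark.createDataFrame(
    [("2022-10-17T00:00:00+0000",), ("2022-10-17T00:00:00+00:00",)],
    ["ts_string"],
)
# If the hypothesis holds, on 3.2.0 the +0000 row casts to null while the
# +00:00 row casts to a valid timestamp; on 3.1.2 both should succeed.
cast_df.select(
    col("ts_string"),
    col("ts_string").cast("timestamp").alias("casted"),
).show(truncate=False)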
Minimal example to reproduce the bug
Below is a minimal example that applies to_utc_timestamp() to a string column containing timestamp data.
from pyspark.sql.functions import *

# Source data: timestamp strings whose UTC offset has no colon (+0000)
columns = ["id", "timestamp_field"]
data = [("1", "2022-10-17T00:00:00+0000"), ("2", "2022-10-17T00:00:00+0000")]
source_df = spark.createDataFrame(data).toDF(*columns)
source_df.createOrReplaceTempView("source")
print("Source:")
source_df.show()

# Execute query: to_utc_timestamp() implicitly casts the string to timestamp
query = """
    SELECT
        id,
        timestamp_field AS original,
        to_utc_timestamp(timestamp_field, 'UTC') AS received_timestamp
    FROM source
"""
df = spark.sql(query)
print("Transformed:")
df.show()
print(df.count())
Post Execution
The source data has a column called timestamp_field, which is of string type.
Source:
+---+--------------------+
| id|     timestamp_field|
+---+--------------------+
|  1|2022-10-17T00:00:...|
|  2|2022-10-17T00:00:...|
+---+--------------------+
The query applies to_utc_timestamp() to timestamp_field to create a new column. The new column is null for every row.
Transformed:
+---+--------------------+------------------+
| id|            original|received_timestamp|
+---+--------------------+------------------+
|  1|2022-10-16T00:00:...|              null|
|  2|2022-10-16T00:00:...|              null|
+---+--------------------+------------------+
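If the root cause is indeed the offset format, one possible workaround is to normalize the offset to include a colon before applying to_utc_timestamp(). This is a sketch under that assumption, not a verified fix:

# Workaround sketch (assumes the 3.2.0 parser rejects offsets without a
# colon): rewrite a trailing +hhmm/-hhmm offset as +hh:mm/-hh:mm first.
from pyspark.sql.functions import regexp_replace, to_utc_timestamp

fixed_df = source_df.withColumn(
    "received_timestamp",
    to_utc_timestamp(
        regexp_replace("timestamp_field", r"([+-]\d{2})(\d{2})$", "$1:$2"),
        "UTC",
    ),
)
fixed_df.show(truncate=False)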
Questions
- Did the to_utc_timestamp function change in Spark 3.2.0? We don't see this issue in Spark 3.1.2.
- Are there any Spark settings we can apply to resolve this?
- Is there a new preferred function in Spark 3.2.0 that replaces to_utc_timestamp? (See the sketch after this list.)
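Regarding the last question: we are not aware of a replacement function, but one way to sidestep the implicit cast entirely is to parse with an explicit pattern via to_timestamp(). A sketch, assuming the 'Z' pattern letter matches colon-less offsets such as +0000 (per Spark's datetime pattern documentation):

# Sketch: parse explicitly instead of relying on the implicit cast.
# Assumes the 'Z' pattern letter accepts offsets written as +0000.
from pyspark.sql.functions import to_timestamp, to_utc_timestamp

parsed_df = source_df.withColumn(
    "received_timestamp",
    to_utc_timestamp(
        to_timestamp("timestamp_field", "yyyy-MM-dd'T'HH:mm:ssZ"),
        "UTC",
    ),
)
parsed_df.show(truncate=False)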
Issue Links
- duplicates SPARK-37067: DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon (Resolved)