Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.3.1
- Fix Version/s: None
Description
The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described here. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used instead.
This can be checked with the following code snippet:
import pyspark.sql

spark = (
    pyspark.sql.SparkSession.builder
    .master('local[1]')
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0, 0])
print(df.collect()[0][0])
For me this prints the following (the exact result depends on your system's timezone; mine is Europe/Berlin):

2018-06-01 01:00:00
2018-06-01 03:00:00
Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.
The cause of this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class do not take the setting `spark.sql.session.timeZone` into account and instead use the system timezone.
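To make the suspected cause concrete, here is a minimal sketch, not PySpark's actual implementation: Spark stores timestamps internally as microseconds since the Unix epoch, and converting that value to a naive `datetime` via `datetime.fromtimestamp` always applies the system timezone. The epoch value and the hard-coded UTC zone below are assumptions chosen to match the example above.

from datetime import datetime, timezone

# Internal representation (assumed for illustration):
# microseconds since the Unix epoch for 2018-06-01 01:00:00 UTC.
ts_micros = 1527814800000000

# Roughly what `fromInternal` does today: `fromtimestamp` without a
# tz argument interprets the epoch value in the *system* timezone,
# so a Europe/Berlin machine prints 2018-06-01 03:00:00.
system_dt = datetime.fromtimestamp(ts_micros // 1000000).replace(
    microsecond=ts_micros % 1000000)
print(system_dt)

# What a session-timezone-aware conversion could look like, with the
# session timezone assumed to be UTC; an arbitrary IANA zone would
# need pytz or zoneinfo instead of the stdlib `timezone.utc`.
session_dt = datetime.fromtimestamp(
    ts_micros // 1000000, tz=timezone.utc
).replace(microsecond=ts_micros % 1000000, tzinfo=None)
print(session_dt)  # 2018-06-01 01:00:00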
If the maintainers agree that this should be fixed, I would try to come up with a patch.
Issue Links
- is cloned by SPARK-32123: [Python] Setting `spark.sql.session.timeZone` only partially respected (In Progress)