Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.3.1
- Fix Version/s: None
Description
The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described here. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used instead.
This can be checked with the following code snippet:
import pyspark.sql

spark = (
    pyspark.sql.SparkSession.builder
    .master('local[1]')
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0, 0])
print(df.collect()[0][0])
For me this prints the following (the exact result depends on your system's timezone; mine is Europe/Berlin):

2018-06-01 01:00:00
2018-06-01 03:00:00
Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.
The cause of this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class do not take the setting `spark.sql.session.timeZone` into account and instead use the system timezone.
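To make the suspected cause concrete, here is a minimal sketch, not PySpark's actual implementation: Spark stores timestamps internally as microseconds since the Unix epoch, and converting that value to a naive `datetime` via `datetime.fromtimestamp` always applies the system timezone. The epoch value and the hard-coded UTC zone below are assumptions chosen to match the example above.

from datetime import datetime, timezone

# Internal representation (assumed for illustration):
# microseconds since the Unix epoch for 2018-06-01 01:00:00 UTC.
ts_micros = 1527814800000000

# Roughly what `fromInternal` does today: `fromtimestamp` without a
# tz argument interprets the epoch value in the *system* timezone,
# so a Europe/Berlin machine prints 2018-06-01 03:00:00.
system_dt = datetime.fromtimestamp(ts_micros // 1000000).replace(
    microsecond=ts_micros % 1000000)
print(system_dt)

# What a session-timezone-aware conversion could look like, with the
# session timezone assumed to be UTC; an arbitrary IANA zone would
# need pytz or zoneinfo instead of the stdlib `timezone.utc`.
session_dt = datetime.fromtimestamp(
    ts_micros // 1000000, tz=timezone.utc
).replace(microsecond=ts_micros % 1000000, tzinfo=None)
print(session_dt)  # 2018-06-01 01:00:00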
If the maintainers agree that this should be fixed, I would try to come up with a patch.
Issue Links
- is cloned by SPARK-32123: [Python] Setting `spark.sql.session.timeZone` only partially respected (In Progress)