Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version: 2.3.0
Description
When converting between a Pandas DataFrame/Series and a Spark DataFrame using toPandas() or pandas udfs, timestamp values respect the Python system timezone instead of the session timezone.
For example, let's say we use "America/Los_Angeles" as the session timezone and have a timestamp value "1970-01-01 00:00:01" in that timezone. I'm in Japan, so the Python system timezone is "Asia/Tokyo".
The current toPandas() returns the following timestamp value:
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
>>> df.show()
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:01|
+-------------------+
>>> df.toPandas()
                   ts
0 1970-01-01 17:00:01
As you can see, the value becomes "1970-01-01 17:00:01" because toPandas() respects the Python system timezone ("Asia/Tokyo", UTC+9) rather than the session timezone.
As we discussed in https://github.com/apache/spark/pull/18664, we consider this behavior a bug; the value should be "1970-01-01 00:00:01", matching the session timezone.
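The arithmetic behind the two values can be checked without Spark. The following is only a minimal sketch using the Python standard library, not the code path Spark uses: the helper `wall_clock` and the constant names are hypothetical, and fixed UTC offsets stand in for the named zones (America/Los_Angeles was at UTC-8 and Asia/Tokyo at UTC+9 on 1970-01-01).

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for the named zones on 1970-01-01.
SESSION_TZ = timezone(timedelta(hours=-8), "America/Los_Angeles")  # session timezone
PYTHON_TZ = timezone(timedelta(hours=9), "Asia/Tokyo")             # Python system timezone

EPOCH_SECONDS = 28801  # the long value used in the example above


def wall_clock(epoch_seconds, tz):
    """Return the naive wall-clock time of an epoch instant as seen in tz (hypothetical helper)."""
    return datetime.fromtimestamp(epoch_seconds, tz).replace(tzinfo=None)


expected = wall_clock(EPOCH_SECONDS, SESSION_TZ)  # what df.show() displays
reported = wall_clock(EPOCH_SECONDS, PYTHON_TZ)   # what toPandas() currently returns

print(expected)             # 1970-01-01 00:00:01
print(reported)             # 1970-01-01 17:00:01
print(reported - expected)  # 17:00:00, the offset difference between the two zones
```

The same instant yields both strings; only the timezone used to render it differs, which is why the fix is to render with the session timezone.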
Issue Links
- is related to SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone (Resolved)
- links to