[SPARK-22395] Fix the behavior of timestamp values for Pandas to respect session timezone - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0
Component/s: PySpark, SQL
Labels:
- release-notes

Description

When converting between a Pandas DataFrame/Series and a Spark DataFrame using toPandas() or pandas UDFs, timestamp values respect the Python system timezone instead of the session timezone.

For example, suppose the session timezone is "America/Los_Angeles" and we have the timestamp value "1970-01-01 00:00:01" in that timezone. I'm in Japan, so the Python system timezone is "Asia/Tokyo".

With the current toPandas(), the timestamp value comes out as follows:

>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
>>> df.show()
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:01|
+-------------------+

>>> df.toPandas()
                   ts
0 1970-01-01 17:00:01

As you can see, the value becomes "1970-01-01 17:00:01" because it respects the Python system timezone.

As discussed in https://github.com/apache/spark/pull/18664, we consider this behavior a bug; the value should be "1970-01-01 00:00:01".
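The 17-hour difference can be reproduced without Spark. The following is a minimal sketch using only the Python standard library, not the code from the fix. It renders the same epoch value (28801 seconds) as a wall-clock time in each timezone. The helper name wall_clock is hypothetical, and the fixed UTC offsets (-8 hours for Los Angeles, +9 hours for Tokyo on 1970-01-01) are assumptions used in place of named zones so that no timezone database is required.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC offsets that applied on 1970-01-01, used instead of
# named zones so this sketch does not depend on a tz database.
LOS_ANGELES = timezone(timedelta(hours=-8), "America/Los_Angeles")
TOKYO = timezone(timedelta(hours=9), "Asia/Tokyo")


def wall_clock(epoch_seconds, tz):
    """Render an epoch instant as a wall-clock string in the given timezone.

    Hypothetical helper for illustration; not part of Spark or pandas.
    """
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


# The same instant, seen through the session timezone and the Python timezone.
session_view = wall_clock(28801, LOS_ANGELES)
python_view = wall_clock(28801, TOKYO)

print(session_view)  # what df.show() displays
print(python_view)   # what toPandas() returned before the fix
```

Both strings describe the same instant; only the timezone used for rendering differs, which is why the fix is about which timezone the conversion consults rather than about the stored value.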

Issue Links

is related to

SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone

Resolved

links to

[Github] Pull Request #19607 (ueshin)

[Github] Pull Request #19674 (HyukjinKwon)

People

Assignee: Takuya Ueshin

Reporter: Takuya Ueshin

Votes: 0

Watchers: 3

Dates

Created: 30/Oct/17 04:07

Updated: 28/Nov/17 13:47

Resolved: 28/Nov/17 08:46