Spark / SPARK-23360

SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      import datetime
      import pandas as pd
      import os
      
      dt = [datetime.datetime(2015, 10, 31, 22, 30)]
      pdf = pd.DataFrame({'time': dt})
      
      os.environ['TZ'] = 'America/New_York'
      
      df1 = spark.createDataFrame(pdf)
      df1.show()
      
      +-------------------+
      |               time|
      +-------------------+
      |2015-10-31 21:30:00|
      +-------------------+
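One detail worth noting about the repro: on Unix, assigning to `os.environ['TZ']` only changes the C library's notion of local time once `time.tzset()` is called. A minimal check of that behavior (assuming a Unix system with the tz database installed):

```python
import os
import time

os.environ['TZ'] = 'America/New_York'
time.tzset()  # without this call, the TZ change may not take effect

# After tzset(), the process reports Eastern time zone names.
assert time.tzname == ('EST', 'EDT')
```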
      

      Seems to be related to this line here:

      https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776

      It appears to be an issue with "tzlocal()".
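      The conversion goes from the session time zone to the system-local zone obtained from dateutil's tzlocal(). A simplified, hypothetical sketch of that kind of helper (the function name and structure are illustrative, not the actual pyspark.sql.types source):

```python
import pandas as pd
from dateutil.tz import tzlocal


def convert_timestamps_to_local(s, from_tz):
    """Sketch only: localize tz-naive values to from_tz, convert to the
    system-local zone via tzlocal(), then drop the tz info again."""
    return s.apply(
        lambda ts: ts.tz_localize(from_tz, ambiguous=False)
                     .tz_convert(tzlocal())
                     .tz_localize(None)
        if ts is not pd.NaT else pd.NaT)


s = pd.Series([pd.Timestamp(2015, 10, 31, 22, 30)])
out = convert_timestamps_to_local(s, "America/New_York")
# The result stays tz-naive; the wall-clock value depends on the
# machine's local zone, which is exactly the problem reported here.
assert str(out.dtype) == 'datetime64[ns]'
```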

      Wrong:

      from_tz = "America/New_York"
      to_tz = "tzlocal()"
      
      s.apply(
          lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
          if ts is not pd.NaT else pd.NaT)
      
      0   2015-10-31 21:30:00
      Name: time, dtype: datetime64[ns]
      

      Correct:

      from_tz = "America/New_York"
      to_tz = "America/New_York"
      
      s.apply(
          lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
          if ts is not pd.NaT else pd.NaT)
      
      0   2015-10-31 22:30:00
      Name: time, dtype: datetime64[ns]
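      The wrong/correct comparison above can be reproduced deterministically without depending on the machine's local zone. Here America/Chicago stands in for a mismatched tzlocal() result (an illustrative choice, not what Spark actually resolves):

```python
import pandas as pd

s = pd.Series([pd.Timestamp(2015, 10, 31, 22, 30)])


def roundtrip(from_tz, to_tz):
    """Localize naive timestamps to from_tz, convert to to_tz, drop tz info."""
    return s.apply(
        lambda ts: ts.tz_localize(from_tz, ambiguous=False)
                     .tz_convert(to_tz)
                     .tz_localize(None)
        if ts is not pd.NaT else pd.NaT)


# Converting through a different zone shifts the wall-clock time by the
# offset difference (EDT is UTC-4 and CDT is UTC-5 on 2015-10-31):
shifted = roundtrip("America/New_York", "America/Chicago")
assert shifted[0] == pd.Timestamp(2015, 10, 31, 21, 30)

# Round-tripping through the same zone preserves the original value:
same = roundtrip("America/New_York", "America/New_York")
assert same[0] == pd.Timestamp(2015, 10, 31, 22, 30)
```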
      

            People

            • Assignee: ueshin Takuya Ueshin
            • Reporter: icexelloss Li Jin
            • Votes: 0
            • Watchers: 4
