Spark / SPARK-23360

SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      import datetime
      import pandas as pd
      import os
      
      dt = [datetime.datetime(2015, 10, 31, 22, 30)]
      pdf = pd.DataFrame({'time': dt})
      
      os.environ['TZ'] = 'America/New_York'
      
      df1 = spark.createDataFrame(pdf)
      df1.show()
      
      +-------------------+
      |               time|
      +-------------------+
      |2015-10-31 21:30:00|
      +-------------------+
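One detail worth noting about the repro: on Unix, assigning to `os.environ['TZ']` only changes the C library's notion of local time once `time.tzset()` is called. A minimal check of that behavior (assuming a Unix system with the tz database installed):

```python
import os
import time

os.environ['TZ'] = 'America/New_York'
time.tzset()  # without this call, the TZ change may not take effect

# After tzset(), the process reports Eastern time zone names.
assert time.tzname == ('EST', 'EDT')
```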
      

      Seems to be related to this line here:

      https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776

      It appears to be an issue with "tzlocal()".
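      The conversion goes from the session time zone to the system-local zone obtained from dateutil's tzlocal(). A simplified, hypothetical sketch of that kind of helper (the function name and structure are illustrative, not the actual pyspark.sql.types source):

```python
import pandas as pd
from dateutil.tz import tzlocal


def convert_timestamps_to_local(s, from_tz):
    """Sketch only: localize tz-naive values to from_tz, convert to the
    system-local zone via tzlocal(), then drop the tz info again."""
    return s.apply(
        lambda ts: ts.tz_localize(from_tz, ambiguous=False)
                     .tz_convert(tzlocal())
                     .tz_localize(None)
        if ts is not pd.NaT else pd.NaT)


s = pd.Series([pd.Timestamp(2015, 10, 31, 22, 30)])
out = convert_timestamps_to_local(s, "America/New_York")
# The result stays tz-naive; the wall-clock value depends on the
# machine's local zone, which is exactly the problem reported here.
assert str(out.dtype) == 'datetime64[ns]'
```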

      Wrong:

      from_tz = "America/New_York"
      to_tz = "tzlocal()"
      
      s.apply(
          lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
          if ts is not pd.NaT else pd.NaT)
      
      0   2015-10-31 21:30:00
      Name: time, dtype: datetime64[ns]
      

      Correct:

      from_tz = "America/New_York"
      to_tz = "America/New_York"
      
      s.apply(
          lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
          if ts is not pd.NaT else pd.NaT)
      
      0   2015-10-31 22:30:00
      Name: time, dtype: datetime64[ns]
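      The wrong/correct comparison above can be reproduced deterministically without depending on the machine's local zone. Here America/Chicago stands in for a mismatched tzlocal() result (an illustrative choice, not what Spark actually resolves):

```python
import pandas as pd

s = pd.Series([pd.Timestamp(2015, 10, 31, 22, 30)])


def roundtrip(from_tz, to_tz):
    """Localize naive timestamps to from_tz, convert to to_tz, drop tz info."""
    return s.apply(
        lambda ts: ts.tz_localize(from_tz, ambiguous=False)
                     .tz_convert(to_tz)
                     .tz_localize(None)
        if ts is not pd.NaT else pd.NaT)


# Converting through a different zone shifts the wall-clock time by the
# offset difference (EDT is UTC-4 and CDT is UTC-5 on 2015-10-31):
shifted = roundtrip("America/New_York", "America/Chicago")
assert shifted[0] == pd.Timestamp(2015, 10, 31, 21, 30)

# Round-tripping through the same zone preserves the original value:
same = roundtrip("America/New_York", "America/New_York")
assert same[0] == pd.Timestamp(2015, 10, 31, 22, 30)
```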
      

            People

            • Assignee: ueshin Takuya Ueshin
            • Reporter: icexelloss Li Jin
            • Votes: 0
            • Watchers: 4
