[SPARK-23360] SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels: None

    Description

      import datetime
      import pandas as pd
      import os
      
      dt = [datetime.datetime(2015, 10, 31, 22, 30)]
      pdf = pd.DataFrame({'time': dt})
      
      os.environ['TZ'] = 'America/New_York'
      
      # `spark` is the active SparkSession (e.g. a pyspark shell started in this environment)
      df1 = spark.createDataFrame(pdf)
      df1.show()
      
      +-------------------+
      |               time|
      +-------------------+
      |2015-10-31 21:30:00|
      +-------------------+
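
      For comparison (an assumption on my part, based on the title's "non-Arrow codepath", not something tested in the original report): the same repro with Arrow enabled should keep the input value. A hedged sketch, assuming Spark 2.3 with pyarrow available:

      # Assumption: Spark 2.3 with pyarrow installed; config name as of Spark 2.3.
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")

      df2 = spark.createDataFrame(pdf)
      df2.show()

      # Expected (input value preserved):
      # +-------------------+
      # |               time|
      # +-------------------+
      # |2015-10-31 22:30:00|
      # +-------------------+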
      

      This seems to be related to this line here:

      https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776

      It appears to be an issue with "tzlocal()".
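
      For context (my reading, not part of the original report): pandas seems to resolve the literal string "tzlocal()" to dateutil's tzlocal(), i.e. the timezone of the current Python process rather than anything Spark-specific. A minimal pandas-only sketch of that equivalence:

      import pandas as pd
      from dateutil.tz import tzlocal

      ts = pd.Timestamp('2015-10-31 22:30')

      # Both calls convert to the process-local timezone; pandas maps the
      # string 'tzlocal()' to dateutil.tz.tzlocal() internally.
      via_string = ts.tz_localize('America/New_York', ambiguous=False).tz_convert('tzlocal()')
      via_object = ts.tz_localize('America/New_York', ambiguous=False).tz_convert(tzlocal())
      assert via_string == via_object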

      Wrong:

      s = pdf['time']  # the 'time' column from the repro above
      from_tz = "America/New_York"
      to_tz = "tzlocal()"   # the string the non-Arrow codepath falls back to

      s.apply(lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
              if ts is not pd.NaT else pd.NaT)
      
      0   2015-10-31 21:30:00
      Name: time, dtype: datetime64[ns]
      

      Correct:

      from_tz = "America/New_York"
      to_tz = "America/New_York"
      
      s.apply(lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
              if ts is not pd.NaT else pd.NaT)
      
      0   2015-10-31 22:30:00
      Name: time, dtype: datetime64[ns]
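
      One further observation of mine (not from the original report): the wrong and correct results differ by exactly the EDT-vs-EST offset, and the input instant falls only a few hours before the 2015-11-01 fall-back transition in America/New_York, which is where a process-local timezone object and the named zone can disagree. A small pandas-only check of the arithmetic:

      import pandas as pd

      # The instant in question, expressed in UTC.
      utc_instant = (pd.Timestamp('2015-10-31 22:30')
                     .tz_localize('America/New_York', ambiguous=False)
                     .tz_convert('UTC'))
      print(utc_instant)                                  # 2015-11-01 02:30:00+00:00

      # Correct rendering: New York is still on EDT (UTC-4) at that instant.
      print(utc_instant.tz_convert('America/New_York'))   # 2015-10-31 22:30:00-04:00

      # The wrong result matches a fixed UTC-5 (EST) interpretation of the same instant.
      print(utc_instant.tz_convert('Etc/GMT+5'))          # 2015-10-31 21:30:00-05:00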
      

          People

            Assignee: Takuya Ueshin (ueshin)
            Reporter: Li Jin (icexelloss)
            Votes: 0
            Watchers: 3
