  1. Spark
  2. SPARK-44717

"pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0, 3.4.1, 4.0.0
    • Fix Version/s: 3.5.0, 4.0.0
    • Component/s: Pandas API on Spark
    • Labels: None

    Description

      Use one of the existing tests:

      • "11H" case of test_dataframe_resample (pyspark.pandas.tests.test_resample.ResampleTests)
      • "1001H" case of test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)

      After setting the time zone to, for example, New York (e.g. by adding the following Python code to a "setUpClass"):

      os.environ["TZ"] = 'America/New_York'
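A minimal sketch of how that override could be wrapped in a test class. The class name NewYorkTimeZoneTestCase is hypothetical and not part of the Spark test suite; the call to time.tzset() is an addition that only exists on Unix. Whether a Spark session picks up the value may depend on when the JVM is launched relative to this code.

```python
import os
import time
import unittest


class NewYorkTimeZoneTestCase(unittest.TestCase):
    """Hypothetical base class; the name is illustrative only."""

    @classmethod
    def setUpClass(cls):
        # Remember the previous value so it can be restored afterwards.
        cls._old_tz = os.environ.get("TZ")
        os.environ["TZ"] = "America/New_York"
        # time.tzset() is Unix-only; it makes this Python process re-read TZ.
        if hasattr(time, "tzset"):
            time.tzset()

    @classmethod
    def tearDownClass(cls):
        # Restore the original environment so other tests are unaffected.
        if cls._old_tz is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = cls._old_tz
        if hasattr(time, "tzset"):
            time.tzset()
```

Restoring the variable in tearDownClass keeps the override from leaking into unrelated test classes that run in the same process.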

      You will get the following error for the latter test:

      ======================================================================
      FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample
          self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
        File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample
          self.assert_eq(
        File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq
          _assert_pandas_almost_equal(lobj, robj)
        File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal
          raise PySparkAssertionError(
      pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
      Left:
      Freq: 1001H
      float64
      Right:
      float64
      

      The problem is that the pyspark resample produces more resampled rows in the result. The DST change causes those extra rows, as the computed __tmp_resample_bin_col__ looks something like:

      |__index_level_0__  |__tmp_resample_bin_col__|A                   |
      .....
      |2011-03-08 00:00:00|2011-03-26 11:00:00     |0.3980551570183919  |
      |2011-03-09 00:00:00|2011-03-26 11:00:00     |0.6511376673995046  |
      |2011-03-10 00:00:00|2011-03-26 11:00:00     |0.6141085426890365  |
      |2011-03-11 00:00:00|2011-03-26 11:00:00     |0.11557638066163867 |
      |2011-03-12 00:00:00|2011-03-26 11:00:00     |0.4517788243490799  |
      |2011-03-13 00:00:00|2011-03-26 11:00:00     |0.8637060550157284  |
      |2011-03-14 00:00:00|2011-03-26 10:00:00     |0.8169499149450166  |
      |2011-03-15 00:00:00|2011-03-26 10:00:00     |0.4585916249356583  |
      |2011-03-16 00:00:00|2011-03-26 10:00:00     |0.8362472880832088  |
      |2011-03-17 00:00:00|2011-03-26 10:00:00     |0.026716901748386812|
      |2011-03-18 00:00:00|2011-03-26 10:00:00     |0.9086816462089563  |
      

      The bin value shifts by one hour around 2011-03-13, when DST kicked in in New York, and that shift produces the extra rows.
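The one-hour shift can be illustrated without Spark. The sketch below is not taken from the resample implementation; it only shows, using the standard library (Python 3.9+ with a time zone database available), that the same two wall-clock readings are 24 hours apart when treated as naive timestamps but 23 hours apart when interpreted as instants in New York. The helper names are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def wall_clock_hours(start: datetime, end: datetime) -> float:
    """Hours between two naive timestamps, ignoring any time zone."""
    return (end - start).total_seconds() / 3600


def elapsed_hours(start: datetime, end: datetime, tz: ZoneInfo) -> float:
    """Hours between the same readings interpreted as instants in tz."""
    # Convert to UTC first: subtracting two datetimes that share a tzinfo
    # object would otherwise ignore the UTC offsets.
    start_utc = start.replace(tzinfo=tz).astimezone(timezone.utc)
    end_utc = end.replace(tzinfo=tz).astimezone(timezone.utc)
    return (end_utc - start_utc).total_seconds() / 3600


before = datetime(2011, 3, 13)  # local midnight before the DST switch
after = datetime(2011, 3, 14)   # local midnight after the DST switch

print(wall_clock_hours(before, after))         # 24.0
print(elapsed_hours(before, after, NEW_YORK))  # 23.0
```

Any bin computation that mixes the two views can therefore land rows before and after the switch on bin labels that differ by one hour.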

      Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.

      You can see my tests here:
      https://github.com/attilapiros/spark/pull/5

      Pandas timestamps are time-zone naive by default:

      >>> import pandas as pd
      >>> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
      >>> b = pd.Timedelta(hours=1)
      >>> a
      Timestamp('2011-03-13 01:00:00')
      >>> a+b
      Timestamp('2011-03-13 02:00:00')
      >>> a+b+b
      Timestamp('2011-03-13 03:00:00')
      

      But the pyspark TimestampType is time-zone aware and applies DST:

      >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
      +-------------------------------+
      |TIMESTAMP '2011-03-13 01:00:00'|
      +-------------------------------+
      |            2011-03-13 01:00:00|
      +-------------------------------+
      
      >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
      +--------------------------------------------------------------------+
      |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
      +--------------------------------------------------------------------+
      |                                                 2011-03-13 03:00:00|
      +--------------------------------------------------------------------+
      

      The current resample code uses the interval-based calculation shown above.
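The difference between the two kinds of addition can be reproduced with the standard library alone. This is a sketch under assumptions, not the Spark or pandas implementation: add_hours_naive and add_hours_in_zone are hypothetical helper names, and the zone-aware variant assumes the addition happens on the underlying instant (Python 3.9+ with a time zone database available).

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def add_hours_naive(ts: datetime, hours: int) -> datetime:
    """Wall-clock addition, as with a time-zone-naive timestamp."""
    return ts + timedelta(hours=hours)


def add_hours_in_zone(ts: datetime, hours: int, tz_name: str) -> datetime:
    """Treat ts as local time in tz_name, add real elapsed hours,
    and return the resulting local wall-clock reading."""
    tz = ZoneInfo(tz_name)
    # Move to UTC so the addition is done on the instant, not on the
    # local wall-clock fields.
    instant = ts.replace(tzinfo=tz).astimezone(timezone.utc)
    shifted = instant + timedelta(hours=hours)
    return shifted.astimezone(tz).replace(tzinfo=None)


start = datetime(2011, 3, 13, 1)
print(add_hours_naive(start, 1))                        # 2011-03-13 02:00:00
print(add_hours_in_zone(start, 1, "America/New_York"))  # 2011-03-13 03:00:00
```

The first result matches the pandas output above and the second matches the Spark SQL output, since 02:00 local time does not exist on that date in New York.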

          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Attila Zsolt Piros (attilapiros)