Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.4.0, 3.4.1, 4.0.0
- Fix Version/s: None
Description
Use one of the existing tests:
- "11H" case of test_dataframe_resample (pyspark.pandas.tests.test_resample.ResampleTests)
- "1001H" case of test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
Set the TZ, for example to New York, e.g. with the following Python code in a `setUpClass`:

```python
os.environ["TZ"] = 'America/New_York'
```
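For reference, a minimal sketch of such a setup (the test-class name is hypothetical, and on Unix `time.tzset()` is typically needed for the environment change to take effect in the running process):

```python
import os
import time
import unittest


class ResampleWithTZTest(unittest.TestCase):  # hypothetical class, for illustration
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Must run before the JVM / SparkSession starts so the JVM also
        # picks up the new default time zone (assumption).
        os.environ["TZ"] = "America/New_York"
        time.tzset()  # Unix-only: re-read TZ from the environment
```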
You will then get this error for the latter test:
```
======================================================================
FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample
    self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample
    self.assert_eq(
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq
    _assert_pandas_almost_equal(lobj, robj)
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
Left:
Freq: 1001H
float64
Right:
float64
```
The problem is that the pyspark resample produces more resampled rows in the result. The DST change causes those extra rows, as the computed `__tmp_resample_bin_col__` looks like this:
```
|__index_level_0__  |__tmp_resample_bin_col__|A                   |
|2011-03-08 00:00:00|2011-03-26 11:00:00     |0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00     |0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00     |0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00     |0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00     |0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00     |0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00     |0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00     |0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00     |0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00     |0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00     |0.9086816462089563  |
```
You can see the extra rows appear right where DST kicked in on 2011-03-13 in New York: the bin value shifts from 2011-03-26 11:00:00 to 2011-03-26 10:00:00.
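For comparison, a pandas-only sketch (synthetic values, not the data above): pandas bins on naive wall-clock timestamps, so the 2011-03-13 DST transition does not move any bin boundary, regardless of the OS time zone.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-03-08", "2011-03-18", freq="D")
s = pd.Series(np.random.rand(len(idx)), index=idx)

# The bin edges are pure wall-clock arithmetic on naive timestamps;
# nothing shifts around the 2011-03-13 DST change.
print(s.resample("11H", closed="right", label="right").sum())
```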
Even setting the conf `spark.sql.timestampType` to `TIMESTAMP_NTZ` does not help.
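(That was attempted roughly like this; a sketch, assuming an existing `spark` session:)

```python
# Has no effect on the failure: the internal resample bin column is still
# computed with TZ-aware TIMESTAMP + interval arithmetic (see below).
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
```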
You can see my tests here:
https://github.com/attilapiros/spark/pull/5
Pandas timestamps are TZ-less, so adding a `Timedelta` is plain wall-clock arithmetic:
```python
>>> import pandas as pd
>>> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
>>> b = pd.Timedelta(hours=1)
>>> a
Timestamp('2011-03-13 01:00:00')
>>> a + b
Timestamp('2011-03-13 02:00:00')
>>> a + b + b
Timestamp('2011-03-13 03:00:00')
```
But the pyspark `TimestampType` is time-zone aware, so interval arithmetic observes DST:
```python
>>> sql("select TIMESTAMP '2011-03-13 01:00:00'").show()
+-------------------------------+
|TIMESTAMP '2011-03-13 01:00:00'|
+-------------------------------+
|            2011-03-13 01:00:00|
+-------------------------------+

>>> sql("select TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
+--------------------------------------------------------------------+
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
+--------------------------------------------------------------------+
|                                                 2011-03-13 03:00:00|
+--------------------------------------------------------------------+
```
The current resample code uses the above interval-based calculation, which is why the computed bins shift across the DST boundary.
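A small sketch contrasting the two behaviors (an illustration, not the actual fix; assumes a Spark 3.4+ session with the session time zone set to America/New_York):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.session.timeZone", "America/New_York")
    .getOrCreate()
)

# TZ-aware TIMESTAMP: 01:00 + 1 hour skips the nonexistent 02:00 hour -> 03:00:00
spark.sql(
    "select TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)"
).show()

# TIMESTAMP_NTZ: plain wall-clock arithmetic, should print 02:00:00
spark.sql(
    "select TIMESTAMP_NTZ '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)"
).show()
```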