Details
Bug
Status: Resolved
Minor
Resolution: Fixed
3.2.2
None
Pandas: 1.3.X/1.4.X
PySpark: Master
Description
PySpark raises an AnalysisException when shift is called with periods=0.
Pandas instead returns an unchanged copy of the object in this case.
PySpark:
>>> df = ps.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48], 'Col3': [17, 27, 22, 37, 52]}, columns=['Col1', 'Col2', 'Col3'])
>>> df.Col1.shift(periods=3)
22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/08/02 09:37:52 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/08/02 09:37:52 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
0     NaN
1     NaN
2     NaN
3    10.0
4    20.0
Name: Col1, dtype: float64
>>> df.Col1.shift(periods=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/base.py", line 1170, in shift
    return self._shift(periods, fill_value).spark.analyzed
  File "/home/spark/spark/python/pyspark/pandas/spark/accessors.py", line 256, in analyzed
    return first_series(DataFrame(self._data._internal.resolved_copy))
  File "/home/spark/spark/python/pyspark/pandas/utils.py", line 589, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1173, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2073, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Cannot specify window frame for lag function
Pandas:
>>> pdf = pd.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48], 'Col3': [17, 27, 22, 37, 52]}, columns=['Col1', 'Col2', 'Col3'])
>>> pdf.Col1.shift(periods=3)
0     NaN
1     NaN
2     NaN
3    10.0
4    20.0
Name: Col1, dtype: float64
>>> pdf.Col1.shift(periods=0)
0    10
1    20
2    15
3    30
4    45
Name: Col1, dtype: int64
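
The expected semantics shown in the Pandas session can be sketched in plain Python, without Spark or Pandas. This is only an illustration of the behavior the report asks for, not the PySpark implementation or the actual fix: the function name shift here is a hypothetical stand-in, and None is used in place of NaN for the vacated positions. The point is the early return for periods=0, which hands back an unchanged copy instead of building a window expression.

```python
from typing import List, Optional, Sequence


def shift(
    values: Sequence[float],
    periods: int = 1,
    fill_value: Optional[float] = None,
) -> List[Optional[float]]:
    """Shift values by `periods` positions, padding vacated slots with fill_value.

    Hypothetical helper for illustration only; it mimics the observable
    behavior of Series.shift on a plain list.
    """
    n = len(values)
    if periods == 0:
        # Nothing moves, so return an unchanged copy (the behavior Pandas shows).
        return list(values)
    if abs(periods) >= n:
        # Every element is shifted out of range.
        return [fill_value] * n
    pad = [fill_value] * abs(periods)
    if periods > 0:
        # Shift forward: pad at the front, drop the tail.
        return pad + list(values[: n - periods])
    # Shift backward: drop the head, pad at the end.
    return list(values[-periods:]) + pad


col1 = [10, 20, 15, 30, 45]
shifted_by_three = shift(col1, periods=3)
shifted_by_zero = shift(col1, periods=0)
```

With the Col1 data from the report, shifted_by_three has three padded slots followed by 10 and 20, and shifted_by_zero is a new list equal to col1.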