SPARK-39939

shift() function needs to support periods=0


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.2
    • Fix Version/s: 3.4.0
    • Component/s: Pandas API on Spark
    • Labels: None
    • Environment: Pandas: 1.3.X/1.4.X; PySpark: Master

    Description

      PySpark raises an error when shift() is called with periods=0.

      Pandas, by contrast, returns an unshifted copy of the same object.


      PySpark:

      >>> df = ps.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3'])
      >>> df.Col1.shift(periods=3)
      22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:52 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:52 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      0     NaN
      1     NaN
      2     NaN
      3    10.0
      4    20.0
      Name: Col1, dtype: float64
      >>> df.Col1.shift(periods=0)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/spark/spark/python/pyspark/pandas/base.py", line 1170, in shift
          return self._shift(periods, fill_value).spark.analyzed
        File "/home/spark/spark/python/pyspark/pandas/spark/accessors.py", line 256, in analyzed
          return first_series(DataFrame(self._data._internal.resolved_copy))
        File "/home/spark/spark/python/pyspark/pandas/utils.py", line 589, in wrapped_lazy_property
          setattr(self, attr_name, fn(self))
        File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1173, in resolved_copy
          sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
        File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2073, in select
          jdf = self._jdf.select(self._jcols(*cols))
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
          return_value = get_return_value(
        File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
          raise converted from None
      pyspark.sql.utils.AnalysisException: Cannot specify window frame for lag function
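
      The AnalysisException comes from Spark SQL itself: lag is an offset window function, and Spark's analyzer rejects it when the explicitly specified window frame does not match the frame the function expects. Judging from the traceback, pandas-on-Spark derives the shift window frame from periods, which breaks down at periods=0. A minimal sketch that should trigger the same error in plain PySpark (the rowsBetween(0, 0) frame is my guess at what periods=0 would produce, not taken from the pyspark.pandas source):

      >>> from pyspark.sql import functions as F
      >>> from pyspark.sql.window import Window
      >>> sdf = spark.range(5)
      >>> w = Window.orderBy("id").rowsBetween(0, 0)  # frame a periods=0 shift would derive
      >>> sdf.select(F.lag("id", 0).over(w))  # raises AnalysisException: Cannot specify window frame for lag function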

      Pandas:

      >>> pdf = pd.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3'])
      >>> pdf.Col1.shift(periods=3)
      0     NaN
      1     NaN
      2     NaN
      3    10.0
      4    20.0
      Name: Col1, dtype: float64
      >>> pdf.Col1.shift(periods=0)
      0    10
      1    20
      2    15
      3    30
      4    45
      Name: Col1, dtype: int64
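
      One way to align with pandas would be to short-circuit periods=0 before any lag window is built and return an unshifted copy. A minimal sketch against the shift() shown in the traceback (hypothetical, assuming copy() is available on the Series/Index subclass; the patch actually merged may differ):

      # pyspark/pandas/base.py (sketch, not the merged change)
      def shift(self, periods=1, fill_value=None):
          if periods == 0:
              # pandas returns an unshifted copy for shift(0), so skip the
              # lag-over-window path that Spark's analyzer rejects.
              return self.copy()
          return self._shift(periods, fill_value).spark.analyzed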


People

    Assignee: bzhaoop (bo zhao)
    Reporter: bzhaoop (bo zhao)
    Votes: 0
    Watchers: 3
