SPARK-39939

shift() function needs to support periods=0


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.2
    • Fix Version/s: 3.4.0
    • Component/s: Pandas API on Spark
    • Labels: None
    • Environment: Pandas: 1.3.X/1.4.X; PySpark: Master

    Description

      PySpark raises an error when shift() is called with periods=0.

      Pandas, by contrast, returns an unshifted copy of the same object.


      PySpark:

      >>> df = ps.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3'])
      >>> df.Col1.shift(periods=3)
      22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:52 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      22/08/02 09:37:52 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      0     NaN
      1     NaN
      2     NaN
      3    10.0
      4    20.0
      Name: Col1, dtype: float64
      >>> df.Col1.shift(periods=0)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/spark/spark/python/pyspark/pandas/base.py", line 1170, in shift
          return self._shift(periods, fill_value).spark.analyzed
        File "/home/spark/spark/python/pyspark/pandas/spark/accessors.py", line 256, in analyzed
          return first_series(DataFrame(self._data._internal.resolved_copy))
        File "/home/spark/spark/python/pyspark/pandas/utils.py", line 589, in wrapped_lazy_property
          setattr(self, attr_name, fn(self))
        File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1173, in resolved_copy
          sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
        File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2073, in select
          jdf = self._jdf.select(self._jcols(*cols))
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
          return_value = get_return_value(
        File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
          raise converted from None
      pyspark.sql.utils.AnalysisException: Cannot specify window frame for lag function
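
      The AnalysisException comes from Spark SQL itself: lag is an offset window function, and Spark's analyzer rejects it when the explicitly specified window frame does not match the frame the function expects. Judging from the traceback, pandas-on-Spark derives the shift window frame from periods, which breaks down at periods=0. A minimal sketch that should trigger the same error in plain PySpark (the rowsBetween(0, 0) frame is my guess at what periods=0 would produce, not taken from the pyspark.pandas source):

      >>> from pyspark.sql import functions as F
      >>> from pyspark.sql.window import Window
      >>> sdf = spark.range(5)
      >>> w = Window.orderBy("id").rowsBetween(0, 0)  # frame a periods=0 shift would derive
      >>> sdf.select(F.lag("id", 0).over(w))  # raises AnalysisException: Cannot specify window frame for lag function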

      Pandas:

      >>> pdf = pd.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3'])
      >>> pdf.Col1.shift(periods=3)
      0     NaN
      1     NaN
      2     NaN
      3    10.0
      4    20.0
      Name: Col1, dtype: float64
      >>> pdf.Col1.shift(periods=0)
      0    10
      1    20
      2    15
      3    30
      4    45
      Name: Col1, dtype: int64
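
      One way to align with pandas would be to short-circuit periods=0 before any lag window is built and return an unshifted copy. A minimal sketch against the shift() shown in the traceback (hypothetical, assuming copy() is available on the Series/Index subclass; the patch actually merged may differ):

      # pyspark/pandas/base.py (sketch, not the merged change)
      def shift(self, periods=1, fill_value=None):
          if periods == 0:
              # pandas returns an unshifted copy for shift(0), so skip the
              # lag-over-window path that Spark's analyzer rejects.
              return self.copy()
          return self._shift(periods, fill_value).spark.analyzed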


People

    Assignee: bzhaoop (bo zhao)
    Reporter: bzhaoop (bo zhao)
    Votes: 0
    Watchers: 3
