  1. Spark
  2. SPARK-36394 Increase pandas API coverage in PySpark
  3. SPARK-38844

Implement Series.interpolate and DataFrame.interpolate


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: PySpark
    • Labels: None

    Description

      Goal:

      pandas's interpolate supports many methods, with linear applied by default. Some of the other methods (pad, ffill, backfill, bfill) can also be implemented in pandas API on Spark.

      The remaining methods (including quadratic, cubic, and spline) cannot be implemented easily, since scipy is used internally and the window frame required is complex.

      Since the methods (pad, ffill, backfill, bfill) are already implemented in pandas API on Spark via fillna, this work currently focuses on implementing the missing linear interpolation.

       

      Impl:

      To implement linear interpolation, two extra window functions are added: one (null_index) computes the position of each missing value within its run of consecutive missing values, and the other (last_not_null) keeps the last non-missing value. Both are evaluated in the forward and backward directions, as in the following example:

      index  value  null_index_forward  last_not_null_forward  null_index_backward  last_not_null_backward  filled  filled (limit=1)
      1      nan    1                   nan                    1                    1                       -       -
      2      1      0                   1                      0                    1
      3      nan    1                   1                      3                    5                       2.0     2.0
      4      nan    2                   1                      2                    5                       3.0     -
      5      nan    3                   1                      1                    5                       4.0     -
      6      5      0                   5                      0                    5
      7      6      0                   6                      0                    6
      8      nan    1                   6                      2                    nan                     6.0     6.0
      9      nan    2                   6                      1                    nan                     6.0     -
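
      The helper columns in the table can be reproduced with a small pure-Python sketch. This is only an illustration of the idea, not the Spark window-function implementation from this issue; the function bodies and the trick of reversing the list to obtain the backward columns are assumptions made for the example.

```python
import math

def null_index(values):
    """1-based position of each NaN inside its run of consecutive NaNs; 0 for non-NaN."""
    out, run = [], 0
    for v in values:
        run = run + 1 if math.isnan(v) else 0
        out.append(run)
    return out

def last_not_null(values):
    """Most recent non-NaN value seen so far; NaN until the first one appears."""
    out, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out

nan = math.nan
values = [nan, 1, nan, nan, nan, 5, 6, nan, nan]  # the "value" column of the table

null_index_forward = null_index(values)
last_not_null_forward = last_not_null(values)
# Illustration only: the backward columns are the same scans over the reversed sequence.
null_index_backward = null_index(values[::-1])[::-1]
last_not_null_backward = last_not_null(values[::-1])[::-1]
```

      In Spark the same scans would be expressed over ordered window frames rather than Python loops.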
      • for the NaNs at indices (3, 4, 5), which have non-missing values on both sides, we always compute the filled value via

      (last_not_null_backward - last_not_null_forward) / (null_index_forward + null_index_backward) * null_index_forward + last_not_null_forward

      • for the leading NaN at index (1), skip it because the default limit_direction is forward
      • for the trailing NaNs at indices (8, 9), fill them like ffill with the value last_not_null_forward
      • if limit is set, NaNs with null_index_forward greater than limit will not be interpolated
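
      The rules above can be checked against the table with a short sketch. The function name interpolate_row and its argument order are hypothetical, chosen for this example only; the helper-column values are copied from the table, and only the default forward direction is covered.

```python
import math

def interpolate_row(value, nif, lnnf, nib, lnnb, limit=None):
    """Fill one row from its helper columns (hypothetical helper, forward direction only).

    nif / nib   : null_index_forward / null_index_backward
    lnnf / lnnb : last_not_null_forward / last_not_null_backward
    """
    if not math.isnan(value):
        return value              # non-missing value: nothing to fill
    if limit is not None and nif > limit:
        return value              # beyond the limit: left as NaN
    if math.isnan(lnnf):
        return value              # leading NaN: skipped when filling forward
    if math.isnan(lnnb):
        return lnnf               # trailing NaN: filled like ffill
    # Interior NaN: the linear interpolation formula from the description.
    return (lnnb - lnnf) / (nif + nib) * nif + lnnf

nan = math.nan
rows = [  # (value, nif, lnnf, nib, lnnb), copied from the table
    (nan, 1, nan, 1, 1),
    (1, 0, 1, 0, 1),
    (nan, 1, 1, 3, 5),
    (nan, 2, 1, 2, 5),
    (nan, 3, 1, 1, 5),
    (5, 0, 5, 0, 5),
    (6, 0, 6, 0, 6),
    (nan, 1, 6, 2, nan),
    (nan, 2, 6, 1, nan),
]

filled = [interpolate_row(*r) for r in rows]
filled_limit_1 = [interpolate_row(*r, limit=1) for r in rows]
```

      Rows that stay NaN in the result correspond to the "-" entries in the filled columns of the table.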

      Plan

      1. Implement basic linear interpolation with the limit parameter

      2. Add the limit_direction parameter

      3. Add the limit_area parameter

          People

            Assignee: podongfeng Ruifeng Zheng
            Reporter: podongfeng Ruifeng Zheng