  1. Spark
  2. SPARK-36394 Increase pandas API coverage in PySpark
  3. SPARK-38844

Implement Series.interpolate and DataFrame.interpolate


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: PySpark
    • Labels: None

    Description

      Goal:

      pandas' interpolate supports many methods, with linear applied by default. Of the others, pad, ffill, backfill, and bfill can also be implemented in pandas API on Spark.

      The remaining methods (including quadratic, cubic, and spline) cannot be implemented easily, since pandas uses SciPy internally for them and the window frame they would require is complex.

      Since pad, ffill, backfill, and bfill are already implemented in pandas API on Spark via fillna, this work currently focuses on implementing the missing linear interpolation.

       

      Implementation:

      To implement linear interpolation, two extra window functions are added: one (null_index) computes the position of each missing value within its consecutive run of missing values, and the other (last_not_null) keeps the last non-missing value. Each is evaluated in both the forward and the backward direction.
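The two helper columns can be illustrated with a minimal plain-Python sketch. This is not the Spark implementation (which uses window functions); it only reproduces the semantics on a list, using the example series from this issue. The names null_index and last_not_null come from the description above; the backward helper is a hypothetical convenience that runs a forward scan on the reversed list.

```python
import math

nan = math.nan

def null_index(values):
    """1-based position of each NaN within its consecutive run; 0 for non-missing values."""
    out, run = [], 0
    for v in values:
        run = run + 1 if math.isnan(v) else 0
        out.append(run)
    return out

def last_not_null(values):
    """Most recent non-missing value seen so far (NaN until one has been seen)."""
    out, last = [], nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out

def backward(fn, values):
    """Hypothetical helper: apply a forward scan in the backward direction."""
    return fn(values[::-1])[::-1]

# Example series from this issue (indices 1..9).
values = [nan, 1.0, nan, nan, nan, 5.0, 6.0, nan, nan]

print("null_index_forward     ", null_index(values))
print("last_not_null_forward  ", last_not_null(values))
print("null_index_backward    ", backward(null_index, values))
print("last_not_null_backward ", backward(last_not_null, values))
```

In Spark the same quantities would be computed over an ordered window rather than by iterating over a Python list; the sketch is only meant to make the column definitions concrete.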

      index  value  null_index_forward  last_not_null_forward  null_index_backward  last_not_null_backward  filled  filled (limit=1)
      1      nan    1                   nan                    1                    1                       -       -
      2      1      0                   1                      0                    1
      3      nan    1                   1                      3                    5                       2.0     2.0
      4      nan    2                   1                      2                    5                       3.0     -
      5      nan    3                   1                      1                    5                       4.0     -
      6      5      0                   5                      0                    5
      7      6      0                   6                      0                    6
      8      nan    1                   6                      2                    nan                     6.0     6.0
      9      nan    2                   6                      1                    nan                     6.0     -
      • For the NaNs at indices (3, 4, 5), the filled value is always computed as

      (last_not_null_backward - last_not_null_forward) / (null_index_forward + null_index_backward) * null_index_forward + last_not_null_forward

      • The NaN at index 1 is skipped because of the default limit_direction = forward.
      • The trailing NaNs at indices (8, 9) are filled like ffill, with the value last_not_null_forward.
      • If limit is set, NaNs whose null_index_forward is greater than limit are not interpolated.
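The fill rules above can be checked with a short plain-Python sketch. It takes the helper columns directly from the table in this issue and applies the formula row by row. The function name fill_row and the row layout are hypothetical, chosen for illustration, and only the default limit_direction = forward is modelled.

```python
import math

nan = math.nan

def fill_row(value, nif, lnnf, nib, lnnb, limit=None):
    """Fill one row from its helper columns.

    nif / nib   : null_index_forward / null_index_backward
    lnnf / lnnb : last_not_null_forward / last_not_null_backward
    """
    if not math.isnan(value):
        return value                      # non-missing values are kept
    if limit is not None and nif > limit:
        return nan                        # beyond the limit: not interpolated
    if math.isnan(lnnf):
        return nan                        # leading NaN: skipped in forward direction
    if math.isnan(lnnb):
        return lnnf                       # trailing NaN: filled like ffill
    return (lnnb - lnnf) / (nif + nib) * nif + lnnf

# (value, null_index_forward, last_not_null_forward,
#  null_index_backward, last_not_null_backward), copied from the table.
rows = [
    (nan, 1, nan, 1, 1.0),
    (1.0, 0, 1.0, 0, 1.0),
    (nan, 1, 1.0, 3, 5.0),
    (nan, 2, 1.0, 2, 5.0),
    (nan, 3, 1.0, 1, 5.0),
    (5.0, 0, 5.0, 0, 5.0),
    (6.0, 0, 6.0, 0, 6.0),
    (nan, 1, 6.0, 2, nan),
    (nan, 2, 6.0, 1, nan),
]

filled = [fill_row(*r) for r in rows]
filled_limit1 = [fill_row(*r, limit=1) for r in rows]

print(filled)
print(filled_limit1)
```

For index 3, for example, the formula gives (5 - 1) / (1 + 3) * 1 + 1 = 2.0, matching the filled column of the table.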

      Plan

      1. Implement basic linear interpolation with the limit parameter.

      2. Add the limit_direction parameter.

      3. Add the limit_area parameter.

          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)