Description
Goal:
pandas's interpolate supports many methods; linear is applied by default. The other methods (pad, ffill, backfill, bfill) can also be implemented in pandas API on Spark.
The remaining methods (including quadratic, cubic, spline) cannot be implemented easily, since scipy is used internally and the window frame they require is complex.
Since the methods (pad, ffill, backfill, bfill) are already implemented in pandas API on Spark via fillna, this work currently focuses on implementing the missing linear interpolation.
Impl:
To implement linear interpolation, two extra window functions are added: one (null_index) computes the position of each missing value within its consecutive run of missing values, and the other (last_not_null) keeps the last non-missing value. Each is evaluated in both the forward and the backward direction.
index | value | null_index_forward | last_not_null_forward | null_index_backward | last_not_null_backward | filled | filled (limit=1) |
---|---|---|---|---|---|---|---|
1 | nan | 1 | nan | 1 | 1 | - | - |
2 | 1 | 0 | 1 | 0 | 1 | | |
3 | nan | 1 | 1 | 3 | 5 | 2.0 | 2.0 |
4 | nan | 2 | 1 | 2 | 5 | 3.0 | - |
5 | nan | 3 | 1 | 1 | 5 | 4.0 | - |
6 | 5 | 0 | 5 | 0 | 5 | | |
7 | 6 | 0 | 6 | 0 | 6 | | |
8 | nan | 1 | 6 | 2 | nan | 6.0 | 6.0 |
9 | nan | 2 | 6 | 1 | nan | 6.0 | - |
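
The helper columns in the table can be reproduced with a small pure-Python sketch. This is only an illustration of what the two window functions compute, not the Spark implementation; the function names `null_index`, `last_not_null`, and `backward` are chosen here for the example, and the backward variants are assumed to be the same scans applied to the reversed sequence.

```python
import math

def null_index(values):
    """1-based position of each NaN inside its consecutive run; 0 for non-missing values."""
    out, run = [], 0
    for v in values:
        run = run + 1 if math.isnan(v) else 0
        out.append(run)
    return out

def last_not_null(values):
    """Most recent non-missing value seen so far (NaN until one has been seen)."""
    out, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out

def backward(fn, values):
    """Backward variant of a forward scan: scan the reversed sequence, then reverse back."""
    return fn(values[::-1])[::-1]

nan = math.nan
values = [nan, 1.0, nan, nan, nan, 5.0, 6.0, nan, nan]  # the "value" column, indices 1..9

null_index_forward = null_index(values)
last_not_null_forward = last_not_null(values)
null_index_backward = backward(null_index, values)
last_not_null_backward = backward(last_not_null, values)
```

In Spark the same quantities would be computed with window functions over an ordered frame rather than Python loops; the loops here only make the definitions concrete.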
- For the NaNs at indices (3, 4, 5), which have a non-missing value on both sides, the filled value is always computed as
(last_not_null_backward - last_not_null_forward) / (null_index_forward + null_index_backward) * null_index_forward + last_not_null_forward
- For the NaN at index 1, skip it, because the default limit_direction is forward and there is no preceding non-missing value.
- For the trailing NaNs at indices (8, 9), fill them like ffill, with the value last_not_null_forward.
- If limit is set, NaNs with null_index_forward greater than limit are not interpolated.
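
The rules above can be combined into a minimal, self-contained sketch. It assumes the default limit_direction = forward and plain Python lists; the function name `interpolate_forward` is hypothetical and is not part of pandas or pandas API on Spark.

```python
import math

def interpolate_forward(values, limit=None):
    """Linear interpolation with limit_direction='forward', following the rules above."""
    n = len(values)

    # Forward scan: null_index_forward and last_not_null_forward.
    idx_f, last_f = [0] * n, [math.nan] * n
    run, last = 0, math.nan
    for i in range(n):
        if math.isnan(values[i]):
            run += 1
        else:
            run, last = 0, values[i]
        idx_f[i], last_f[i] = run, last

    # Backward scan: null_index_backward and last_not_null_backward.
    idx_b, last_b = [0] * n, [math.nan] * n
    run, last = 0, math.nan
    for i in range(n - 1, -1, -1):
        if math.isnan(values[i]):
            run += 1
        else:
            run, last = 0, values[i]
        idx_b[i], last_b[i] = run, last

    filled = []
    for i, v in enumerate(values):
        if not math.isnan(v):
            filled.append(v)                # non-missing: keep as is
        elif math.isnan(last_f[i]):
            filled.append(math.nan)         # leading NaN: skipped in forward direction
        elif limit is not None and idx_f[i] > limit:
            filled.append(math.nan)         # beyond limit: not interpolated
        elif math.isnan(last_b[i]):
            filled.append(last_f[i])        # trailing NaN: behaves like ffill
        else:
            slope = (last_b[i] - last_f[i]) / (idx_f[i] + idx_b[i])
            filled.append(slope * idx_f[i] + last_f[i])
    return filled

nan = math.nan
values = [nan, 1.0, nan, nan, nan, 5.0, 6.0, nan, nan]
filled = interpolate_forward(values)
filled_limit_1 = interpolate_forward(values, limit=1)
```

With the table's input, `filled` and `filled_limit_1` correspond to the "filled" and "filled (limit=1)" columns, with NaN where the table shows "-".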
Plan
1, implement basic linear interpolation with the parameter limit
2, add the parameter limit_direction
3, add the parameter limit_area