[SPARK-26449] Missing Dataframe.transform API in Python API - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: PySpark, SQL
Labels: None
Flags: Patch

Description

I would like to chain custom transformations, as suggested in this blog post.

This would make it possible to write something like the following:

 
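from pyspark.sql.functions import lit

def with_greeting(df):
    # Add a constant "greeting" column.
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    # Add a constant "something" column holding the given value.
    return df.withColumn("something", lit(something))

# Assumes an active SparkSession named `spark`.
data = [("jose", 1), ("li", 2), ("liz", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)
    .transform(lambda df: with_something(df, "crazy")))
# show() prints the table itself and returns None, so it is not wrapped in print().
actual_df.show()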
+----+---+--------+---------+
|name|age|greeting|something|
+----+---+--------+---------+
|jose|  1|      hi|    crazy|
|  li|  2|      hi|    crazy|
| liz|  3|      hi|    crazy|
+----+---+--------+---------+
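
The lambda above is one way to pass an extra argument to a transformation; `functools.partial` is another. The sketch below illustrates that variant. To keep it runnable without a Spark session it uses a hypothetical `Frame` class with a hypothetical `with_column` method as a stand-in for `DataFrame`; neither is part of PySpark.

```python
from functools import partial


class Frame:
    """Hypothetical stand-in for a DataFrame: maps column name -> list of values."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def with_column(self, name, value):
        # Return a new Frame with a constant column added; the original is unchanged.
        n_rows = len(next(iter(self.columns.values()), []))
        new_columns = dict(self.columns)
        new_columns[name] = [value] * n_rows
        return Frame(new_columns)

    def transform(self, f):
        # Same one-line idea as proposed in this issue: apply f to self.
        return f(self)


def with_greeting(df):
    return df.with_column("greeting", "hi")


def with_something(df, something):
    return df.with_column("something", something)


source = Frame({"name": ["jose", "li", "liz"], "age": [1, 2, 3]})

# partial binds the keyword argument up front, leaving a one-argument
# callable that transform() can apply to the frame.
result = (source
    .transform(with_greeting)
    .transform(partial(with_something, something="crazy")))
```

With `partial`, the chain names the bound argument explicitly, which some readers may find easier to scan than a lambda when several parameterized steps are chained.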

The only thing needed to accomplish this is the following simple method for DataFrame:

from pyspark.sql.dataframe import DataFrame 
def transform(self, f): 
    return f(self) 
DataFrame.transform = transform
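
A slightly more defensive version of this patch might look like the sketch below. Two things in it are assumptions of the sketch rather than part of the proposal: the `hasattr` guard, which avoids overwriting a `transform` method the class already has (this issue lists 3.0.0 as the fix version), and the return-type check, which makes a broken step fail where it occurs. The `Frame` class is a hypothetical stand-in for `DataFrame` so that the example runs without Spark.

```python
class Frame:
    """Hypothetical stand-in for pyspark.sql.DataFrame (not a PySpark class)."""

    def __init__(self, rows):
        self.rows = list(rows)


def transform(self, f):
    # Apply f to this frame and check that it returned a frame, so a chain
    # fails at the offending step rather than on the next method call.
    result = f(self)
    if not isinstance(result, Frame):
        raise TypeError(
            "f returned %s, expected Frame" % type(result).__name__)
    return result


# Attach the method only if the class does not already define one.
if not hasattr(Frame, "transform"):
    Frame.transform = transform

doubled = Frame([1, 2, 3]).transform(lambda fr: Frame(r * 2 for r in fr.rows))
```

A function that forgets its `return` statement hands back `None`; with the check in place the chain raises a `TypeError` at that step instead of an `AttributeError` one call later.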

I volunteer to submit the pull request if approved (at least the Python part).

Issue Links

is duplicated by

SPARK-30670 Pipes for PySpark (Resolved)

links to

[Github] Pull Request #23414 (chanansh)

GitHub Pull Request #23877

People

Assignee: Erik Christiansen
Reporter: Hanan Shteingart
Shepherd: Maciej Szymkiewicz
Votes: 0
Watchers: 5

Dates

Created: 26/Dec/18 20:39
Updated: 12/Dec/22 18:10
Resolved: 27/Feb/19 00:24

Time Tracking

Estimated: 24h
Remaining: 24h
Logged: Not Specified