[SPARK-26412] Allow Pandas UDF to take an iterator of pd.DataFrames - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.0
Component/s: PySpark
Labels: None
Target Version/s: 3.0.0

Description

Pandas UDFs are an ideal connection between PySpark and deep learning (DL) model inference workloads. However, the user must load the model file before making predictions, and models of ~100 MB or larger are common. If Pandas UDF execution is limited to one batch at a time, the user has to reload the same model for every batch in the same Python worker process, which is inefficient.

We can provide users with an iterator of pd.DataFrame batches and let user code handle it:

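from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
  model = ...  # load the model once, before iterating over the batches
  for batch in batch_iter:
    yield model.predict(batch)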
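
To make the cost difference concrete, here is a minimal plain-Python sketch that runs without Spark. `FakeModel`, `predict_per_batch`, and `predict_iter` are hypothetical names introduced only for illustration, and plain lists stand in for pandas batches. It counts how often the "model" is loaded under each calling style.

```python
class FakeModel:
    """Hypothetical stand-in for a large model; counts how often it is loaded."""
    loads = 0

    def __init__(self):
        FakeModel.loads += 1  # pretend this is an expensive file read

    def predict(self, batch):
        return [2.0 * x for x in batch]


def predict_per_batch(batch):
    # Batch-at-a-time style: the model is loaded on every call.
    model = FakeModel()
    return model.predict(batch)


def predict_iter(batch_iter):
    # Iterator style: the model is loaded once, then reused for all batches.
    model = FakeModel()
    for batch in batch_iter:
        yield model.predict(batch)


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

FakeModel.loads = 0
per_batch_out = [predict_per_batch(b) for b in batches]
per_batch_loads = FakeModel.loads

FakeModel.loads = 0
iter_out = list(predict_iter(iter(batches)))
iter_loads = FakeModel.loads
```

Both styles produce the same predictions; only the number of loads differs (one per batch versus one in total).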

The type of each batch is:

a pd.Series if the UDF is called with a single non-struct-type column
a tuple of pd.Series if the UDF is called with more than one Spark DataFrame column
a pd.DataFrame if the UDF is called with a single StructType column
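
The three rules above can be written down as a small lookup. This is only a sketch of the rules as stated in this ticket, not Spark code: `batch_type` and the string encoding of column types are hypothetical.

```python
def batch_type(column_types):
    """Return the per-batch type for a UDF called with the given columns.

    `column_types` is a hypothetical list of strings, one per UDF argument,
    where "struct" marks a StructType column.
    """
    if not column_types:
        raise ValueError("the UDF must be called with at least one column")
    if len(column_types) > 1:
        return "tuple of pd.Series"
    if column_types[0] == "struct":
        return "pd.DataFrame"
    return "pd.Series"


single = batch_type(["double"])
several = batch_type(["double", "double"])
struct_col = batch_type(["struct"])
```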

Examples:

@pandas_udf(...)
def evaluate(batch_iter):
  model = ... # load model
  for features, label in batch_iter:
    pred = model.predict(features)
    yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))

@pandas_udf(...)
def evaluate(pdf_iter):
  model = ... # load model
  for pdf in pdf_iter:
    pred = model.predict(pdf['features'])
    yield (pred - pdf['label']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))

If the total number of records the UDF returns for a partition differs from the number of input records in that partition, the user should see an error. Individual yields are not required to match the size of the corresponding input batch.
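
A minimal plain-Python sketch of that contract, assuming lists in place of pandas batches. `run_udf_on_partition`, `rechunk`, and `drop_first` are hypothetical names; the actual check in Spark may be implemented differently. The driver compares only the totals, so a UDF may re-chunk its output freely but may not drop records.

```python
def run_udf_on_partition(udf, batches):
    """Hypothetical driver: feed batches to an iterator UDF and check totals."""
    n_in = 0

    def counted():
        nonlocal n_in
        for batch in batches:
            n_in += len(batch)
            yield batch

    out = []
    for result in udf(counted()):
        out.extend(result)
    if len(out) != n_in:
        raise ValueError(
            f"UDF returned {len(out)} records for {n_in} input records"
        )
    return out


def rechunk(batch_iter, size=2):
    # Valid: yields chunks of `size`, regardless of the input batch sizes.
    buf = []
    for batch in batch_iter:
        buf.extend(batch)
        while len(buf) >= size:
            yield buf[:size]
            buf = buf[size:]
    if buf:
        yield buf


def drop_first(batch_iter):
    # Invalid: drops one record per batch, so the totals no longer match.
    for batch in batch_iter:
        yield batch[1:]


batches = [[1, 2, 3], [4], [5, 6, 7]]
ok = run_udf_on_partition(rechunk, batches)

try:
    run_udf_on_partition(drop_first, batches)
    failed = False
except ValueError:
    failed = True
```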

Another benefit is flexibility: with an iterator interface and Python's asyncio, users can implement their own data pipelining.
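
One way such pipelining could look is a read-ahead wrapper around the batch iterator. The sketch below uses a background thread and a bounded queue instead of asyncio, purely to keep the example short. `prefetch` and `depth` are hypothetical names, and plain lists stand in for pandas batches.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the input stream


def prefetch(batch_iter, depth=2):
    """Yield items from batch_iter while a background thread reads ahead.

    At most `depth` batches are buffered. Simplification: an exception in
    the producer ends the stream early instead of being re-raised here.
    """
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batch_iter:
                buf.put(batch)
        finally:
            buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _SENTINEL:
            return
        yield item


def predict(batch_iter):
    # While one batch is being processed, the next one is already being read.
    for batch in prefetch(batch_iter):
        yield [2 * x for x in batch]


result = list(predict(iter([[1, 2], [3], [4, 5, 6]])))
```

The order of batches is preserved because a single producer feeds a FIFO queue.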

cc: icexelloss bryanc holdenk hyukjin.kwon ueshin smilegator

Issue Links

blocks: SPARK-28056 Document SCALAR_ITER Pandas UDF (Resolved)
is related to: SPARK-24579 SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks (Open)
relates to: SPARK-26413 SPIP: RDD Arrow Support in Spark Core and PySpark (Open)
links to: GitHub Pull Request #24643, GitHub Pull Request #28135

People

Assignee: Weichen Xu
Reporter: Xiangrui Meng
Votes: 0
Watchers: 10

Dates

Created: 19/Dec/18 18:11
Updated: 12/Dec/22 18:11
Resolved: 15/Jun/19 15:29