[SPARK-26410] Support per Pandas UDF configuration - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

We use the "maxRecordsPerBatch" conf (spark.sql.execution.arrow.maxRecordsPerBatch) to control batch sizes. However, the "right" batch size usually depends on the task itself. It would be nice if users could configure the batch size when they declare a Pandas UDF.
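To make the request concrete, here is a minimal sketch in plain Python. It is not Spark's API: the `batched_udf` decorator, its `batch_size` argument, and the helpers `iter_batches` and `apply_udf` are all hypothetical names invented for illustration. It only shows the idea of a function declaring its own batch size, with a fallback to a global default when none is declared (10000 here, standing in for the session-wide conf).

```python
from typing import Callable, Iterator, List, Optional, Sequence

# Stand-in for the session-wide row cap; not read from any real Spark conf.
DEFAULT_MAX_RECORDS_PER_BATCH = 10000


def batched_udf(batch_size: Optional[int] = None) -> Callable:
    """Hypothetical decorator: attach a requested batch size to a function."""
    def wrap(func: Callable[[List], List]) -> Callable[[List], List]:
        func.batch_size = batch_size  # None means "use the global default"
        return func
    return wrap


def iter_batches(rows: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` rows."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


def apply_udf(func: Callable[[List], List], rows: Sequence,
              default_size: int = DEFAULT_MAX_RECORDS_PER_BATCH) -> List:
    """Run `func` batch by batch, honoring its declared batch size if any."""
    size = getattr(func, "batch_size", None) or default_size
    out: List = []
    for batch in iter_batches(rows, size):
        out.extend(func(batch))
    return out


@batched_udf(batch_size=100)
def predict1(batch: List[float]) -> List[float]:
    # Placeholder "model": doubles every value in the batch.
    return [x * 2 for x in batch]


if __name__ == "__main__":
    sizes = [len(b) for b in iter_batches(list(range(250)), predict1.batch_size)]
    print(sizes)  # 250 rows split at 100 rows per batch
```

With this shape, a function that declares nothing keeps today's behavior, and only functions that opt in see a different batch size.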

This is orthogonal to SPARK-23258 (using max buffer size instead of row count).

Besides the API, we should also discuss how to merge Pandas UDFs with different configurations. For example,

df.select(predict1(col("features")), predict2(col("features")))

when predict1 requests 100 rows per batch, while predict2 requests 120 rows per batch.
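As a starting point for that discussion, the sketch below compares two candidate policies in plain Python. Neither is proposed in this issue and neither reflects how Spark behaves today; the function names and the `requests` mapping are hypothetical. One policy evaluates UDFs separately per requested batch size, the other evaluates them together using the smallest requested size so that no UDF receives more rows than it asked for.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_batch_size(requests: Dict[str, Optional[int]],
                        default_size: int = 10000) -> Dict[int, List[str]]:
    """Policy A (assumed): only UDFs with the same batch size share an evaluation."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for name, size in requests.items():
        # A UDF without its own setting falls back to the global default.
        groups[size if size is not None else default_size].append(name)
    return dict(groups)


def smallest_batch_size(requests: Dict[str, Optional[int]],
                        default_size: int = 10000) -> int:
    """Policy B (assumed): evaluate all UDFs together at the smallest requested size."""
    return min(size if size is not None else default_size
               for size in requests.values())


if __name__ == "__main__":
    requests = {"predict1": 100, "predict2": 120}
    print(group_by_batch_size(requests))
    print(smallest_batch_size(requests))
```

Policy A respects every request exactly but gives up sharing one batch stream between the two UDFs; policy B keeps a single stream but hands predict2 smaller batches than it requested.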

cc: icexelloss bryanc holdenk hyukjin.kwon ueshin smilegator

People

Assignee: Unassigned

Reporter: Xiangrui Meng

Votes: 0

Watchers: 4

Dates

Created: 19/Dec/18 17:36

Updated: 17/Mar/20 09:46