[SPARK-27569] Pandas UDF prefetches Arrow batches in the queue while executing the current batch - ASF JIRA

Details

Type: Story
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

The current Pandas UDF implementation fetches the next batch only after it has finished executing the current batch. On the JVM side, writing the next batch to the socket is blocked for as long as the Python side does not fetch it. We can prefetch the next batch on the Python side to enable data pipelining. In theory, this can achieve up to a 2x speedup on a workload where I/O and compute are balanced, because the per-batch cost drops from the sum of the I/O and compute times to the larger of the two. We saw ~1.5x on a real workload.
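
To make the proposal concrete, here is a minimal sketch of prefetching with a background thread and a bounded queue. It is not PySpark's worker code: the names `prefetch`, `depth`, and `_DONE` are hypothetical, and `batches` stands in for whatever iterator reads Arrow batches from the socket. The sketch assumes the blocking read releases the GIL (as socket reads do), so it can overlap with the UDF running on the main thread.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the input stream


def prefetch(batches, depth=1):
    """Yield items from `batches`, reading up to `depth` items ahead.

    A background thread pulls from `batches` while the caller processes
    the item it was last handed, so reading and computing can overlap.
    """
    buf = queue.Queue(maxsize=depth)  # bounded, so memory use stays limited

    def producer():
        try:
            for batch in batches:
                buf.put((batch, None))  # blocks while the buffer is full
        except BaseException as exc:
            buf.put((None, exc))  # hand the error over to the consumer
            return
        buf.put((_DONE, None))

    # Daemon thread: if the consumer stops early, the producer may stay
    # blocked on put(), but it will not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch, exc = buf.get()
        if exc is not None:
            raise exc
        if batch is _DONE:
            return
        yield batch
```

A caller would wrap the batch iterator, for example `for batch in prefetch(read_batches()): run_udf(batch)`, where `read_batches` and `run_udf` are placeholders. With `depth=1`, at most one batch is held in the queue, plus one that the reader thread may already have read and is waiting to enqueue, while the current batch is being executed.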

People

Assignee: Unassigned

Reporter: Xiangrui Meng

Votes: 0

Watchers: 4

Dates

Created: 25/Apr/19 17:44

Updated: 25/Apr/19 17:44