Details
Type: Story
Status: Open
Priority: Major
Resolution: Unresolved
3.0.0
None
None
Description
The current Pandas UDF implementation fetches the next batch only after it has finished executing the current batch. On the JVM side, writing the next batch to the socket blocks until the Python side fetches it. We can prefetch the next batch on the Python side to enable data pipelining, so that reading batch N+1 overlaps with executing batch N. In theory, this gives up to a 2x speedup on a workload where I/O and compute are balanced, because the two stages run concurrently instead of back to back. We saw ~1.5x on a real workload.
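The prefetching idea can be sketched in plain Python as follows. This is only an illustration of the technique, not the actual PySpark worker code: the `prefetch` helper, its `depth` parameter, and the use of a background thread with a bounded queue are all assumptions made for the example.

```python
import queue
import threading


def prefetch(batches, depth=1):
    """Yield items from `batches`, reading up to `depth` items ahead.

    Hypothetical helper: a background thread pulls the next batch from the
    input iterator while the caller is still processing the current one.
    """
    buf = queue.Queue(maxsize=depth)  # bounded, so the reader stays only `depth` ahead

    def producer():
        try:
            for batch in batches:
                buf.put(("item", batch))  # blocks while the buffer is full
            buf.put(("done", None))
        except BaseException as exc:
            buf.put(("error", exc))  # hand the failure over to the consumer

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        kind, value = buf.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value


if __name__ == "__main__":
    # Stand-in for the UDF loop: each batch is processed while the next is read.
    for batch in prefetch(iter([[1, 2], [3, 4], [5, 6]])):
        print(sum(batch))
```

With a per-batch read time `t_io` and compute time `t_c`, the serial loop costs roughly `t_io + t_c` per batch, while the pipelined loop costs roughly `max(t_io, t_c)`. The ratio reaches 2 only when the two are equal, which is consistent with a smaller gain on a real workload.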