  1. Spark
  2. SPARK-27569

Pandas UDF prefetches Arrow batches in the queue while executing the current batch


Details

    • Type: Story
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      The current Pandas UDF implementation fetches the next batch only after it finishes executing the current batch. On the JVM side, writing the next batch to the socket blocks until the Python side fetches it. We can prefetch the next batch on the Python side to enable data pipelining. Theoretically, this can achieve up to a 2x speedup on a workload that is balanced between I/O and compute. We saw ~1.5x on a real workload.
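
      The pipelining idea can be sketched in plain Python. This is a minimal illustration, not the PySpark worker code: `prefetch`, `slow_source`, `process`, and `run` are hypothetical names, the source generator stands in for reading Arrow batches from the socket, and `time.sleep` stands in for both I/O and UDF execution. A background thread reads ahead into a bounded queue while the main thread works on the current batch. If reading a batch takes t_io and executing it takes t_c, the sequential cost per batch is t_io + t_c, while the pipelined cost approaches max(t_io, t_c), which is where the 2x bound for a balanced workload comes from.

```python
import queue
import threading
import time


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread reads ahead.

    `depth` bounds the queue, so the reader stays only a little ahead of
    the consumer instead of buffering the whole stream.
    """
    buf = queue.Queue(maxsize=depth)

    def producer():
        # Runs on the background thread: pull from the source and hand
        # each item (or the terminating condition) to the consumer.
        try:
            for batch in batches:
                buf.put(("item", batch))
            buf.put(("done", None))
        except Exception as exc:
            buf.put(("error", exc))

    # Daemon thread: a sketch-level shortcut so an abandoned consumer
    # does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        kind, value = buf.get()
        if kind == "done":
            return
        if kind == "error":
            raise value  # surface source errors in the consumer
        yield value


def slow_source(n, delay):
    """Hypothetical stand-in for reading n batches from a socket."""
    for i in range(n):
        time.sleep(delay)  # simulated I/O wait
        yield i


def process(batch, delay):
    """Hypothetical stand-in for running the UDF on one batch."""
    time.sleep(delay)  # simulated compute
    return batch * 2


def run(batches, delay):
    """Process every batch and return (results, elapsed seconds)."""
    start = time.perf_counter()
    out = [process(b, delay) for b in batches]
    return out, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential = run(slow_source(5, 0.05), 0.05)
    _, pipelined = run(prefetch(slow_source(5, 0.05)), 0.05)
    print(f"sequential: {sequential:.2f}s, pipelined: {pipelined:.2f}s")
```

      The sleeps release the GIL, as blocking socket reads do, so the overlap is real even in CPython. How much a real workload gains depends on how closely I/O and compute times match, which is consistent with the ~1.5x reported above falling short of the theoretical 2x.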

          People

            Assignee: Unassigned
            Reporter: Xiangrui Meng (mengxr)
            Votes: 0
            Watchers: 4
