Spark / SPARK-27870

Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Flush each batch for pandas UDF.

      This could improve performance when multiple pandas UDF plans are pipelined.

      When each batch is flushed promptly, downstream pandas UDFs can start processing as soon as possible, and the pipelining helps hide the downstream UDFs' computation time. For example:

      When the first UDF starts computing on batch-3, the second pipelined UDF can start computing on batch-2, and the third pipelined UDF can start computing on batch-1.
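
      The overlap described above can be sketched with a toy timing model. This is not Spark code: the function name `pipeline_finish_time`, the per-stage costs, and the worst-case assumption that unflushed output only becomes visible after a stage has finished every batch are all illustrative assumptions.

```python
def pipeline_finish_time(num_batches, stage_costs, flush_each_batch):
    """Return the time at which the last stage finishes the last batch.

    Toy model (an assumption, not Spark's scheduler): each stage handles
    batches in order, one at a time, at a fixed cost per batch.
    """
    # ready[b] = time at which batch b becomes available to the current stage
    ready = [0.0] * num_batches
    for cost in stage_costs:
        done = []
        clock = 0.0
        for b in range(num_batches):
            # A stage cannot start a batch before it arrives
            # or before the stage has finished the previous batch.
            clock = max(clock, ready[b]) + cost
            done.append(clock)
        if flush_each_batch:
            # Each batch is handed downstream as soon as it is done.
            ready = done
        else:
            # Worst case: output stays buffered until the stage is finished.
            ready = [done[-1]] * num_batches
    return ready[-1]


if __name__ == "__main__":
    # Three pipelined UDFs, three batches, one time unit per batch per UDF.
    print(pipeline_finish_time(3, [1.0, 1.0, 1.0], flush_each_batch=True))   # 5.0
    print(pipeline_finish_time(3, [1.0, 1.0, 1.0], flush_each_batch=False))  # 9.0
```

      In this model the flushed pipeline finishes in roughly (batches + stages - 1) time units, while the unflushed one degrades towards (batches × stages). Real buffers also flush when they fill up, so the actual gap depends on batch and buffer sizes.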

      If each batch is not flushed promptly, the downstream UDFs lag too far behind in the pipeline, which may increase the total processing time.
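
      The effect of a missing flush can be reproduced with a plain buffered writer on an OS pipe. This is a minimal sketch, assuming a POSIX system (it relies on `select` working on pipe file descriptors); the helper name `bytes_visible_downstream` is hypothetical and the code does not reflect how Spark's Python worker is actually implemented.

```python
import io
import os
import select


def bytes_visible_downstream(flush_each_batch, batches, buffer_size=65536):
    """Write batches through a buffered writer and record, after each write,
    how many bytes a downstream reader could already read from the pipe."""
    read_fd, write_fd = os.pipe()
    # The BufferedWriter holds data in memory until it is flushed or full.
    writer = io.BufferedWriter(io.FileIO(write_fd, "w"), buffer_size=buffer_size)
    visible = []
    try:
        for batch in batches:
            writer.write(batch)
            if flush_each_batch:
                writer.flush()
            # Poll without blocking: is anything readable on the other end?
            ready, _, _ = select.select([read_fd], [], [], 0)
            if ready:
                visible.append(len(os.read(read_fd, buffer_size)))
            else:
                visible.append(0)
    finally:
        writer.close()  # flushes any remaining data and closes write_fd
        os.close(read_fd)
    return visible


if __name__ == "__main__":
    batches = [b"x" * 100] * 3
    print(bytes_visible_downstream(True, batches))   # [100, 100, 100]
    print(bytes_visible_downstream(False, batches))  # [0, 0, 0]
```

      With small batches and no explicit flush, nothing reaches the reader until the writer is closed, so a downstream consumer would sit idle for the whole run.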

       

          People

            gurwls223 Hyukjin Kwon
            weichenxu123 Weichen Xu
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: