Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-11976

PyFlink vectorized python udf with Pandas support

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • None
    • None
    • API / Python
    • None

    Description

      Motivation

      Currently, the PyFlink  allow user to compose Flink data transformation and define UDF in python.  The PyFlink  transform python scripts into operation plans and send it over to Java runtime, having the Java runtime execute the operations accordingly and return the executed result. 

      While encountering Python UDF, the Java runtime create another Python worker, serialized the data and have it send over to python worker. The python worker processed data in row based manner and send it back to Java runtime. 

      How flink python UDF works

      There are several limitation with current python udf:

      • Inefficient data movement between Java and Python (Serialization/Deserialization)
      • Scalar Computation model

       Goals

      • Enable Pandas support in Flink Python UDF.
      • Enable vectorizied Python UDF execution based on Pandas
      • Using Apache Arrow as the serialization format between Java runtime and Python worker

       

      Pandas UDF (vectorized UDF)

      Benefits

      • Provided high performance, easy-to-use data structures and data analysis tools for Python.
      • Pandas already provide interface to directly interact with Apache Arrow
      • Enable vectorized computation to fully taking advantage of the Arrow Memory layout. 

      Attachments

        Activity

          People

            Unassigned Unassigned
            Yurui Zhou Yurui Zhou
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: