  1. Spark
  2. SPARK-43289

PySpark UDF supports python package dependencies


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: Connect, ML, PySpark
    • Labels: None

    Description

      Requirements

       

      Allow a PySpark UDF to declare its Python package dependencies. When the UDF executes, the UDF worker creates a new Python environment with the declared dependencies installed.
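As an illustration only, one way a worker could satisfy this requirement is sketched below. It derives a stable key from the declared requirements so that an environment can be reused, creates a virtual environment with the standard-library `venv` module, and installs the packages with pip. The names `env_key`, `build_env_commands`, `ensure_env`, and the cache layout are hypothetical and are not part of Spark.

```python
import hashlib
import os
import subprocess
import sys


def env_key(requirements):
    """Return a stable identifier for a list of pip requirement strings."""
    # Order is preserved in the key, so reordered lists map to different keys.
    payload = "\n".join(requirements).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def build_env_commands(env_dir, python=sys.executable):
    """Build the two commands that create a venv and install into it."""
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    env_python = os.path.join(env_dir, bin_dir, "python")
    req_file = os.path.join(env_dir, "requirements.txt")
    create = [python, "-m", "venv", env_dir]
    # A requirements file accepts "-r ..." and "-c ..." lines as-is.
    install = [env_python, "-m", "pip", "install", "-r", req_file]
    return [create, install]


def ensure_env(cache_root, requirements):
    """Create the environment for `requirements` if it is not cached yet."""
    env_dir = os.path.join(cache_root, env_key(requirements))
    if os.path.isdir(env_dir):
        return env_dir  # assumed complete; see the note below
    create, install = build_env_commands(env_dir)
    subprocess.run(create, check=True)
    with open(os.path.join(env_dir, "requirements.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(requirements) + "\n")
    subprocess.run(install, check=True)
    return env_dir
```

This sketch ignores partial failures and concurrent workers racing to build the same environment. A real implementation would need to handle both.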

      Motivation

       

      There are two major use cases:

      • With Spark Connect, the client's Python environment is very likely to differ from the Python environment on the PySpark server side, which causes the user's UDF to fail when it executes on the server.
      • Some third-party machine learning libraries (e.g. MLflow) need PySpark UDFs to support dependencies, because model inference in a PySpark UDF must run in exactly the same Python environment that was used to train the model. MLflow currently achieves this by creating a child Python process in the PySpark UDF worker and redirecting all UDF input data to that child process for inference, which causes significant overhead. If PySpark UDFs supported built-in Python dependency management, this poorly performing workaround would no longer be needed.

       

      Proposed API

      ```
      @pandas_udf("string", pip_requirements=...)
      ```

      The `pip_requirements` argument specifies the pip requirements for the Python UDF. It is either an iterable of pip requirement strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]``) or a string path to a pip requirements file on the local filesystem (e.g. ``"/path/to/requirements.txt"``).
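The two accepted forms could be reduced to a single list of requirement strings before any environment is built. The helper below is a minimal sketch of that step. The name `normalize_pip_requirements` and its handling of comments and blank lines are assumptions for illustration, not part of the proposed API.

```python
import os
from collections.abc import Iterable


def normalize_pip_requirements(pip_requirements):
    """Return a list of requirement strings from either accepted form."""
    if isinstance(pip_requirements, (str, os.PathLike)):
        # A single string is treated as the path to a requirements file.
        with open(pip_requirements, encoding="utf-8") as f:
            lines = [line.strip() for line in f]
        # Drop blank lines and full-line comments.
        return [line for line in lines if line and not line.startswith("#")]
    if isinstance(pip_requirements, Iterable):
        reqs = list(pip_requirements)
        if not all(isinstance(r, str) for r in reqs):
            raise TypeError("every pip requirement must be a string")
        return reqs
    raise TypeError(
        "pip_requirements must be a path string or an iterable of strings"
    )
```

The string check comes first because a `str` is itself iterable. Without that ordering, a file path would be split into single characters.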

          People

            weichenxu123 Weichen Xu
            Votes: 1
            Watchers: 6
