[SPARK-28978] PySpark: Can't pass more than 256 arguments to a UDF


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0, 2.4.4
    • Fix Version/s: 3.0.0
    • Component/s: PySpark

      Description

This code:

https://github.com/apache/spark/blob/712874fa0937f0784f47740b127c3bab20da8569/python/pyspark/worker.py#L367-L379

creates Python lambdas that call the UDF functions with each argument spelled out individually rather than via varargs. For example: `lambda a: f(a[0], a[1], ...)`.

This fails when there are more than 256 arguments, because CPython (before 3.7) refuses to compile a call expression with that many explicit arguments.
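A minimal sketch of the failure mode, independent of Spark (the linked worker.py code builds its mapper by eval'ing a generated lambda source string of roughly this shape; `f`, `n`, `src`, and `mapper` here are stand-ins, not Spark's actual names):

```python
# Stand-in for the wrapped UDF.
def f(*args):
    return len(args)

n = 300  # more arguments than CPython < 3.7 allows in one call expression

# Spell out every argument individually, like `lambda a: f(a[0], a[1], ...)`.
src = "lambda a: f(%s)" % ", ".join("a[%d]" % i for i in range(n))

try:
    mapper = eval(src)
    print(mapper(list(range(n))))  # 300 on Python 3.7+
except SyntaxError as e:
    # On Python <= 3.6 the generated source is rejected at compile time
    # with "more than 255 arguments".
    print("generated lambda failed to compile:", e)
```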

MLflow, when generating model predictions, passes one argument per feature column. I have a model with more than 500 features.

I was able to hack around this easily by changing the generated lambdas to use varargs, as in `lambda a: f(*a)`.

I don't know why these lambdas were written the way they were. Using varargs is much simpler and works fine in my testing.
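For illustration, a sketch of that varargs form (same stand-in `f` as above): the generated source no longer names each argument, so the compile-time limit never applies.

```python
def f(*args):
    return len(args)

# The lambda unpacks the whole argument tuple, so no per-argument source
# is generated and any arity compiles, even on Python <= 3.6.
mapper = eval("lambda a: f(*a)")
print(mapper(list(range(300))))  # 300
```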



    People

    • Assignee: Bago Amirbekian (bago.amirbekian)
    • Reporter: Jim Fulton (j1m)
    • Votes: 0
    • Watchers: 3
