Spark / SPARK-20685

BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.2, 2.2.0
    • Component/s: PySpark
    • Labels: None

    Description

      There's a latent corner-case bug in PySpark UDF evaluation: executing a stage with a single UDF that takes more than one argument, where the same argument is repeated, crashes at execution time with a confusing error.

      Here's a repro:

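      from pyspark.sql.types import IntegerType

      # Assumes `spark` is an existing SparkSession (e.g. the one provided by the pyspark shell).
      spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
      spark.sql("SELECT add(1, 1)").first()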
      

      This fails with

      Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
        File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
          process()
        File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
          serializer.dump_stream(func(split_index, iterator), outfile)
        File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 107, in <lambda>
          func = lambda _, it: map(mapper, it)
        File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 93, in <lambda>
          mapper = lambda a: udf(*a)
        File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
          return lambda *a: f(*a)
      TypeError: <lambda>() takes exactly 2 arguments (1 given)
      

      The problem was introduced by SPARK-14267: the code there has a fast path for handling a batch UDF evaluation consisting of a single Python UDF, but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema does not necessarily match the UDF inputs).
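
      To make the failure mode concrete, here is a minimal standalone sketch in plain Python (no Spark required). It is a simplified model, not the actual worker.py code: it assumes the repeated argument is sent only once in the input row, and the names input_row, arg_offsets, fast_path_mapper and unpacking_mapper are hypothetical, chosen for illustration. Only the `udf(*a)` call shape is taken from the traceback above.

```python
# Simplified model of the behaviour described in this issue; not the real worker.py.

def add(x, y):
    return x + y

# Assumption: for add(1, 1) the repeated argument appears once in the input
# row, so the row has a single column and both parameters point at column 0.
input_row = (1,)
arg_offsets = [0, 0]  # hypothetical per-argument column indices


def fast_path_mapper(udf):
    # Same shape as `mapper = lambda a: udf(*a)` in the traceback: the row is
    # splatted directly, so the UDF receives one argument instead of two.
    return lambda row: udf(*row)


def unpacking_mapper(udf, offsets):
    # Selects each argument from the row by offset, restoring repeated arguments.
    return lambda row: udf(*[row[o] for o in offsets])


try:
    fast_path_mapper(add)(input_row)
    fast_path_failed = False
except TypeError:
    # The one-column row does not supply both parameters of `add`.
    fast_path_failed = True

result = unpacking_mapper(add, arg_offsets)(input_row)
```

      In this model the direct splat raises a TypeError because the row has fewer columns than the UDF has parameters, while selecting arguments by offset calls the UDF with both values.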

      I have a simple fix for this which I'll submit now.

            People

              Assignee: Josh Rosen (joshrosen)
              Reporter: Josh Rosen (joshrosen)