Spark / SPARK-23569

pandas_udf does not work with type-annotated python functions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: PySpark
    • Labels:
      None
    • Environment:

      python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision a0d7949896e70f427e7f3942ff340c9484ff0aab

      Description

      When invoked on a type-annotated function, `pandas_udf` raises:

      `ValueError: Function has keyword-only parameters or annotations, use getfullargspec() API which can support them`

       

      The deprecated `inspect.getargspec` call occurs in `pyspark/sql/udf.py`:

      def _create_udf(f, returnType, evalType):
      
          if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
                          PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
              import inspect
              from pyspark.sql.utils import require_minimum_pyarrow_version
      
              require_minimum_pyarrow_version()
              argspec = inspect.getargspec(f)
      
              ...
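
      One possible way to avoid the error, sketched here under assumptions rather than taken from the merged patch, is to call `inspect.getfullargspec` where it exists and fall back to `inspect.getargspec` only on interpreters that lack it. The helper name `get_argspec_compat` and the two sample functions are hypothetical and do not come from PySpark.

```python
import inspect


def get_argspec_compat(f):
    """Hypothetical helper: return an argspec for f, tolerating annotations.

    getfullargspec() exists on Python 3 and accepts annotated functions;
    getargspec() is used only where getfullargspec() is missing (Python 2).
    """
    if hasattr(inspect, "getfullargspec"):
        return inspect.getfullargspec(f)
    return inspect.getargspec(f)


def annotated(a: int, b: int) -> int:
    return a * b


def plain(a, b):
    return a * b


# Both result types expose the positional argument names as `.args`.
annotated_args = get_argspec_compat(annotated).args
plain_args = get_argspec_compat(plain).args
```

      Either result exposes `.args`, which is the only field the `len(argspec.args) == 0` check visible in the traceback needs.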

      To reproduce: 

      import pandas as pd
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import pandas_udf, col

      spark = SparkSession.builder.getOrCreate()
      df = spark.range(12).withColumn('b', col('id') * 2)

      # Unannotated function: works.
      def ok(a, b):
          return a * b

      df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems

      # Same function with type annotations: raises ValueError.
      def ok(a: pd.Series, b: pd.Series) -> pd.Series:
          return a * b

      df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
      
       
      
      ---------------------------------------------------------------------------
      ValueError Traceback (most recent call last)
      <ipython-input-17-2e6ae67b15ee> in <module>()
      ----> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
      
      /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in pandas_udf(f, returnType, functionType)
      2277 return functools.partial(_create_udf, returnType=return_type, evalType=eval_type)
      2278 else:
      -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
      2280
      2281
      
      /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, returnType, evalType)
      44
      45 require_minimum_pyarrow_version()
      ---> 46 argspec = inspect.getargspec(f)
      47
      48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) == 0 and \
      
      /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
      1043 getfullargspec(func)
      1044 if kwonlyargs or ann:
      -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
      1046 ", use getfullargspec() API which can support them")
      1047 return ArgSpec(args, varargs, varkw, defaults)
      
      ValueError: Function has keyword-only parameters or annotations, use getfullargspec() API which can support them
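
      The last frame of the traceback can be reproduced without Spark. The following is a minimal sketch that assumes only the standard library; `annotated`, `legacy_getargspec`, and `outcome` are illustrative names. Because `inspect.getargspec` was removed in Python 3.11, the call is guarded so the snippet also runs on newer interpreters.

```python
import inspect
import warnings


def annotated(a: int, b: int) -> int:
    return a * b


# getargspec() is absent on Python 3.11+, so look it up defensively.
legacy_getargspec = getattr(inspect, "getargspec", None)

if legacy_getargspec is None:
    outcome = "getargspec unavailable"
else:
    try:
        with warnings.catch_warnings():
            # Silence the DeprecationWarning the legacy call emits.
            warnings.simplefilter("ignore", DeprecationWarning)
            legacy_getargspec(annotated)
        outcome = "no error"
    except ValueError:
        # Same failure as in the traceback: annotations are rejected.
        outcome = "ValueError"

# getfullargspec() handles the same function without complaint.
full_spec = inspect.getfullargspec(annotated)
```

      On interpreters that still ship `getargspec`, `outcome` should end up as `"ValueError"`, while `full_spec.args` holds the argument names in either case.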
      

            People

            • Assignee: mstewart141 Stu (Michael Stewart)
            • Reporter: mstewart141 Stu (Michael Stewart)