Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25801

pandas_udf grouped_map fails with input dataframe with more than 255 columns

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.0
    • 2.4.0
    • PySpark
    • None
    • python 2.7

      pyspark 2.3.0

    Description

      Hi,

      I'm using a pandas_udf to deploy a model to predict all samples in a spark dataframe,

      for this I use a udf as follows:
      @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return pd.DataFrame(

      {'scores': score_values}

      )
      So it takes a dataframe and predicts the probability of being positive according to an sklearn model for each row and returns this as single column. This works great on a random groupBy, e.g.:
      sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
      as long as the dataframe has <255 columns. When the input dataframe has more than 255 columns (thus features in my model), I get:
      org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
      func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
      mapper = eval(mapper_str, udfs)
      File "<string>", line 1
      SyntaxError: more than 255 arguments
      Which seems to be related with Python's general limitation of having not allowing more than 255 arguments for a function?

       

      Is this a bug or is there a straightforward way around this problem?

       

      Regards,

      Frederik

      Attachments

        Activity

          People

            Unassigned Unassigned
            Toekan Frederik
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: