I'm using a pandas_udf to deploy a model that predicts all samples in a Spark dataframe.
For this I use a UDF as follows:
@pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
def predict_scores(pdf):
    score_values = model.predict_proba(pdf)[:, 1]
    return pd.DataFrame({"scores": score_values})
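For reference, the grouped-map semantics are the same as a pandas groupby/apply: each group arrives as a pandas DataFrame and the per-group results are concatenated. Here is a pure-pandas sketch of that pattern, with a toy stand-in for `model.predict_proba` (a hypothetical sigmoid scorer, not my real model):

```python
import numpy as np
import pandas as pd

def predict_proba(pdf):
    # Toy stand-in for the sklearn model: returns [P(neg), P(pos)] per row.
    pos = 1.0 / (1.0 + np.exp(-pdf.sum(axis=1)))
    return np.column_stack([1.0 - pos, pos])

def predict_scores(pdf):
    # Same body as the pandas_udf above: score each row, keep P(positive).
    score_values = predict_proba(pdf)[:, 1]
    return pd.DataFrame({"scores": score_values})

# Grouped-map analogue: apply the function per group, concatenate results.
df = pd.DataFrame({"g": [0, 0, 1],
                   "x1": [0.1, -0.2, 0.3],
                   "x2": [1.0, 0.5, -1.0]})
scores = df.groupby("g", group_keys=False).apply(
    lambda g: predict_scores(g.drop(columns="g"))
)
print(scores)
```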
So it takes a dataframe, predicts for each row the probability of the positive class according to an sklearn model, and returns that as a single column. This works fine on an arbitrary groupBy, as long as the dataframe has fewer than 255 columns. When the input dataframe has more than 255 columns (and thus my model has that many features), I get:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
mapper = eval(mapper_str, udfs)
File "<string>", line 1
SyntaxError: more than 255 arguments
This seems to be related to Python's general limit of not allowing more than 255 arguments in a function call?
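The limit can be reproduced outside Spark by compiling a call expression with more than 255 positional arguments, which is roughly what the eval(mapper_str, udfs) step in pyspark/worker.py does. A minimal sketch (`call_compiles` is my own helper name; the limit applies to CPython 3.6 and earlier, and was removed in 3.7):

```python
import sys

def call_compiles(n_args):
    # Build the source for a call "f(a0, a1, ..., a<n-1>)" and try to
    # compile it, mirroring the eval() of a generated mapper string that
    # fails with "SyntaxError: more than 255 arguments".
    src = "f(" + ", ".join("a%d" % i for i in range(n_args)) + ")"
    try:
        compile(src, "<string>", "eval")
        return True
    except SyntaxError:
        return False

print(call_compiles(10))   # True on any CPython
# 300 arguments: SyntaxError on CPython <= 3.6, fine on 3.7+.
print(call_compiles(300))
```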
Is this a bug, or is there a straightforward way around this problem?