Description
The following snippet crashes with the error org.apache.spark.SparkException: Python worker exited unexpectedly (crashed):
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")], ("first", "second"))

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(df):
    return df[df['first'] == "9"]

df.groupby("second").apply(filter_pandas).count()
while this one does not:
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 2), (2, 2), (2, 3), (3, 4), (5, 6)], ("first", "second"))

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(df):
    return df[df['first'] == 9]

df.groupby("second").apply(filter_pandas).count()
and neither does this one:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")], ("first", "second"))

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(df):
    df = df[df['first'] == "9"]
    if len(df) > 0:
        return df
    else:
        return pd.DataFrame({"first": [], "second": []})

df.groupby("second").apply(filter_pandas).count()
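One difference between the crashing and non-crashing empty returns that may be worth checking is the pandas column dtypes: a frame emptied by a boolean filter keeps its original column dtypes, while a frame built from empty lists gets pandas' defaults. A quick local check (plain pandas, no Spark involved; whether this explains the crash is an assumption, not a confirmed root cause):

import pandas as pd

pdf = pd.DataFrame({"first": ["1"], "second": ["2"]})

# Empty frame produced by filtering: columns keep their original dtypes
filtered = pdf[pdf["first"] == "9"]

# Empty frame built from empty lists: columns get pandas' default dtypes
fresh = pd.DataFrame({"first": [], "second": []})

print(filtered.dtypes)
print(fresh.dtypes)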
See the attached stacktrace.
Using:
Spark 2.4.0
pandas 0.19.2
PyArrow 0.8.0
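For anyone trying to reproduce this, here is the crashing case as a self-contained script; the SparkSession setup is my addition, since the snippets above assume an existing spark session:

# Self-contained version of the crashing snippet; the SparkSession
# construction is an assumption (the original assumes an existing session).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").appName("repro").getOrCreate()

df = spark.createDataFrame(
    [("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")],
    ("first", "second"),
)

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(pdf):
    # No row has first == "9", so every group returns an empty frame
    return pdf[pdf["first"] == "9"]

# Expected to raise: org.apache.spark.SparkException:
# Python worker exited unexpectedly (crashed)
df.groupby("second").apply(filter_pandas).count()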
Issue Links
- duplicates SPARK-25147: GroupedData.apply pandas_udf crashing (Resolved)