Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.2
Fix Version/s: None
Component/s: Input/Output
Labels: None
I am currently running PySpark 2.4.2 and am unable to run the following code.
from pyspark.sql import SparkSession, functions, types
import pandas as pd
import random

# In the pyspark shell `spark` is predefined; otherwise create a session.
spark = SparkSession.builder.getOrCreate()

# initialize test data
test_data = [[False, int(random.random() * 2)] for i in range(10000)]
test_data = pd.DataFrame(test_data, columns=['bool_value', 'int_value'])

# pandas udaf
pandas_any_udf = functions.pandas_udf(
    lambda x: x.any(),
    types.BooleanType(),
    functions.PandasUDFType.GROUPED_AGG)

# create spark DataFrame and build the query
test_df = spark.createDataFrame(test_data)
test_df = test_df.groupby('int_value').agg(
    pandas_any_udf('bool_value').alias('bool_any_result'))
test_df = test_df.filter(functions.col('bool_any_result') == True)

# write to output
test_df.write.parquet('/tmp/mtong/write_test', mode='overwrite')
Below is a truncated error message.
Py4JJavaError: An error occurred while calling o1125.parquet.
: org.apache.spark.SparkException: Job aborted.
...
Exchange hashpartitioning(int_value#123L, 2000)
+- *(1) Filter (<lambda>(bool_value#122) = true)
   +- Scan ExistingRDD arrow[bool_value#122,int_value#123L]
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, boolean, true])
What appears to be happening is that the query optimizer incorrectly pushes the filter predicate on bool_any_result down below the group-by, so the pandas UDAF lambda is evaluated as a filter over the raw input columns. This causes the query to fail during planning, before Spark attempts to execute it. I have also run a variant of this query with functions.count() as the aggregation function and the query ran fine, so I believe this error only affects pandas UDAFs.
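The misplaced filter can be inspected without triggering the write by printing the extended query plans (explain(True) is standard DataFrame API); in the physical plan the Filter node appears directly over the scan of bool_value rather than over the aggregate output:

# Inspect parsed/analyzed/optimized/physical plans for the
# aggregated-and-filtered DataFrame before attempting the write.
test_df.explain(True)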
Variant of the query with a standard aggregation function:
test_df = spark.createDataFrame(test_data)
test_df = test_df.groupby('int_value').agg(
    functions.count('bool_value').alias('bool_counts'))
test_df = test_df.filter(functions.col('bool_counts') > 0)
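As a possible workaround (a sketch only, not verified against this bug), persisting the aggregated DataFrame before filtering may keep the optimizer from pushing the predicate below the group-by, since the filter is then planned against the materialized aggregate output; persist() and filter() are ordinary DataFrame API calls here:

# Possible workaround sketch (assumption: caching the aggregate breaks the
# faulty predicate pushdown; not verified against this ticket).
agg_df = spark.createDataFrame(test_data) \
    .groupby('int_value') \
    .agg(pandas_any_udf('bool_value').alias('bool_any_result')) \
    .persist()
agg_df.filter(functions.col('bool_any_result') == True) \
    .write.parquet('/tmp/mtong/write_test', mode='overwrite')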