Details
Description
In PySpark, filtering on a UDF-derived column after some join types
makes the optimized logical plan fail with a java.lang.UnsupportedOperationException.
I could not replicate this in Scala code from the shell, only in Python. It is a PySpark regression from Spark 1.6.2.
This can be replicated with: bin/spark-submit bug.py
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()
    left = spark.createDataFrame([Row(a=1)])
    right = spark.createDataFrame([Row(a=1)])
    df = left.join(right, on='a', how='left_outer')
    df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
    df = df.filter('b = "x"')
    df.explain(extended=True)
The output is:
== Parsed Logical Plan ==
'Filter ('b = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Analyzed Logical Plan ==
a: bigint, b: string
Filter (b#8 = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Optimized Logical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, bigint, true])

== Physical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, bigint, true])
It fails when the join is:
- how='outer', on=column expression
- how='left_outer', on=string or column expression
- how='right_outer', on=string or column expression
It passes when the join is:
- how='inner', on=string or column expression
- how='outer', on=string
I made some tests to demonstrate each of these.
Run with bin/spark-submit test_bug.py
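For readers without the attachment, the pass/fail lists above could be checked with something like the following. This is only a rough sketch, not the attached test_bug.py: the names REPORTED, build_filtered_df, plan_optimizes and run_matrix are made up for illustration, the "column expression" case is assumed to mean on=(left.a == right.a), and calling collect() is assumed to surface the same exception that explain() prints.

```python
# Outcomes as listed in this report (PySpark only).
# Key: (how, kind of `on` argument). Value: True if the query is reported to pass.
REPORTED = {
    ("inner", "string"): True,
    ("inner", "column"): True,
    ("outer", "string"): True,
    ("outer", "column"): False,
    ("left_outer", "string"): False,
    ("left_outer", "column"): False,
    ("right_outer", "string"): False,
    ("right_outer", "column"): False,
}


def build_filtered_df(spark, how, on_kind):
    """Rebuild the join + UDF column + filter from the reproduction script."""
    # Imported lazily so the table above can be inspected without Spark installed.
    import pyspark.sql.functions as F
    from pyspark.sql import Row

    left = spark.createDataFrame([Row(a=1)])
    right = spark.createDataFrame([Row(a=1)])
    # Assumption: "column expression" means an equality between the two `a` columns.
    on = "a" if on_kind == "string" else (left.a == right.a)
    df = left.join(right, on=on, how=how)
    # A string join keeps a single `a`; a column-expression join keeps both,
    # so the UDF input is taken from `left` to avoid an ambiguous reference.
    key = df["a"] if on_kind == "string" else left["a"]
    df = df.withColumn("b", F.udf(lambda x: "x")(key))
    return df.filter('b = "x"')


def plan_optimizes(df):
    """Return True if the query runs, False if planning or execution raises."""
    try:
        df.collect()
        return True
    except Exception:
        return False


def run_matrix(spark):
    """Map each case to (reported outcome, observed outcome)."""
    results = {}
    for (how, on_kind), reported in sorted(REPORTED.items()):
        observed = plan_optimizes(build_filtered_df(spark, how, on_kind))
        results[(how, on_kind)] = (reported, observed)
    return results
```

Given a SparkSession named spark, run_matrix(spark) returns a dict that can be printed from a script submitted with bin/spark-submit. Any case where the two values differ would mean the behaviour no longer matches this report.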
Attachments
Issue Links
- is related to SPARK-18589 persist() resolves "java.lang.RuntimeException: Invalid PythonUDF <lambda>(...), requires attributes from more than one child" (Resolved)