Description
The problem occurs with the simple code below:
    import session.implicits._

    Seq(
      (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
    ).toDF()
      .checkpoint() // or save and reload to truncate lineage
      .createOrReplaceTempView("sub")

    session.sql("""
      SELECT
        *
      FROM
      (
        SELECT
          EXPLODE( ARRAY( * ) ) result
        FROM
        (
          SELECT
            _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k,
            _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
          FROM
            sub
        )
      )
      WHERE
        result != ''
    """).show()
The query takes several minutes and uses a very large amount of Java heap, when it should complete almost immediately.
The problem does not occur when the only integer value (1) is replaced with a string value ("x").
All the time is spent in the PruneFilters optimization rule.
Not reproduced in Spark 2.4.1.
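One possible way to check the PruneFilters observation is to read Catalyst's per-rule timing report after optimizing the query. The sketch below is only an illustration, assuming Spark 3.x on the classpath. The object and method names (PruneFiltersDiagnostics, optimizerTimings, excludePruneFilters) are hypothetical helpers, not Spark APIs. It also assumes that PruneFilters can be excluded through spark.sql.optimizer.excludedRules in the version under test.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Hypothetical helper object, not part of Spark.
object PruneFiltersDiagnostics {

  // Optimizes `query` without executing it and returns the per-rule
  // timing report accumulated by Catalyst's RuleExecutor.
  def optimizerTimings(session: SparkSession, query: String): String = {
    RuleExecutor.resetMetrics()                      // drop timings from earlier queries
    session.sql(query).queryExecution.optimizedPlan  // forces analysis and optimization
    RuleExecutor.dumpTimeSpent()                     // report of time spent per rule
  }

  // Cross-check: exclude the suspected rule, then optimize the query again.
  // Assumes the rule is not in the optimizer's non-excludable set.
  def excludePruneFilters(session: SparkSession): Unit =
    session.conf.set(
      "spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.PruneFilters")
}
```

With the SQL text of the reproduction held in a string, println(PruneFiltersDiagnostics.optimizerTimings(session, query)) should list PruneFilters among the most expensive rules if the observation holds. Calling excludePruneFilters(session) before a second run shows whether the slowdown disappears without that rule.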
Issue Links
- relates to SPARK-40262 Expensive UDF evaluation pushed down past a join leads to performance issues (Open)
- links to