[SPARK-37392] Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)" - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.2, 3.2.0
Fix Version/s: 3.1.3, 3.2.1, 3.3.0
Component/s: Optimizer
Labels:
None

Description

The problem occurs with the simple code below:

import session.implicits._

Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()

It takes several minutes and a very high Java heap usage, when it should be immediate.

It does not occur when replacing the unique integer value (1) with a string value ("x").

All the time is spent in the PruneFilters optimization rule.

Not reproduced in Spark 2.4.1.

Attachments

Issue Links

relates to

SPARK-40262 Expensive UDF evaluation pushed down past a join leads to performance issues

Open

links to

[Github] Pull Request #34823 (cloud-fan)

Activity

People

Assignee:: Wenchen Fan

Reporter:: Francois MARTIN

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 19/Nov/21 19:58

Updated:: 31/Aug/22 22:52

Resolved:: 08/Dec/21 05:07