Spark / SPARK-37392

Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)"


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.2, 3.2.0
    • Fix Version/s: 3.1.3, 3.2.1, 3.3.0
    • Component/s: Optimizer
    • Labels: None

    Description

      The problem occurs with the simple code below:

      import session.implicits._
      
      Seq(
        (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
      ).toDF()
        .checkpoint() // or save and reload to truncate lineage
        .createOrReplaceTempView("sub")
      
      session.sql("""
        SELECT
          *
        FROM
        (
          SELECT
            EXPLODE( ARRAY( * ) ) result
          FROM
          (
            SELECT
              _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
            FROM
              sub
          )
        )
        WHERE
          result != ''
        """).show() 

      The query takes several minutes and uses a very large amount of Java heap, when it should complete immediately.

      The problem does not occur when the single integer value (1) is replaced with a string value ("x").
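
      For reference, the non-reproducing variant simply replaces the integer with a string (a sketch based on the observation above; presumably ARRAY(*) then needs no implicit casts, since all columns already share the string type):

      Seq(
        ("x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
      ).toDF()
        .checkpoint() // truncate lineage, as in the original repro
        .createOrReplaceTempView("sub")
      // Running the same SQL query as above then returns immediately.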

      All the time is spent in the PruneFilters optimization rule.
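
      One way to confirm where the optimizer spends its time is Catalyst's per-rule metering (RuleExecutor is an internal API, so this is only a diagnostic sketch):

      import org.apache.spark.sql.catalyst.rules.RuleExecutor

      RuleExecutor.resetMetrics() // clear accumulated per-rule timings
      // ... run the problematic query above ...
      // dumpTimeSpent() prints the time spent in each analyzer/optimizer rule;
      // PruneFilters dominates in this reproduction.
      println(RuleExecutor.dumpTimeSpent())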

      Not reproduced in Spark 2.4.1.


    People

      Assignee: Wenchen Fan (cloud_fan)
      Reporter: Francois MARTIN (martinf)
      Votes: 0
      Watchers: 7
