Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26626

Repeated use of aliases can cause driver OOMs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.4.0
    • None
    • Optimizer, SQL

    Description

      If a query contains many aliases that are used multiple times, it can cause the driver to OOM, as catalyst will recursively substitute the aliases, making the expression tree size grow exponentially.

      For example:

       

      scala> var df = Seq(1, 2, 3).toDF("a").withColumn("b", lit(10)).cache()
      df: org.apache.spark.sql.DataFrame = [a: int, b: int]
      scala> for( i <- 1 to 5 ) {
       | df = df.select(('a + 'b).as('a), ('a - 'b).as('b))
       | }
      scala> df.queryExecution.analyzed
      res11: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
      Project [(a#526 + b#527) AS a#530, (a#526 - b#527) AS b#531]
      +- Project [(a#522 + b#523) AS a#526, (a#522 - b#523) AS b#527]
       +- Project [(a#518 + b#519) AS a#522, (a#518 - b#519) AS b#523]
       +- Project [(a#514 + b#515) AS a#518, (a#514 - b#515) AS b#519]
       +- Project [(a#509 + b#511) AS a#514, (a#509 - b#511) AS b#515]
       +- Project [a#509, 10 AS b#511]
       +- Project [value#507 AS a#509]
       +- LocalRelation [value#507]
      scala> df.queryExecution.optimizedPlan
      res12: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
      Project [(((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) + ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS a#530, (((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) - ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS b#531]
      +- InMemoryRelation [a#509, b#511], StorageLevel(disk, memory, deserial...
      

       

       

      In larger real-world instances of this, the expression tree size can explode so large as to OOM the driver.

       

      This is caused by CollapseProject and PhysicalOperation recursively substituting all aliases, without consideration for the effect on the size of the expression tree.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jrickard Jesse Rickard
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: