Details
- Bug
- Status: Resolved
- Major
- Resolution: Incomplete
- 2.4.0
- None
Description
If a query contains many aliases that are each referenced multiple times, the driver can run out of memory (OOM): Catalyst recursively substitutes the aliases, so the expression tree grows exponentially.
For example:
scala> var df = Seq(1, 2, 3).toDF("a").withColumn("b", lit(10)).cache()
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> for( i <- 1 to 5 ) {
     |   df = df.select(('a + 'b).as('a), ('a - 'b).as('b))
     | }

scala> df.queryExecution.analyzed
res11: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [(a#526 + b#527) AS a#530, (a#526 - b#527) AS b#531]
+- Project [(a#522 + b#523) AS a#526, (a#522 - b#523) AS b#527]
   +- Project [(a#518 + b#519) AS a#522, (a#518 - b#519) AS b#523]
      +- Project [(a#514 + b#515) AS a#518, (a#514 - b#515) AS b#519]
         +- Project [(a#509 + b#511) AS a#514, (a#509 - b#511) AS b#515]
            +- Project [a#509, 10 AS b#511]
               +- Project [value#507 AS a#509]
                  +- LocalRelation [value#507]

scala> df.queryExecution.optimizedPlan
res12: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [(((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) + ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS a#530, (((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) - ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS b#531]
+- InMemoryRelation [a#509, b#511], StorageLevel(disk, memory, deserial...
In larger real-world instances of this, the expression tree can grow large enough to exhaust the driver's memory.
This is caused by CollapseProject and PhysicalOperation recursively substituting all aliases, without consideration for the effect on the size of the expression tree.
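To make the growth rate concrete, here is a minimal sketch of the substitution pattern using a toy expression tree. It is not Catalyst code: the names Expr, Attr, Add, Sub, AliasBlowup, substitute and collapsedSizes are hypothetical and exist only for this illustration. It assumes each stacked projection has the same shape as the example above (a = a + b, b = a - b) and that collapsing inlines every alias reference.

```scala
// Toy model of alias substitution. This is NOT Catalyst code; all names
// here are hypothetical and chosen only for illustration.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Sub(left: Expr, right: Expr) extends Expr

object AliasBlowup {
  // Number of nodes in the expression when viewed as a tree.
  def size(e: Expr): Long = e match {
    case Attr(_)   => 1L
    case Add(l, r) => 1L + size(l) + size(r)
    case Sub(l, r) => 1L + size(l) + size(r)
  }

  // Replace each attribute reference with the expression it aliases.
  def substitute(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Attr(n)   => aliases.getOrElse(n, e)
    case Add(l, r) => Add(substitute(l, aliases), substitute(r, aliases))
    case Sub(l, r) => Sub(substitute(l, aliases), substitute(r, aliases))
  }

  // Size of the collapsed expression for `a` after 1, 2, ..., `levels`
  // stacked projections of the form: a = a + b, b = a - b.
  def collapsedSizes(levels: Int): List[Long] =
    Iterator
      .iterate((Attr("a"): Expr, Attr("b"): Expr)) { case (a, b) =>
        val aliases = Map("a" -> a, "b" -> b)
        (substitute(Add(Attr("a"), Attr("b")), aliases),
         substitute(Sub(Attr("a"), Attr("b")), aliases))
      }
      .drop(1)       // skip the base relation, which has no projection yet
      .take(levels)
      .map { case (a, _) => size(a) }
      .toList
}
```

In this model, AliasBlowup.collapsedSizes(5) yields List(3, 7, 15, 31, 63): each additional projection roughly doubles the tree, since the size after n levels is 2^(n+1) - 1. The 63 nodes at level 5 match the 32 attribute references and 31 operators in the optimized expression for a#530 above.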