[SPARK-50229] Reduce memory usage on driver for wide schemas by reducing the lifetime of AttributeReference objects created during logical planning - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: Optimizer
Labels:
- pull-request-available

Description

The allAttributes method in QueryPlan (code) unions the output of all of its children. Although this is okay in an optimized plan, in a pre-optimized analyzed plan, these attributes add up multiplicatively with the size of the plan. Because the `output` is usually defined as a `def`, each node’s allAttributes also ends up with a distinct copy of each attribute, potentially causing significant memory pressure on the driver (especially under concurrency and with wide tables).

Here is a simple example with TPC-DS Q42. SQL:

select dt.d_year, item.i_category_id, item.i_category, sum(ss_ext_sales_price)
 from 	date_dim dt, store_sales, item
 where dt.d_date_sk = store_sales.ss_sold_date_sk
 	and store_sales.ss_item_sk = item.i_item_sk
 	and item.i_manager_id = 1
 	and dt.d_moy=11
 	and dt.d_year=2000
 group by 	dt.d_year
 		,item.i_category_id
 		,item.i_category
 order by       sum(ss_ext_sales_price) desc,dt.d_year
 		,item.i_category_id
 		,item.i_category
 limit 100

If we print out the size of each operator’s output and the size of its `allAttributes`:

GlobalLimit: allAttrs: 4, output: 4
LocalLimit: allAttrs: 4, output: 4
Sort: allAttrs: 4, output: 4
Aggregate: allAttrs: 73, output: 4
Filter: allAttrs: 73, output: 73
Join: allAttrs: 73, output: 73
Join: allAttrs: 51, output: 51
SubqueryAlias: allAttrs: 28, output: 28
SubqueryAlias: allAttrs: 28, output: 28
LogicalRelation: allAttrs: 0, output: 28
SubqueryAlias: allAttrs: 23, output: 23
LogicalRelation: allAttrs: 0, output: 23
SubqueryAlias: allAttrs: 22, output: 22
LogicalRelation: allAttrs: 0, output: 22

Note how the joins and aggregate have 73 attributes each, by adding the width of each relation. For queries with wide schemas, this issue is much worse. Optimized plans after column pruning look far better:

Aggregate: allAttrs: 0, output: 1
Project: allAttrs: 2, output: 0
Join: allAttrs: 2, output: 2
Project: allAttrs: 3, output: 1
Join: allAttrs: 3, output: 3
Project: allAttrs: 4, output: 2
Join: allAttrs: 4, output: 4
Project: allAttrs: 5, output: 3
Join: allAttrs: 5, output: 5
Project: allAttrs: 23, output: 4
Filter: allAttrs: 23, output: 23
LogicalRelation: allAttrs: 0, output: 23
Project: allAttrs: 29, output: 1
Filter: allAttrs: 29, output: 29
LogicalRelation: allAttrs: 0, output: 29
Project: allAttrs: 13, output: 1
Filter: allAttrs: 13, output: 13
LogicalRelation: allAttrs: 0, output: 13
Project: allAttrs: 22, output: 1
Filter: allAttrs: 22, output: 22
LogicalRelation: allAttrs: 0, output: 22
Project: allAttrs: 18, output: 1
Filter: allAttrs: 18, output: 18
LogicalRelation: allAttrs: 0, output: 18

Attachments

Issue Links

links to

GitHub Pull Request #48762

Activity

People

Assignee:: Utkarsh Agarwal

Reporter:: Utkarsh Agarwal

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 05/Nov/24 17:11

Updated:: 06/Nov/24 04:01

Resolved:: 06/Nov/24 04:01