Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-50229

Reduce memory usage on driver for wide schemas by reducing the lifetime of AttributeReference objects created during logical planning

    XMLWordPrintableJSON

Details

    Description

      The allAttributes method in QueryPlan (code) unions the output of all of its children. Although this is okay in an optimized plan, in a pre-optimized analyzed plan, these attributes add up multiplicatively with the size of the plan. Because the `output` is usually defined as a `def`, each node’s allAttributes also ends up with a distinct copy of each attribute, potentially causing significant memory pressure on the driver (especially under concurrency and with wide tables).

      Here is a simple example with TPC-DS Q42. SQL:

       

      select dt.d_year, item.i_category_id, item.i_category, sum(ss_ext_sales_price)
       from 	date_dim dt, store_sales, item
       where dt.d_date_sk = store_sales.ss_sold_date_sk
       	and store_sales.ss_item_sk = item.i_item_sk
       	and item.i_manager_id = 1
       	and dt.d_moy=11
       	and dt.d_year=2000
       group by 	dt.d_year
       		,item.i_category_id
       		,item.i_category
       order by       sum(ss_ext_sales_price) desc,dt.d_year
       		,item.i_category_id
       		,item.i_category
       limit 100
      

      If we print out the size of each operator’s output and the size of its `allAttributes`:

      GlobalLimit: allAttrs: 4, output: 4
      LocalLimit: allAttrs: 4, output: 4
      Sort: allAttrs: 4, output: 4
      Aggregate: allAttrs: 73, output: 4
      Filter: allAttrs: 73, output: 73
      Join: allAttrs: 73, output: 73
      Join: allAttrs: 51, output: 51
      SubqueryAlias: allAttrs: 28, output: 28
      SubqueryAlias: allAttrs: 28, output: 28
      LogicalRelation: allAttrs: 0, output: 28
      SubqueryAlias: allAttrs: 23, output: 23
      LogicalRelation: allAttrs: 0, output: 23
      SubqueryAlias: allAttrs: 22, output: 22
      LogicalRelation: allAttrs: 0, output: 22

       

      Note how the joins and aggregate have 73 attributes each, by adding the width of each relation. For queries with wide schemas, this issue is much worse. Optimized plans after column pruning look far better:

      Aggregate: allAttrs: 0, output: 1
      Project: allAttrs: 2, output: 0
      Join: allAttrs: 2, output: 2
      Project: allAttrs: 3, output: 1
      Join: allAttrs: 3, output: 3
      Project: allAttrs: 4, output: 2
      Join: allAttrs: 4, output: 4
      Project: allAttrs: 5, output: 3
      Join: allAttrs: 5, output: 5
      Project: allAttrs: 23, output: 4
      Filter: allAttrs: 23, output: 23
      LogicalRelation: allAttrs: 0, output: 23
      Project: allAttrs: 29, output: 1
      Filter: allAttrs: 29, output: 29
      LogicalRelation: allAttrs: 0, output: 29
      Project: allAttrs: 13, output: 1
      Filter: allAttrs: 13, output: 13
      LogicalRelation: allAttrs: 0, output: 13
      Project: allAttrs: 22, output: 1
      Filter: allAttrs: 22, output: 22
      LogicalRelation: allAttrs: 0, output: 22
      Project: allAttrs: 18, output: 1
      Filter: allAttrs: 18, output: 18
      LogicalRelation: allAttrs: 0, output: 18 

       

      Attachments

        Issue Links

          Activity

            People

              utkarsh39 Utkarsh Agarwal
              utkarsh39 Utkarsh Agarwal
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: