Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-50229

Reduce memory usage on driver for wide schemas by reducing the lifetime of AttributeReference objects created during logical planning

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      The allAttributes method in QueryPlan (code) unions the output of all of its children. Although this is okay in an optimized plan, in a pre-optimized analyzed plan, these attributes add up multiplicatively with the size of the plan. Because the `output` is usually defined as a `def`, each node’s allAttributes also ends up with a distinct copy of each attribute, potentially causing significant memory pressure on the driver (especially under concurrency and with wide tables).

      Here is a simple example with TPC-DS Q42. SQL:

       

      select dt.d_year, item.i_category_id, item.i_category, sum(ss_ext_sales_price)
       from 	date_dim dt, store_sales, item
       where dt.d_date_sk = store_sales.ss_sold_date_sk
       	and store_sales.ss_item_sk = item.i_item_sk
       	and item.i_manager_id = 1
       	and dt.d_moy=11
       	and dt.d_year=2000
       group by 	dt.d_year
       		,item.i_category_id
       		,item.i_category
       order by       sum(ss_ext_sales_price) desc,dt.d_year
       		,item.i_category_id
       		,item.i_category
       limit 100
      

      If we print out the size of each operator’s output and the size of its `allAttributes`:

      GlobalLimit: allAttrs: 4, output: 4
      LocalLimit: allAttrs: 4, output: 4
      Sort: allAttrs: 4, output: 4
      Aggregate: allAttrs: 73, output: 4
      Filter: allAttrs: 73, output: 73
      Join: allAttrs: 73, output: 73
      Join: allAttrs: 51, output: 51
      SubqueryAlias: allAttrs: 28, output: 28
      SubqueryAlias: allAttrs: 28, output: 28
      LogicalRelation: allAttrs: 0, output: 28
      SubqueryAlias: allAttrs: 23, output: 23
      LogicalRelation: allAttrs: 0, output: 23
      SubqueryAlias: allAttrs: 22, output: 22
      LogicalRelation: allAttrs: 0, output: 22

       

      Note how the joins and aggregate have 73 attributes each, by adding the width of each relation. For queries with wide schemas, this issue is much worse. Optimized plans after column pruning look far better:

      Aggregate: allAttrs: 0, output: 1
      Project: allAttrs: 2, output: 0
      Join: allAttrs: 2, output: 2
      Project: allAttrs: 3, output: 1
      Join: allAttrs: 3, output: 3
      Project: allAttrs: 4, output: 2
      Join: allAttrs: 4, output: 4
      Project: allAttrs: 5, output: 3
      Join: allAttrs: 5, output: 5
      Project: allAttrs: 23, output: 4
      Filter: allAttrs: 23, output: 23
      LogicalRelation: allAttrs: 0, output: 23
      Project: allAttrs: 29, output: 1
      Filter: allAttrs: 29, output: 29
      LogicalRelation: allAttrs: 0, output: 29
      Project: allAttrs: 13, output: 1
      Filter: allAttrs: 13, output: 13
      LogicalRelation: allAttrs: 0, output: 13
      Project: allAttrs: 22, output: 1
      Filter: allAttrs: 22, output: 22
      LogicalRelation: allAttrs: 0, output: 22
      Project: allAttrs: 18, output: 1
      Filter: allAttrs: 18, output: 18
      LogicalRelation: allAttrs: 0, output: 18 

       

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            utkarsh39 Utkarsh Agarwal
            utkarsh39 Utkarsh Agarwal
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment