Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-12978

Skip unnecessary final group-by when input data already clustered with group-by keys

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • None
    • None
    • SQL
    • None

    Description

      This ticket targets the optimization to skip an unnecessary group-by operation below;

      Without opt.:

      == Physical Plan ==
      TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
      +- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
         +- TungstenExchange hashpartitioning(col0#159,200), None
            +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
      

      With opt.:

      == Physical Plan ==
      TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
      +- TungstenExchange hashpartitioning(col0#159,200), None
        +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
      

      Attachments

        Issue Links

          Activity

            People

              maropu Takeshi Yamamuro
              maropu Takeshi Yamamuro
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: