Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-7204

Add support for GROUP BY ROLLUP

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.0, Impala 2.12.0
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:

      Description

      Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:

      select region, state, product, sum(sales) total_sales
      from sales_history 
      group by rollup (region, state, product)
      

      Semantically, the above query is equivalent to

       

      select region, state, product, sum(sales) total_sales
      from sales_history 
      group by region, state, product
      union
      select region, state, null, sum(sales) total_sales
      from sales_history 
      group by region, state
      union
      select region, null, null, sum(sales) total_sales
      from sales_history 
      group by region
      union
      select null, null, null, sum(sales) total_sales
      from sales_history
       
      

      The query might produce results that looked something like:

      REGION STATE PRODUCT TOTAL_SALES
      ------ ----- ------- -----------
      null null null 6200
      EAST MA BOATS 100
      EAST MA CARS 1500
      EAST MA null 1600
      EAST NY BOATS 150
      EAST NY CARS 1000
      EAST NY null 1150
      EAST null null 2750
      WEST CA BOATS 750
      WEST CA CARS 500
      WEST CA null 1250
      WEST AZ BOATS 2000
      WEST AZ CARS 200
      WEST AZ null 2200
      WEST null null 3450
      
      

      We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up).

      I can't find another SQL on Hadoop engine that doesn't support this feature.
      Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL feature.

      Would be great to have a matching feature in Impala too.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                Tagar Ruslan Dautkhanov
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated: