Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2791

Avoid unnecessary two-phased aggregation.

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • Impala 2.2, Impala 2.3.0
    • None
    • Frontend

    Description

      We perform a two-phased aggregation for evaluating distinct aggregate expressions like count(distinct). However, if the distinct aggregate expression is not referenced in enclosing query blocks, then the two-phased aggregation is unnecessary and should be skipped.

      Example:

       explain select x from
        (select count(int_col) x, count(distinct bigint_col) y from functional.alltypes) v;
      +-----------------------------------------------------------+
      | Explain String                                            |
      +-----------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=170.00MB VCores=2 |
      |                                                           |
      | 06:AGGREGATE [FINALIZE]                                   |
      | |  output: count:merge(bigint_col), count:merge(int_col)  |
      | |                                                         |
      | 05:EXCHANGE [UNPARTITIONED]                               |
      | |                                                         |
      | 02:AGGREGATE                                              |
      | |  output: count(bigint_col), count:merge(int_col)        |
      | |                                                         |
      | 04:AGGREGATE                                              |
      | |  output: count:merge(int_col)                           |
      | |  group by: bigint_col                                   |
      | |                                                         |
      | 03:EXCHANGE [HASH(bigint_col)]                            |
      | |                                                         |
      | 01:AGGREGATE                                              |
      | |  output: count(int_col)                                 |
      | |  group by: bigint_col                                   |
      | |                                                         |
      | 00:SCAN HDFS [functional.alltypes]                        |
      |    partitions=24/24 files=24 size=478.45KB                |
      +-----------------------------------------------------------+
      

      In the query above, it is unnecessary to compute the "count(distinct bigint_col)" aggregate expression, so a single-phased aggregation would be sufficient.

      One way to fix this issue would be to defer creation of the AggregateInfo to the planning phase where the materialization of aggregate expressions is known. Currently, we create the AggregateInfo during analysis. Retroactively "fixing" an AggregateInfo during planning to remover the two phases seems complicated.

      This limitation inhibits other optimizations such as IMPALA-2499.

      Attachments

        Activity

          People

            Unassigned Unassigned
            alex.behm Alexander Behm
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: