[HIVE-11794] GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: None
Labels: None

Description

The code in Vectorizer reads:

    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);

Then, on the reduce side:

    if (isMergePartial) {
        // Reduce Merge-Partial GROUP BY.
        // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle.  It is the
        // first (or root) operator for its reduce task.
        ....
    } else {
        // Reduce Hash GROUP BY or global aggregation.
        ...
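
To make the effect of this check concrete, the sketch below applies the same `!= HASH` comparison to every mode name. The nested `Mode` enum is a hypothetical stand-in that mirrors the names of `GroupByDesc.Mode` as I understand them; it is not the Hive class, and `MergePartialCheckSketch` is an illustrative name only.

```java
import java.util.EnumMap;
import java.util.Map;

class MergePartialCheckSketch {
    // Hypothetical stand-in for GroupByDesc.Mode (assumed names, not the Hive class).
    enum Mode { COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL }

    // Same comparison as the Vectorizer line quoted above.
    static boolean isMergePartial(Mode mode) {
        return mode != Mode.HASH;
    }

    // Evaluates the check for every mode so the grouping is visible at a glance.
    static Map<Mode, Boolean> classifyAll() {
        Map<Mode, Boolean> result = new EnumMap<>(Mode.class);
        for (Mode m : Mode.values()) {
            result.put(m, isMergePartial(m));
        }
        return result;
    }

    public static void main(String[] args) {
        classifyAll().forEach((m, v) ->
            System.out.println(m + " -> isMergePartial=" + v));
    }
}
```

Under this comparison every mode other than HASH, including COMPLETE, lands on the merge-partial branch.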

In fact, this logic does not account for the COMPLETE mode. Judging both from the comment:

 COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial

and from an explain plan like this one (the query has multiple stages of aggregation over a union; the mapper does a partial hash aggregation for each side of the union, followed by a mergepartial stage and then a second stage as complete):

Map Operator Tree:
...
        Group By Operator
          keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
          mode: hash
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
          Reduce Output Operator
...
feeding into

Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

it seems that COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
So reduce-side COMPLETE should probably be handled on the else path of the if above. On the map side, the code doesn't check the mode at all, as far as I can see.
I'm not sure whether additional code changes are necessary after that; it may just work.
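
As a rough illustration of the suggestion, the sketch below contrasts the routing implied by the quoted check with one that sends COMPLETE to the else path. This is only an assumption about what a fix could look like, not the content of the attached patches. `ReduceGroupByRoutingSketch`, `Path`, `currentPath`, and `suggestedPath` are hypothetical names, and the nested `Mode` enum is a stand-in for `GroupByDesc.Mode`, not the Hive class.

```java
class ReduceGroupByRoutingSketch {
    // Hypothetical stand-in for GroupByDesc.Mode (assumed names, not the Hive class).
    enum Mode { COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL }

    // The two reduce-side branches from the quoted if/else.
    enum Path { MERGE_PARTIAL, HASH_OR_GLOBAL }

    // Routing implied by the quoted check: everything except HASH is merge-partial.
    static Path currentPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH);
        return isMergePartial ? Path.MERGE_PARTIAL : Path.HASH_OR_GLOBAL;
    }

    // Suggested routing (an assumption): COMPLETE joins HASH on the else path.
    static Path suggestedPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH && mode != Mode.COMPLETE);
        return isMergePartial ? Path.MERGE_PARTIAL : Path.HASH_OR_GLOBAL;
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + ": current=" + currentPath(m)
                + ", suggested=" + suggestedPath(m));
        }
    }
}
```

The only mode whose routing differs between the two methods is COMPLETE; whether the else path then handles it correctly without further changes is the open question raised above.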

Attachments

- HIVE-11794.01.patch (15/Sep/15 22:19, 67 kB, Sergey Shelukhin)
- HIVE-11794.patch (11/Sep/15 23:21, 59 kB, Sergey Shelukhin)

Issue Links

relates to: HIVE-13713 We miss vectorization in a case of count(*) when aggregation mode is COMPLETE (Closed)

People

Assignee: Sergey Shelukhin
Reporter: Sergey Shelukhin
Votes: 0
Watchers: 3

Dates

Created: 11/Sep/15 00:46
Updated: 08/May/16 06:12
Resolved: 23/Sep/15 01:12