Hive / HIVE-11794

GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None

    Description

      The relevant code in Vectorizer reads:

          boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
      

      then, if it's reduce side:

          if (isMergePartial) {
            // Reduce Merge-Partial GROUP BY.
            // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle.  It is the
            // first (or root) operator for its reduce task.
            ....
          } else {
            // Reduce Hash GROUP BY or global aggregation.
            ...
      

      In fact, this logic does not account for the COMPLETE mode. Judging both from the comment describing the modes:

       COMPLETE: complete 1-phase aggregation: iterate, terminate
      ...
      HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
      ...
      PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
      

      and from an explain plan like the following (the query has multiple stages of aggregation over a union; the mapper does a partial hash aggregation for each side of the union, which is followed on the reducer by a mergepartial aggregation and then a second-stage aggregation in complete mode):

      Map Operator Tree:
      ...
              Group By Operator
                keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
                Reduce Output Operator
      ...
      feeding into
      
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
          Group By Operator
            aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
            keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
            mode: complete
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
      

      it seems that COMPLETE is actually the global aggregation, and HASH is not (or may not be).
      So reduce-side COMPLETE should apparently be handled on the else path of the if shown above. On the map side, the Vectorizer does not check the mode at all, as far as I can see.
      I am not sure whether additional code changes are necessary after that; it may just work.
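
To make the suspected misclassification concrete, here is a minimal, self-contained sketch of the reduce-side branch selection. It is not the Vectorizer code and not the attached patch: the `Mode` enum is a stand-in for GroupByDesc.Mode limited to the modes named in this issue, and `Path`, `currentPath`, and `suggestedPath` are hypothetical names introduced only for illustration. The suggested condition is one possible reading of the description, under the assumption that COMPLETE belongs on the else path.

```java
public class ReduceGroupByModeSketch {
    // Stand-in for GroupByDesc.Mode, limited to the modes named in this issue.
    enum Mode { COMPLETE, HASH, PARTIAL1, MERGEPARTIAL }

    // Hypothetical labels for the two reduce-side branches of the quoted if/else.
    enum Path { MERGE_PARTIAL, HASH_OR_GLOBAL }

    // Mirrors the quoted check: every mode except HASH is treated as merge-partial.
    static Path currentPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH);
        return isMergePartial ? Path.MERGE_PARTIAL : Path.HASH_OR_GLOBAL;
    }

    // Assumed variant (not the committed fix): COMPLETE also takes the else path.
    static Path suggestedPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH && mode != Mode.COMPLETE);
        return isMergePartial ? Path.MERGE_PARTIAL : Path.HASH_OR_GLOBAL;
    }

    public static void main(String[] args) {
        // Print how each mode is routed under the current and the assumed check.
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": current=" + currentPath(mode)
                + ", suggested=" + suggestedPath(mode));
        }
    }
}
```

In this sketch only the COMPLETE row differs between the two checks, which matches the second-stage Group By Operator (mode: complete) in the plan above. Whether the else path then handles COMPLETE correctly without further changes is left open, as in the description.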

      Attachments

        1. HIVE-11794.01.patch
          67 kB
          Sergey Shelukhin
        2. HIVE-11794.patch
          59 kB
          Sergey Shelukhin

            People

              Assignee: sershe Sergey Shelukhin
              Reporter: sershe Sergey Shelukhin
              Votes: 0
              Watchers: 3
