Hive / HIVE-24471

Add support for combiner in hash mode group aggregation

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Hive

      Description

      In map-side group aggregation, a partial grouped aggregation is computed to reduce the amount of data the map task writes to disk. In hash aggregation, where the input data is not sorted, a hash table is used (sorting is also performed before flushing). If the hash table grows beyond a configurable limit, its contents are flushed to disk and a new hash table is created. If the reduction achieved by the hash table is less than the minimum hash aggregation reduction calculated at compile time, map-side aggregation is converted to streaming mode. So if the first few batches of records do not yield a significant reduction, the mode is switched to streaming. This can hurt performance if subsequent batches of records contain fewer distinct values.
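
      The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Hive's operator code: the class `HashModeSketch`, its fields, the entry-count flush threshold, the single reduction check after `checkInterval` rows, and the SUM aggregate are all assumptions made for the example. A sorted map stands in for the hash table plus the sort performed before flushing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of map-side hash aggregation with a fallback to streaming.
class HashModeSketch {
    private final int maxEntries;      // assumed flush threshold, counted in table entries
    private final int checkInterval;   // assumed number of input rows before the reduction check
    private final double minReduction; // assumed: switch if entries > rows * minReduction
    // A TreeMap stands in for "hash table + sort before flush".
    private final Map<String, Long> table = new TreeMap<>();
    private final List<String> written = new ArrayList<>(); // stands in for data written to disk
    private long rowsSeen = 0;
    private long entriesCreated = 0;
    private boolean streaming = false;

    HashModeSketch(int maxEntries, int checkInterval, double minReduction) {
        this.maxEntries = maxEntries;
        this.checkInterval = checkInterval;
        this.minReduction = minReduction;
    }

    void process(String key, long value) {
        if (streaming) {
            written.add(key + "=" + value); // streaming mode: rows pass through unaggregated
            return;
        }
        rowsSeen++;
        if (!table.containsKey(key)) {
            entriesCreated++;
        }
        table.merge(key, value, Long::sum); // SUM is an assumed aggregate
        if (rowsSeen == checkInterval && entriesCreated > rowsSeen * minReduction) {
            flush();
            streaming = true; // too little reduction so far: stop hashing for the rest of the task
        } else if (table.size() >= maxEntries) {
            flush(); // table too large: write it out and start a new one
        }
    }

    void close() {
        flush();
    }

    private void flush() {
        for (Map.Entry<String, Long> e : table.entrySet()) {
            written.add(e.getKey() + "=" + e.getValue());
        }
        table.clear();
    }

    List<String> written() {
        return written;
    }

    boolean isStreaming() {
        return streaming;
    }

    // Four distinct keys in the first four rows, then one key repeated four times.
    static HashModeSketch demo() {
        HashModeSketch op = new HashModeSketch(100, 4, 0.5);
        for (String k : new String[] {"a", "b", "c", "d"}) {
            op.process(k, 1);
        }
        for (int i = 0; i < 4; i++) {
            op.process("x", 1);
        }
        op.close();
        return op;
    }

    public static void main(String[] args) {
        HashModeSketch op = demo();
        System.out.println(op.isStreaming() + " " + op.written());
    }
}
```

      In this toy run the first four rows show no reduction, so the sketch switches to streaming, and the four later `x` rows are written as four separate records instead of one.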

      To improve performance in both hash and streaming mode, a combiner can be added to the map task after the keys are sorted. This ensures that aggregation is performed wherever possible and reduces the data written to disk.
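
      One way to picture such a combiner is a single pass over the key-sorted records that merges neighbours with equal keys. The sketch below is only an illustration under stated assumptions: `SortedCombinerSketch` and `combine` are hypothetical names, keys are non-null strings, and the aggregate is a SUM over partial values. It does not reflect how the combiner would be wired into Hive or the execution engine.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a combiner over records that are already sorted by key.
class SortedCombinerSketch {
    // Merges adjacent records with equal keys; relies on the input being key-sorted.
    static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long running = 0;
        boolean started = false;
        for (Map.Entry<String, Long> rec : sorted) {
            if (started && !rec.getKey().equals(currentKey)) {
                // Key changed: emit the finished group and start a new one.
                out.add(new SimpleEntry<String, Long>(currentKey, running));
                running = 0;
            }
            currentKey = rec.getKey();
            running += rec.getValue(); // SUM is an assumed aggregate
            started = true;
        }
        if (started) {
            out.add(new SimpleEntry<String, Long>(currentKey, running));
        }
        return out;
    }

    // Sorted map output in which the key "x" was not aggregated upstream.
    static String demo() {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>();
        sorted.add(new SimpleEntry<String, Long>("a", 1L));
        sorted.add(new SimpleEntry<String, Long>("b", 1L));
        for (int i = 0; i < 4; i++) {
            sorted.add(new SimpleEntry<String, Long>("x", 1L));
        }
        return combine(sorted).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a=1, b=1, x=4]
    }
}
```

      Because equal keys are adjacent after the sort, the pass needs only the current key and a running value, so it applies equally to flushed hash table output and to streamed rows.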

              People

              • Assignee:
                maheshk114 mahesh kumar behera
                Reporter:
                maheshk114 mahesh kumar behera
              • Votes:
                0
                Watchers:
                4

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  2h 10m