HIVE-28428

Map hash aggregation performance degradation

Details

    Description

      The following ticket was fixed to enable map hash aggregation, but performance is now worse with it enabled than with it disabled.
      https://issues.apache.org/jira/browse/HIVE-23356

      I found a few reasons for this. When there are a large number of keys, the following log lines are emitted in large volume, which affects performance and can also cause an OOM.

      2024-08-02 05:21:53,675 [INFO] [TezChild] |exec.GroupByOperator|: Hash Tbl flush: #hash table = 171000
      2024-08-02 05:21:53,713 [INFO] [TezChild] |exec.GroupByOperator|: Hash Table flushed: new size = 153900
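
One way to cut the volume of these messages is to log only a sparse subset of flush events. The following is a minimal sketch of that idea, not the change made for this ticket. The class `FlushLogThrottle` and its methods are hypothetical and are not part of Hive. It lets through the 1st, 2nd, 4th, 8th, ... flush and suppresses the rest.

```java
/** Hypothetical helper (not a Hive class): decides which flush events get a log line. */
final class FlushLogThrottle {
    private long flushCount = 0;   // flushes seen so far
    private long nextLogged = 1;   // next flush number that should be logged

    /** Call once per flush. Returns true for flush number 1, 2, 4, 8, ... */
    boolean shouldLog() {
        flushCount++;
        if (flushCount == nextLogged) {
            nextLogged *= 2;       // double the gap to the next logged flush
            return true;
        }
        return false;
    }

    long flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        FlushLogThrottle throttle = new FlushLogThrottle();
        int logged = 0;
        for (int i = 0; i < 1000; i++) {
            if (throttle.shouldLog()) {
                logged++;
            }
        }
        // 1000 flushes produce 10 log lines (flush 1, 2, 4, ..., 512).
        System.out.println(logged + " of " + throttle.flushCount() + " flushes logged");
    }
}
```

A caller would wrap the existing flush log statements in `if (throttle.shouldLog()) { ... }`. Lowering the log level of those statements is a simpler alternative with a similar effect.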
      

      By fixing this, we can improve performance as follows.
      Before:

      After:

      Also, the flush size is currently fixed, but performance can be improved by changing it depending on the data:
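
For illustration only, here is a sketch of what a data-dependent flush size might look like. It is not the approach proposed in this ticket. The class `FlushSizePolicy`, its constants, and the choice of the distinct-key ratio as the input signal are all assumptions. The log lines above show a flush from 171000 to 153900 entries, which is 10% of the table. The sketch keeps that 10% as a lower bound and flushes a larger share when most rows carry a new key, since the hash table then aggregates little.

```java
/** Hypothetical policy (not a Hive class): picks how many hash table entries to flush. */
final class FlushSizePolicy {
    static final double MIN_FRACTION = 0.10; // assumed floor: the 10% seen in the log above
    static final double MAX_FRACTION = 1.00; // assumed ceiling: flush the whole table

    /**
     * @param tableSize    entries currently in the hash table
     * @param rowsSeen     input rows processed so far
     * @param distinctKeys distinct keys observed in those rows
     * @return number of entries to flush, between 1 and tableSize (0 for an empty table)
     */
    static int entriesToFlush(int tableSize, long rowsSeen, long distinctKeys) {
        if (tableSize <= 0) {
            return 0;
        }
        // Share of rows that introduced a new key: near 0 means the table
        // aggregates well, near 1 means it hardly reduces the data.
        double ratio = rowsSeen <= 0
                ? 0.0
                : Math.min(1.0, Math.max(0.0, (double) distinctKeys / rowsSeen));
        double fraction = MIN_FRACTION + (MAX_FRACTION - MIN_FRACTION) * ratio;
        long toFlush = Math.max(1L, Math.round(tableSize * fraction));
        return (int) Math.min((long) tableSize, toFlush);
    }

    public static void main(String[] args) {
        System.out.println(entriesToFlush(171000, 1_000_000L, 0L));
        System.out.println(entriesToFlush(171000, 1_000_000L, 1_000_000L));
    }
}
```

With a table of 171000 entries, a ratio of 0 yields 17100 entries to flush, the same 10% as the fixed behavior in the log. A ratio of 1 flushes all 171000.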

      Attachments

        1. 2024-08-02 14.35.46.png
          21 kB
          Ryu Kobayashi
        2. image-2024-08-02-14-37-01-824.png
          21 kB
          Ryu Kobayashi
        3. image-2024-08-02-14-38-45-459.png
          21 kB
          Ryu Kobayashi

            People

              ryu_kobayashi Ryu Kobayashi