Hive
  1. Hive
  2. HIVE-6120

Add GroupBy optimization to eliminate un-needed partial distinct aggregations

    Details

    • Type: Improvement Improvement
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      In most cases, partial distinct aggregation is not needed in map-side groupby. The exception is that with sorted bucketized tables partial distinct aggregation can be done by the mappers in some scenarios, as what is done by GroupByOptimzer.

      Currently, partial distinct aggregation is done in the map-side GroupBy and then shuffle of the partial result is done in the following ReduceSink operator, in cases where they are not needed. This wastes CPU cycles, memory and network bandwidth.

      This optimization eliminates un-needed partial distinct aggregations, which improves performance and reduces memory usage.

      For example,
      EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;

      Before optimization:

                    Group By Operator
                      aggregations:
                            expr: count(DISTINCT value)
                      bucketGroup: false
                      keys:
                            expr: key
                            type: int
                            expr: value
                            type: string
                      mode: hash
                      outputColumnNames: _col0, _col1, _col2
                      Reduce Output Operator
                        key expressions:
                              expr: _col0
                              type: int
                              expr: _col1
                              type: string
                        sort order: ++
                        Map-reduce partition columns:
                              expr: _col0
                              type: int
                        tag: -1
                        value expressions:
                              expr: _col2
                              type: bigint
      

      After optimization:

                    Group By Operator
                      bucketGroup: false
                      keys:
                            expr: key
                            type: int
                            expr: value
                            type: string
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Reduce Output Operator
                        key expressions:
                              expr: _col0
                              type: int
                              expr: _col1
                              type: string
                        sort order: ++
                        Map-reduce partition columns:
                              expr: _col0
                              type: int
                        tag: -1
      
      1. HIVE-6120.1.patch
        81 kB
        Sun Rui
      2. HIVE-6120.2.patch
        113 kB
        Sun Rui

        Activity

        Hide
        Sun Rui added a comment -
        Show
        Sun Rui added a comment - review board entry: https://reviews.apache.org/r/16504/
        Hide
        Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12620770/HIVE-6120.1.patch

        ERROR: -1 due to 3 failed/errored test(s), 4818 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_8
        org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2
        org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3
        

        Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/testReport
        Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 3 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12620770

        Show
        Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12620770/HIVE-6120.1.patch ERROR: -1 due to 3 failed/errored test(s), 4818 tests executed Failed tests: org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_8 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2 org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3 Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 3 tests failed This message is automatically generated. ATTACHMENT ID: 12620770
        Hide
        Sun Rui added a comment -

        Fixed test failures.

        Show
        Sun Rui added a comment - Fixed test failures.
        Hide
        Hive QA added a comment -

        Overall: +1 all checks pass

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12620791/HIVE-6120.2.patch

        SUCCESS: +1 4818 tests passed

        Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/766/testReport
        Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/766/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        

        This message is automatically generated.

        ATTACHMENT ID: 12620791

        Show
        Hive QA added a comment - Overall : +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12620791/HIVE-6120.2.patch SUCCESS: +1 4818 tests passed Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/766/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/766/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase This message is automatically generated. ATTACHMENT ID: 12620791

          People

          • Assignee:
            Sun Rui
            Reporter:
            Sun Rui
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development