HIVE-2206

add a new optimizer for query correlation discovery and optimization

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.0
    • Component/s: Query Processor
    • Labels:
      None
    • Release Note:
      This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into a single job.

      Description

      This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

      Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may be correlated in the ways explained below, in which case they can be executed in a single MR job.

      1. Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
      2. Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
      3. Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
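The three definitions above can be sketched as simple predicates. This is only an illustration, not the optimizer's actual code: the `MRJob` class, its field names, and the example jobs are hypothetical, and "child nodes" is assumed here to mean the jobs listed in an explicit `children` field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRJob:
    """Hypothetical, simplified model of a MapReduce job (not a Hive class)."""
    name: str
    input_tables: frozenset   # names of the relations the job reads
    partition_key: tuple      # columns the job shuffles (partitions) on
    children: tuple = ()      # the "child nodes" named in the JFC definition


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint.
    return bool(a.input_tables & b.input_tables)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job has the same partition key as one of its child nodes.
    return child in job.children and job.partition_key == child.partition_key


# Made-up example: two jobs that both read t1 and shuffle on "key",
# feeding a third job that also shuffles on "key".
join1 = MRJob("join1", frozenset({"t1", "t2"}), ("key",))
agg1 = MRJob("agg1", frozenset({"t1"}), ("key",))
other = MRJob("other", frozenset({"t1"}), ("value",))
top = MRJob("top", frozenset(), ("key",), children=(join1, agg1))
```

In this made-up example, `join1` and `agg1` share table `t1` and the key `key`, so they have both IC and TC, while `other` has only IC with them because it shuffles on a different column.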

      The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.

      1. There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
      2. All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
      3. No self join is involved in those correlated MR jobs.
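As a rough sketch, the three conditions above could be checked as follows. Everything here is an assumption made for illustration: the `Table` and `MRJob` classes, the `kind` labels, and the way a self join is detected (a join job that lists the same table name twice) are hypothetical and are not taken from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical table descriptor."""
    name: str
    intermediate: bool = False  # True if generated by a sub-query


@dataclass(frozen=True)
class MRJob:
    """Hypothetical, simplified model of a MapReduce job (not a Hive class)."""
    name: str
    kind: str                 # "join" or "groupby" for reduce-side operators
    partition_key: tuple
    inputs: tuple = ()        # Table objects the job scans directly
    parents: tuple = ()       # parent MR jobs, as named in condition 1


REDUCE_SIDE_KINDS = {"join", "groupby"}


def is_optimizable(job: MRJob) -> bool:
    # Condition 1: a reduce-side join/aggregation job that has JFC
    # (same partition key) with all of its parent MR jobs.
    if job.kind not in REDUCE_SIDE_KINDS or not job.parents:
        return False
    if any(p.partition_key != job.partition_key for p in job.parents):
        return False
    correlated = (job, *job.parents)
    # Condition 2: every input table is an original input table.
    if any(t.intermediate for j in correlated for t in j.inputs):
        return False
    # Condition 3: no self join among the correlated jobs (assumed here to
    # mean a join job that reads the same table more than once).
    for j in correlated:
        names = [t.name for t in j.inputs]
        if j.kind == "join" and len(names) != len(set(names)):
            return False
    return True


# Made-up example: an aggregation on "key" feeding a join on "key".
agg = MRJob("agg", "groupby", ("key",), inputs=(Table("src"),))
join = MRJob("join", "join", ("key",), inputs=(Table("src1"),), parents=(agg,))
```

With these made-up jobs, `is_optimizable(join)` holds because the join and its parent aggregation shuffle on the same key and read only original tables; changing the join's key, marking an input as intermediate, or listing `src` twice in a join would each make the check fail.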

      The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.

      The current implementation can serve as a framework for correlation-related optimizations. I think that this is better than adding individual optimizers.

      There is further work that can be done in the future to improve this optimizer. Here are three examples.

      1. Support queries that only involve TC;
      2. Support queries in which the input tables of correlated MR jobs include intermediate tables; and
      3. Optimize queries involving self join.

      References:
      Paper and presentation of YSmart.
      Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
      Slides: http://sdrv.ms/UpwJJc

      1. YSmartPatchForHive.patch
        251 kB
        He Yongqiang
      2. HIVE-2206.1.patch.txt
        190 kB
        Yin Huai
      3. HIVE-2206.2.patch.txt
        190 kB
        Yin Huai
      4. HIVE-2206.3.patch.txt
        190 kB
        Yin Huai
      5. HIVE-2206.4.patch.txt
        190 kB
        Yin Huai
      6. HIVE-2206.5.patch.txt
        255 kB
        Yin Huai
      7. HIVE-2206.5-1.patch.txt
        209 kB
        Yin Huai
      8. HIVE-2206.6.patch.txt
        156 kB
        Yin Huai
      9. HIVE-2206.7.patch.txt
        221 kB
        Yin Huai
      10. HIVE-2206.8.r1224646.patch.txt
        219 kB
        Yin Huai
      11. testQueries.2.q
        5 kB
        Yin Huai
      12. HIVE-2206.8-r1237253.patch.txt
        225 kB
        Yin Huai
      13. HIVE-2206.10-r1384442.patch.txt
        341 kB
        Yin Huai
      14. HIVE-2206.11-r1385084.patch.txt
        250 kB
        Yin Huai
      15. HIVE-2206.12-r1386996.patch.txt
        308 kB
        Yin Huai
      16. HIVE-2206.13-r1389072.patch.txt
        500 kB
        Yin Huai
      17. HIVE-2206.14-r1389704.patch.txt
        499 kB
        Yin Huai
      18. HIVE-2206.15-r1392491.patch.txt
        492 kB
        Yin Huai
      19. HIVE-2206.16-r1399936.patch.txt
        492 kB
        Yin Huai
      20. HIVE-2206.17-r1404933.patch.txt
        491 kB
        Yin Huai
      21. HIVE-2206.18-r1407720.patch.txt
        491 kB
        Yin Huai
      22. HIVE-2206.19-r1410581.patch.txt
        508 kB
        Yin Huai
      23. HIVE-2206.20-r1434012.patch.txt
        512 kB
        Yin Huai
      24. HIVE-2206.D11097.1.patch
        756 kB
        Phabricator
      25. HIVE-2206.D11097.2.patch
        586 kB
        Phabricator
      26. HIVE-2206.D11097.3.patch
        581 kB
        Phabricator
      27. HIVE-2206.D11097.4.patch
        581 kB
        Phabricator
      28. HIVE-2206.D11097.5.patch
        582 kB
        Phabricator
      29. HIVE-2206.D11097.6.patch
        698 kB
        Phabricator
      30. HIVE-2206.D11097.7.patch
        725 kB
        Phabricator
      31. HIVE-2206.D11097.8.patch
        868 kB
        Phabricator
      32. HIVE-2206.D11097.9.patch
        923 kB
        Phabricator
      33. HIVE-2206.D11097.10.patch
        980 kB
        Phabricator
      34. HIVE-2206.D11097.11.patch
        7 kB
        Phabricator
      35. HIVE-2206.D11097.12.patch
        1003 kB
        Phabricator
      36. HIVE-2206.D11097.13.patch
        1.03 MB
        Phabricator
      37. HIVE-2206.D11097.14.patch
        1.09 MB
        Phabricator
      38. HIVE-2206.D11097.15.patch
        1.10 MB
        Phabricator
      39. HIVE-2206.D11097.16.patch
        1.20 MB
        Phabricator
      40. HIVE-2206.D11097.17.patch
        1.19 MB
        Phabricator
      41. HIVE-2206.D11097.18.patch
        1.20 MB
        Phabricator
      42. HIVE-2206.D11097.19.patch
        1.20 MB
        Phabricator
      43. HIVE-2206.patch
        1.20 MB
        Yin Huai
      44. HIVE-2206.D11097.20.patch
        46 kB
        Phabricator
      45. HIVE-2206.D11097.21.patch
        5 kB
        Phabricator
      46. HIVE-2206.D11097.22.patch
        5 kB
        Phabricator

          Activity

          Lefty Leverenz added a comment -

          hive.optimize.correlation is documented here: Configuration Properties – hive.optimize.correlation

          Lefty Leverenz added a comment -

          The correlation optimizer is documented here: Correlation Optimizer

          Lefty Leverenz added a comment -

          This added hive.optimize.correlation in HiveConf.java with a description in hive-default.xml.template, so the parameter needs to be documented in the wiki (Configuration Properties).

          Note that HIVE-7362 proposes to change the default for hive.optimize.correlation to true.

          General documentation for the correlation optimizer is covered by HIVE-5130.

          Ashutosh Chauhan added a comment -

          This issue has been fixed and released as part of 0.12 release. If you find further issues, please create a new jira and link it to this one.

          Phabricator added a comment -

          yhuai has closed the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Closed by commit rHIVE1504395 (authored by hashutosh).

          CHANGED PRIOR TO COMMIT
          https://reviews.facebook.net/D11097?vs=39099&id=40161#toc

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          COMMIT
          https://reviews.facebook.net/rHIVE1504395

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          HIVE-5149

          Reviewers: ashutoshc, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=39087&id=39099#toc

          BRANCH
          trunk

          ARCANIST PROJECT
          hive

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Yin Huai added a comment -

          the last patch was for HIVE-5149...

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • Merge remote-tracking branch 'upstream/trunk' into trunk
          • Merge remote-tracking branch 'upstream/trunk' into trunk
          • Merge remote-tracking branch 'upstream/trunk' into trunk
          • HIVE-5149 [jira] ReduceSinkDeDuplication can pick the wrong partitioning columns

          Reviewers: ashutoshc, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35907&id=39087#toc

          BRANCH
          HIVE-5149

          ARCANIST PROJECT
          hive

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Yin Huai added a comment -

          i opened https://issues.apache.org/jira/browse/HIVE-4972 to update code generated by thrift

          Yin Huai added a comment -

          thanks Sergey Shelukhin. I will make the change

          Sergey Shelukhin added a comment -

          When I run thrift compile right now on clean trunk, I get some changes that might be related to this patch e.g.

          --- ql/src/gen/thrift/gen-cpp/queryplan_types.cpp
          +++ ql/src/gen/thrift/gen-cpp/queryplan_types.cpp
          @@ -49,7 +49,9 @@ int _kOperatorTypeValues[] = {
             OperatorType::LATERALVIEWFORWARD,
             OperatorType::HASHTABLESINK,
             OperatorType::HASHTABLEDUMMY,
          -  OperatorType::PTF
          +  OperatorType::PTF,
          +  OperatorType::MUX,
          +  OperatorType::DEMUX
           };
          

          Ashutosh Chauhan Yin Huai do you guys want to update it?

          Phabricator added a comment -

          yhuai has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Please ignore the latest diff.... it is for HIVE-4877...

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-4877

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • HIVE-4877 [jira] In ExecReducer, remove tag from the row which will be passed to the first Operator at the Reduce-side

          Reviewers: ashutoshc, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35841&id=35907#toc

          BRANCH
          HIVE-4877

          ARCANIST PROJECT
          hive

          AFFECTED FILES
          data/files/kv1kv2.cogroup.txt
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop2-ptest #19 (See https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/19/)
          HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization
          (Yin Huai via Ashutosh Chauhan)

          Summary:
          update test results

          Test Plan: EMPTY

          Reviewers: JIRA, ashutoshc

          Reviewed By: ashutoshc

          CC: brock

          Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1504395)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/if/queryplan.thrift
          • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer10.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer11.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer12.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer13.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer14.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer6.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer7.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer8.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer9.q
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
          Hudson added a comment -

          SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #90 (See https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/90/)
          HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization
          (Yin Huai via Ashutosh Chauhan)

          Summary:
          update test results

          This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

          Because Hive translates queries sentence by sentence, it generates a separate MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may be correlated in the ways explained below, in which case they can be executed in a single MR job.

          Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
          Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
          Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
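
          The three definitions above can be read as simple predicates over pairs of jobs. The following is a minimal sketch, not the optimizer's actual code: the MRJob class, its fields and the table names are hypothetical, and partition keys are compared as plain column-name tuples, which ignores the column mapping a real plan would need.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class MRJob:
    """Hypothetical stand-in for one MR job in the plan tree (not a Hive class)."""
    name: str
    input_tables: FrozenSet[str]
    partition_key: Tuple[str, ...]
    children: List["MRJob"] = field(default_factory=list)


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint.
    return not a.input_tables.isdisjoint(b.input_tables)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job and one of its child nodes share the same partition key.
    is_child = any(c is child for c in job.children)
    return is_child and job.partition_key == child.partition_key


# Two aggregations over the same table, both partitioned on "key",
# feeding a join that is also partitioned on "key".
agg1 = MRJob("agg1", frozenset({"src"}), ("key",))
agg2 = MRJob("agg2", frozenset({"src"}), ("key",))
join = MRJob("join", frozenset({"agg1_out", "agg2_out"}), ("key",), [agg1, agg2])

print(has_input_correlation(agg1, agg2))
print(has_transit_correlation(agg1, agg2))
print(all(has_job_flow_correlation(join, c) for c in join.children))
```

          In this example the two aggregation jobs have both IC and TC with each other, and the join job has JFC with each of them.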

          The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.

          There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
          All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
          No self join is involved in those correlated MR jobs.
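
          As a rough illustration, the three conditions can be combined into a single eligibility check. This is a sketch under stated assumptions rather than the optimizer's real logic: PlanJob, its fields and the kind strings are hypothetical, and conditions 2 and 3 are applied to the parent jobs, which is only one possible reading of "those correlated MR jobs".

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical labels for the two operator kinds the optimizer considers.
REDUCE_SIDE_KINDS = {"reduce_side_join", "reduce_side_aggregation"}


@dataclass
class PlanJob:
    """Hypothetical description of one MR job (not a Hive class)."""
    name: str
    kind: str
    partition_key: Tuple[str, ...]
    # (table name, is_intermediate) pairs; a list so repeated tables stay visible.
    inputs: List[Tuple[str, bool]] = field(default_factory=list)
    parents: List["PlanJob"] = field(default_factory=list)


def reads_only_original_tables(job: PlanJob) -> bool:
    return all(not is_intermediate for _, is_intermediate in job.inputs)


def has_self_join(job: PlanJob) -> bool:
    # Treated here as a join job that reads the same table more than once.
    names = [name for name, _ in job.inputs]
    return job.kind == "reduce_side_join" and len(names) != len(set(names))


def is_candidate(job: PlanJob) -> bool:
    if job.kind not in REDUCE_SIDE_KINDS or not job.parents:
        return False
    # Condition 1: JFC (same partition key) with all parent MR jobs.
    if any(p.partition_key != job.partition_key for p in job.parents):
        return False
    # Condition 2: the correlated jobs read only original input tables.
    if not all(reads_only_original_tables(p) for p in job.parents):
        return False
    # Condition 3: no self join among the correlated jobs.
    return not any(has_self_join(p) for p in job.parents)


left = PlanJob("agg_left", "reduce_side_aggregation", ("key",), [("src", False)])
right = PlanJob("agg_right", "reduce_side_aggregation", ("key",), [("src1", False)])
top = PlanJob(
    "join", "reduce_side_join", ("key",),
    [("agg_left_out", True), ("agg_right_out", True)],
    [left, right],
)
print(is_candidate(top))
```

          Changing the partition key of either parent, pointing a parent at an intermediate table, or making a parent a self join causes the check to reject the job.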

          The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and it can leverage the existing component for generating MR jobs.

          The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.

          Several pieces of work can be done in the future to improve this optimizer. Here are three examples.

          Support queries that involve only TC;
          Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
          Optimize queries involving self join.

          References:
          Paper and presentation of YSmart.
          Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
          Slides: http://sdrv.ms/UpwJJc

          Test Plan: EMPTY

          Reviewers: JIRA, ashutoshc

          Reviewed By: ashutoshc

          CC: brock

          Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1504395)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/if/queryplan.thrift
          • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer10.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer11.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer12.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer13.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer14.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer6.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer7.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer8.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer9.q
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
          Prasanth Jayachandran added a comment -

          Finally (after 2 yrs) it's in! Great job Yin Huai

          Yin Huai added a comment -

          Thanks Edward

          Edward Capriolo added a comment -

          Pat yourself on the back and take a well earned rest!

          Yin Huai added a comment -

          Thank you, guys!

          Thanks Ashutosh and Gunther for your help! This patch has been improved a lot. I really appreciate your comments and I really enjoy our discussions

          Brock Noland added a comment -

          Nice work Yin! This was one hell of an effort!

          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-h0.21 #2205 (See https://builds.apache.org/job/Hive-trunk-h0.21/2205/)
          HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization
          (Yin Huai via Ashutosh Chauhan)

          Show
          Hudson added a comment - FAILURE: Integrated in Hive-trunk-h0.21 #2205 (See https://builds.apache.org/job/Hive-trunk-h0.21/2205/ ) HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization (Yin Huai via Ashutosh Chauhan) Summary: update test results This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart ( http://ysmart.cse.ohio-state.edu/ ). The paper and slides of YSmart are linked at the bottom. Since Hive translates queries in a sentence by sentence fashion, for every operation which may need to shuffle the data (e.g. join and aggregation operations), Hive will generate a MapReduce job for that operation. However, for those operations which may need to shuffle the data, they may involve correlations explained below and thus can be executed in a single MR job. Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint; Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key; Job Flow Correlation: An MR has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node. The current implementation of correlation optimizer only detect correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map only aggregation). A query will be optimized if it satisfies following conditions. There exists a MR job for reduce-side join operator or reduce side aggregation operator which have JFC with all of its parents MR jobs (TCs will be also exploited if JFC exists); All input tables of those correlated MR job are original input tables (not intermediate tables generated by sub-queries); and No self join is involved in those correlated MR jobs. 
          The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.

          The current implementation can serve as a framework for correlation-related optimizations. I think that this is better than adding individual optimizers.

          Several pieces of work could be done in the future to improve this optimizer. Here are three examples:
          • Support queries that only involve TC;
          • Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
          • Optimize queries involving self joins.

          References: Paper and presentation of YSmart.
          Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
          Slides: http://sdrv.ms/UpwJJc

          Test Plan: EMPTY
          Reviewers: JIRA, ashutoshc
          Reviewed By: ashutoshc
          CC: brock
          Differential Revision: https://reviews.facebook.net/D11097
          (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1504395 )

          /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          /hive/trunk/conf/hive-default.xml.template
          /hive/trunk/ql/if/queryplan.thrift
          /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer10.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer11.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer12.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer13.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer14.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer6.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer7.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer8.q
          /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer9.q
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
          /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
          Gunther Hagleitner added a comment -

          Yin Huai Thanks for sticking with this. This is awesome!

          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Yin!

          Edward Capriolo added a comment -

          Ok this patch has been +1 multiple times. Was even +1 a year ago. Can we commit now? I would hate to have to see Yin rebase again.

          Hive QA added a comment -

          Overall: +1 all checks pass

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12592900/HIVE-2206.patch

          SUCCESS: +1 all tests passed

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/71/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/71/console

          Messages:
          Executing org.apache.hive.ptest.execution.CleanupPhase
          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase

          This message is automatically generated.

          Yin Huai added a comment -

          try to trigger the precommit test.

          Phabricator added a comment -

          ashutoshc has accepted the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          +1 Awesome work, Yin!
          Beautiful ascii art too : ) Finally some great comments in code. : )

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130716

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Yin Huai added a comment -

          address Ashutosh's comments

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • address Ashutosh's comments

          Reviewers: ashutoshc, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35721&id=35841#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer12.q
          ql/src/test/queries/clientpositive/correlationoptimizer13.q
          ql/src/test/queries/clientpositive/correlationoptimizer14.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171 table should not be null here. I will throw an exception when we have "table==null".
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153 Since CommonJoinTaskDispatcher is in the physical optimization phase, it seems that we cannot refactor this part of the code in an easy way. I suggest refactoring it in a follow-up jira.

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130711

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Another check point. Will finish soon and generate new patch

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250 this function is not needed. I have deleted it.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590 it is not used. I have deleted it.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630 I have deleted it.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647 I have deleted it. We can extend the scope of this optimizer in a follow-up jira.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:45 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79 Done
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83 Done

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130711

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Have addressed some comments. Will address the rest of comments later.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 These OIs are not needed. I have removed them.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 JoinOperator relies on the tag to function correctly. I will add a comment to explain why we need to revert the newTag to oldTag.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:114 Done.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:137 Done.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:150 Done
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 Done
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:41 Done
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 Done
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 Done
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:135 Done
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Yes, I have changed it to numParents = getNumParent();
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:222 Done. Since there is another check in initializeOp, I will throw the exception there.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java:229 Done
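
The DemuxOperator.java:61 reply above says that join operators rely on the tag, so the new tag has to be reverted to the old one before a row is forwarded. The following is a minimal, simplified sketch of that idea and not the code in the patch: the class `DemuxSketch`, the nested `Child` class, and the methods `register` and `process` are all hypothetical names chosen for illustration. It assumes each input of the merged job gets a unique new tag, and that every downstream operator still expects the tag it would have seen in its own separate job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration only; not Hive's DemuxOperator. */
public class DemuxSketch {

    /** Hypothetical downstream operator that records the tags it receives. */
    static class Child {
        final String name;
        final List<Integer> seenTags = new ArrayList<>();

        Child(String name) {
            this.name = name;
        }

        void process(Object row, int tag) {
            seenTags.add(tag);
        }
    }

    // New tag (unique within the merged job) -> tag the child originally expected.
    private final Map<Integer, Integer> newTagToOldTag = new HashMap<>();
    // New tag -> child operator that should receive the row.
    private final Map<Integer, Child> newTagToChild = new HashMap<>();

    void register(int newTag, int oldTag, Child child) {
        newTagToOldTag.put(newTag, oldTag);
        newTagToChild.put(newTag, child);
    }

    /** Routes a row to the right child and restores the tag that child expects. */
    void process(Object row, int newTag) {
        Child child = newTagToChild.get(newTag);
        if (child == null) {
            throw new IllegalArgumentException("Unknown tag: " + newTag);
        }
        child.process(row, newTagToOldTag.get(newTag));
    }

    public static void main(String[] args) {
        Child join = new Child("join");       // expects old tags 0 and 1
        Child groupBy = new Child("groupBy"); // expects old tag 0
        DemuxSketch demux = new DemuxSketch();
        demux.register(0, 0, join);
        demux.register(1, 1, join);
        demux.register(2, 0, groupBy);

        demux.process("a", 1);
        demux.process("b", 2);
        System.out.println(join.seenTags + " " + groupBy.seenTags); // prints [1] [0]
    }
}
```

In this sketch the group-by child receives tag 0 even though its rows arrived with new tag 2, which is the kind of tag restoration the reply describes.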

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130711

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Add an explanation on startGroup. Will start to address the rest of comments tomorrow.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java:334 Since we can have an operator tree with multiple JoinOperators and GroupByOperators inside, we need to propagate startGroup to all operators in the operator tree. For queries which are not optimized by this patch, we can have at most 1 JoinOperator (at the beginning of the reduce side) and 2 GroupByOperators (1 at the beginning of the reduce side and 1 hash-mode one just before the FileSinkOperator). This change will not affect those operators.
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:710 Please see my reply to the same change made in CommonJoinOperator.
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:180 It seems that an enum does not have a method that returns its values as a list of strings.
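
The startGroup explanation above can be illustrated with a small sketch. This is not Hive's Operator class: `StartGroupSketch`, `Op`, `addChild`, and `groupsStarted` are hypothetical names, and the sketch assumes a plain tree in which every operator has a single parent. It only shows why a group-start signal must be forwarded to children once several join and group-by operators share one reduce-side tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration only; not Hive's Operator hierarchy. */
public class StartGroupSketch {

    /** Hypothetical operator node that counts how many key groups it was told about. */
    static class Op {
        final String name;
        final List<Op> children = new ArrayList<>();
        int groupsStarted = 0;

        Op(String name) {
            this.name = name;
        }

        Op addChild(Op child) {
            children.add(child);
            return this;
        }

        /** Handles the signal locally, then forwards it to every child operator. */
        void startGroup() {
            groupsStarted++;
            for (Op child : children) {
                child.startGroup();
            }
        }
    }

    public static void main(String[] args) {
        // Demux -> Join -> GroupBy, all in the same reduce-side operator tree.
        Op groupBy = new Op("GroupBy");
        Op join = new Op("Join").addChild(groupBy);
        Op demux = new Op("Demux").addChild(join);

        demux.startGroup(); // first key group
        demux.startGroup(); // second key group
        System.out.println(groupBy.groupsStarted); // prints 2
    }
}
```

Without the forwarding loop, only the root of the tree would learn that a new key group has started, and the nested group-by would keep accumulating across groups.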

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130711

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          ashutoshc has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          A few more comments. See which of these apply. If they don't apply, feel free to ignore.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250 What does this function do?
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171 It will be good to add a comment stating when table == null.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153 It seems like a lot of the logic here is shared with CommonJoinTaskDispatcher. It will be good to have that refactored so that it's reusable here.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284 Seems like this method always returns true. So, this is not required.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271 Add a comment saying that tree walking is done and now you will apply transformations which you have detected.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590 Do we really need the hasBeenRemoved() check?
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597 getKeyCols().size() is not a good check. I recommend testing explicitly for the operators which we are supporting right now.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630 Do we still need this?
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647 We should do jobFlowCorrelation as another pass in transform().
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526 It will be good to add some ascii art which shows what tree structure we are returning from this function.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368 It will be good to add javadoc for this method.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453 Didn't understand this comment. Probably we can word it better.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 I don't think we need to revert to oldTag here. We can keep using newTag.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 Doesn't look like you are using these OIs. Probably we can get rid of these.
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 It will be good to add comments on what the intent of this for loop is.
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Why is it called NumOriginalParents? Can it be just numOfParents?
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 There is already a forward in Demux, this should not be needed.
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 You don't need this constructor.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83 Looks like this map is not used anymore, let's get rid of this.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79 It will be good to add comments about what this method is intended to do.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67 This method straight away calls another method. We can eliminate this one.

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130711

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          ashutoshc has requested changes to the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Minor comments, mostly around improving documentation in code.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java:334 Does this patch make this necessary, or did you add it just for completeness?
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:114 Better to do it as List<Object> thisRow = (List<Object>) row;
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:137 Will be good to add comments for all these maps. What mappings are they tracking?
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:150 Will be good to add some ASCII art showing an example of such a plan.
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:710 Is this necessary?
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:41 I understand this, but it will be confusing for someone reading this comment for the first time, because before this patch the RS operator was always on the map side. We need to reword this so it's easier to read.
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:135 Can you add a comment on when this boolean will be true and when it will be false?
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:222 Let's throw an exception here: if (childOperatorsArray.length != 1) throw new HiveException("Expected number of children is 1. Found : " + childOperatorsArray.length);
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:180 This should not be required. You can always get all the values of an enum by calling the values() method on the enum.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java:229 It will be good to add javadoc for this explaining why we should leave it as it is.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:45 It will be good to add javadoc for this class.
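
The Utilities.java:180 remark above is about listing every constant of an enum. As a minimal sketch of the standard Java behaviour (the enum `SketchField` and the class name are hypothetical and are not taken from the patch): the compiler-generated static `values()` returns all constants in declaration order, while `valueOf(String)` resolves a single constant by name.

```java
import java.util.ArrayList;
import java.util.List;

public class EnumValuesSketch {
    // Hypothetical enum used only for illustration; not the enum in Hive's Utilities.java.
    public enum SketchField { KEY, VALUE, TAG }

    // values() returns every constant, in declaration order.
    public static List<String> allNames() {
        List<String> names = new ArrayList<>();
        for (SketchField f : SketchField.values()) {
            names.add(f.name());
        }
        return names;
    }

    // valueOf(String) returns exactly one constant, or throws
    // IllegalArgumentException if no constant has that name.
    public static SketchField lookup(String name) {
        return SketchField.valueOf(name);
    }

    public static void main(String[] args) {
        System.out.println(allNames());      // all constants
        System.out.println(lookup("VALUE")); // a single constant
    }
}
```

Because `values()` is generated for every enum type, a hand-maintained list of the constants is usually unnecessary.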

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          BRANCH
          HIVE-2206-3671-20130711

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-20130711
          • Left semi join should be handled in analyzeReduceSinkOperatorsOfJoinOperator. Also, use instanceof instead of using the operator's name to check the type of an Operator.
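
The second bullet above prefers `instanceof` to comparing operator names. A minimal sketch of why a type check can be more robust, assuming a hypothetical operator hierarchy (`Op`, `JoinOp`, `SpecialJoinOp`, and `GroupByOp` are illustrative stand-ins and are not Hive's real classes): a subclass may report a different name, so a string comparison misses it, while `instanceof` still matches.

```java
public class OperatorTypeCheckSketch {
    // Hypothetical stand-ins for operator classes; not Hive's Operator hierarchy.
    public abstract static class Op {
        public abstract String getName();
    }

    public static class JoinOp extends Op {
        @Override public String getName() { return "JOIN"; }
    }

    // A subclass of JoinOp that reports its own name.
    public static class SpecialJoinOp extends JoinOp {
        @Override public String getName() { return "SPECIALJOIN"; }
    }

    public static class GroupByOp extends Op {
        @Override public String getName() { return "GBY"; }
    }

    // Name-based check: depends on every subclass returning the same string.
    public static boolean isJoinByName(Op op) {
        return "JOIN".equals(op.getName());
    }

    // Type-based check: true for JoinOp and any of its subclasses.
    public static boolean isJoinByType(Op op) {
        return op instanceof JoinOp;
    }

    public static void main(String[] args) {
        Op op = new SpecialJoinOp();
        System.out.println(isJoinByName(op)); // name differs from "JOIN"
        System.out.println(isJoinByType(op)); // still a JoinOp by type
    }
}
```

The type check also lets the compiler catch a renamed or removed class, which a string literal cannot do.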

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35661&id=35721#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer12.q
          ql/src/test/queries/clientpositive/correlationoptimizer13.q
          ql/src/test/queries/clientpositive/correlationoptimizer14.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • Since Hive already uses a single scan for a table with multiple aliases in an MR job, we can remove the unnecessary code for merging TableScanOperators

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35487&id=35661#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer12.q
          ql/src/test/queries/clientpositive/correlationoptimizer13.q
          ql/src/test/queries/clientpositive/correlationoptimizer14.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • update test cases
          • refactor code

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35283&id=35487#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer12.q
          ql/src/test/queries/clientpositive/correlationoptimizer13.q
          ql/src/test/queries/clientpositive/correlationoptimizer14.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          ql/src/test/results/clientpositive/correlationoptimizer13.q.out
          ql/src/test/results/clientpositive/correlationoptimizer14.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Show
          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • The correlation optimizer currently does not support the PTF operator

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35265&id=35283#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer12.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer12.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Improvement
          • update test cases
          • refactoring.
          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Improvement

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35223&id=35265#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Handle partitioned tables

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35193&id=35223#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeConstantDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer11.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer11.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          My last diff was for 4718...

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35181&id=35193#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Show
          Phabricator added a comment - yhuai updated the revision " HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization". My last diff was for 4718... Reviewers: JIRA REVISION DETAIL https://reviews.facebook.net/D11097 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D11097?vs=35181&id=35193#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/if/queryplan.thrift ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java ql/src/test/queries/clientpositive/correlationoptimizer1.q ql/src/test/queries/clientpositive/correlationoptimizer10.q ql/src/test/queries/clientpositive/correlationoptimizer2.q ql/src/test/queries/clientpositive/correlationoptimizer3.q ql/src/test/queries/clientpositive/correlationoptimizer4.q ql/src/test/queries/clientpositive/correlationoptimizer5.q ql/src/test/queries/clientpositive/correlationoptimizer6.q ql/src/test/queries/clientpositive/correlationoptimizer7.q ql/src/test/queries/clientpositive/correlationoptimizer8.q ql/src/test/queries/clientpositive/correlationoptimizer9.q ql/src/test/results/clientpositive/correlationoptimizer1.q.out ql/src/test/results/clientpositive/correlationoptimizer10.q.out ql/src/test/results/clientpositive/correlationoptimizer2.q.out ql/src/test/results/clientpositive/correlationoptimizer3.q.out ql/src/test/results/clientpositive/correlationoptimizer4.q.out ql/src/test/results/clientpositive/correlationoptimizer5.q.out ql/src/test/results/clientpositive/correlationoptimizer6.q.out ql/src/test/results/clientpositive/correlationoptimizer7.q.out ql/src/test/results/clientpositive/correlationoptimizer8.q.out ql/src/test/results/clientpositive/correlationoptimizer9.q.out ql/src/test/results/compiler/plan/groupby2.q.xml ql/src/test/results/compiler/plan/groupby3.q.xml To: JIRA, yhuai Cc: brock
          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          I ran all unit tests before the commit of HIVE-4496. All unit tests pass.

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=35055&id=35181#toc

          AFFECTED FILES
          build-common.xml
          data/files/leftsemijoin_mr_t1.txt
          data/files/leftsemijoin_mr_t2.txt
          ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
          ql/src/test/queries/clientpositive/leftsemijoin_mr.q
          ql/src/test/results/clientpositive/leftsemijoin_mr.q.out

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • fix bugs related to (1) semi join and (2) multiple joins where the DemuxOperator directly connects to a MuxOperator
          • update

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34869&id=35055#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer10.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer10.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          a few minor updates

          • add new tests
          • remove unused methods
          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Improvement
          • add Apache license header
          • evaluate if this operator is UnionOperator first

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34791&id=34869#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          All tests pass

          • new tests
          • Fix a bug related to UnionOperator;
          • wip
          • more comments in tests
          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Improvement
          • update comments
          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Improvement

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34653&id=34791#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/queries/clientpositive/correlationoptimizer7.q
          ql/src/test/queries/clientpositive/correlationoptimizer8.q
          ql/src/test/queries/clientpositive/correlationoptimizer9.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/clientpositive/correlationoptimizer7.q.out
          ql/src/test/results/clientpositive/correlationoptimizer8.q.out
          ql/src/test/results/clientpositive/correlationoptimizer9.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Improvement
          • make optimized plans deterministic
          • add a new test and update results

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34623&id=34653#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • fix the mechanism of processing a key group
          • bug fix
          • refactoring code. make output of optimized plans deterministic. add new
          • use spaces for indentation.

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34413&id=34623#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • check isTraceEnabled before constructing a trace message
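The guard named in this bullet can be sketched as follows. This is a minimal illustration, not code from the patch: `TraceLog` is a hypothetical stand-in for a logger that exposes `isTraceEnabled()` and `trace(...)`, and `describeRow` and `process` are made-up helpers. The point is that the message string is only built when trace output is actually enabled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TraceGuardExample {
    /** Hypothetical stand-in for a logger exposing isTraceEnabled() and trace(). */
    static final class TraceLog {
        private final boolean traceEnabled;
        final List<String> messages = new ArrayList<>();

        TraceLog(boolean traceEnabled) {
            this.traceEnabled = traceEnabled;
        }

        boolean isTraceEnabled() {
            return traceEnabled;
        }

        void trace(String message) {
            if (traceEnabled) {
                messages.add(message);
            }
        }
    }

    /** Counts how often the (potentially expensive) message was constructed. */
    static int buildCount = 0;

    /** Hypothetical helper that builds a trace message for one row. */
    static String describeRow(Object[] row) {
        buildCount++;
        return "processing row " + Arrays.toString(row);
    }

    static void process(TraceLog log, Object[] row) {
        // Guard: skip message construction entirely when tracing is off.
        if (log.isTraceEnabled()) {
            log.trace(describeRow(row));
        }
    }

    public static void main(String[] args) {
        Object[] row = {"key", 1};
        process(new TraceLog(false), row); // message is not built
        process(new TraceLog(true), row);  // message is built once
        System.out.println("messages built: " + buildCount);
    }
}
```

Without the guard, the string concatenation would run for every row even when the trace level is disabled, which matters in per-row operator code.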

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34401&id=34413#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          address brock's comments

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34383&id=34401#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          brock has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 Looks like it's there because ArrayList defines a clone() method.
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 I agree that hive does this often. I don't mean to suggest you should fix this in all of hive in the patch but let's not add any additional printStackTraces. I see one additional new printStackTrace in your patch. Would you mind removing that one as well?
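The point about not adding `printStackTrace` calls can be illustrated with a small sketch. It is not taken from the patch: `OperatorException`, `flush`, and this `closeOp` are hypothetical names used only for illustration. When an exception is rethrown with its cause attached, printing it at the catch site adds nothing, because the caller still receives the full stack trace and can log it once.

```java
import java.io.IOException;

public class RethrowExample {
    /** Hypothetical checked exception standing in for an operator-level error type. */
    static class OperatorException extends Exception {
        OperatorException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical flush step that can fail. */
    static void flush(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("flush failed");
        }
    }

    static void closeOp(boolean fail) throws OperatorException {
        try {
            flush(fail);
        } catch (IOException e) {
            // No e.printStackTrace() here: the cause travels with the
            // rethrown exception, so the caller decides where to log it.
            throw new OperatorException("closeOp failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            closeOp(true);
        } catch (OperatorException e) {
            System.out.println(e.getMessage() + ", cause: " + e.getCause().getMessage());
        }
    }
}
```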

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          yhuai has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 I did not notice it before. I copied the code of closeOp. I do not think we need to print the exception. I will change this class. Also, it seems that printing the exception appears in lots of other places. If we want to remove all of them, we need a separate JIRA.
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 MuxOperator is used to replace ReduceSinkOperators in an MR job optimized by this optimizer. I basically followed the code of ReduceSinkDesc. It seems that ArrayList is used there for the sake of clone. I will leave ArrayList here for now.

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          To: JIRA, yhuai
          Cc: brock

          Phabricator added a comment -

          brock has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          I was just casually reading this patch and noted a few items.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1088 If we are throwing the exception do we need to print the exception? Also, this should be logged not printed.
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java:61 We should be returning List or Collection, not ArrayList, no? There are a few other occurrences of this.
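The suggestion about return types can be sketched as below. This is only an illustration under assumptions: `KeyColsDesc` and its members are hypothetical and do not reflect the actual fields of MuxDesc or ReduceSinkDesc. It shows a getter declared against the `List` interface, with a copy constructor used to duplicate the list so that `ArrayList.clone()` is not needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListReturnExample {
    /** Hypothetical descriptor; class, field, and method names are illustrative only. */
    static class KeyColsDesc {
        private final List<String> keyCols;

        KeyColsDesc(List<String> keyCols) {
            this.keyCols = keyCols;
        }

        // Callers depend on the List interface, not on the concrete ArrayList.
        List<String> getKeyCols() {
            return keyCols;
        }

        @Override
        public KeyColsDesc clone() {
            // The ArrayList copy constructor duplicates the list,
            // so the field itself does not have to be typed as ArrayList.
            return new KeyColsDesc(new ArrayList<>(keyCols));
        }
    }

    public static void main(String[] args) {
        KeyColsDesc original = new KeyColsDesc(new ArrayList<>(Arrays.asList("key", "value")));
        KeyColsDesc copy = original.clone();
        copy.getKeyCols().add("extra"); // does not affect the original
        System.out.println(original.getKeyCols().size() + " vs " + copy.getKeyCols().size());
    }
}
```

Whether this is worth doing in a descriptor that mirrors ReduceSinkDesc is a judgment call; keeping the two classes consistent is a reasonable argument for leaving `ArrayList` in place.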

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          To: JIRA, yhuai
          Cc: brock

          Yin Huai added a comment -

          HIVE-2206.D11097.3.patch is the latest patch. I have fixed bugs found in unit tests when the optimizer is enabled by default. The patch is ready for review.

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Have fixed bugs I found from unit tests when the optimizer is enabled by default.
          Also, refactored the code and updated test results.

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34329&id=34383#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai

          Yin Huai added a comment -

          Updated the diff at https://reviews.facebook.net/D11097. Fixed two bugs. All unit tests pass when the optimizer is turned off by default. I am evaluating whether there is any issue when the optimizer is turned on by default.

          Phabricator added a comment -

          yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Refactoring
          • Merge branch 'HIVE-2206-3671-Refactoring' of https://github.com/yhuai/hive into HIVE-2206-3671-Refactoring
          • end group is not called correctly
          • reorganize testcases
          • Merge remote-tracking branch 'upstream/trunk' into HIVE-2206-3671-Refactoring
          • Bug fix. The logic of comparing if a table is not good to be a big table
          • update results

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D11097?vs=34293&id=34329#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/compiler/plan/groupby2.q.xml
          ql/src/test/results/compiler/plan/groupby3.q.xml

          To: JIRA, yhuai

          Yin Huai added a comment -

          I just found that I need to set both hive.auto.convert.join and hive.auto.convert.join.noconditionaltask to false for RS dedup to work on cases with a join. I tried two cases. It works on

          SELECT x.key AS key, count(1) AS cnt FROM src1 x JOIN src y ON (x.key = y.key) GROUP BY x.key
          

          , and it does work on

          SELECT xx.key, xx.cnt, yy.key, yy.cnt
          FROM
          (SELECT x.a as key, count(*) AS cnt FROM src x group by x.a) xx
          JOIN
          (SELECT y.a as key, count(*) AS cnt FROM src1 y group by y.a) yy
          ON (xx.key=yy.key);
          

          I suggest that we let CorrelationOptimizer handle cases involving joins, because it supports more cases and already includes the needed mechanisms.

          Yin Huai added a comment -

          RS dedup is on by default, so the EXPLAIN output without CorrelationOptimizer should already be optimized by RS dedup. However, it does not seem to fire in any of my cases. I will take a look at it later.

          Ashutosh Chauhan added a comment -

          In your test cases, for some of the patterns (e.g., a join followed by a GBY on the same keys), I assume the reducesink deduplication optimization will already take care of them and generate only one MR job. Is that correct? Or is it that reducesink dedup will not fire for any of your test cases? If it's the former, then it would be good to identify which of those cases are already handled by RS dedup. If it's the latter, then it would be good to know why the reducesink dedup optimization is not kicking in for them.

          Yin Huai added a comment -

          HIVE-2206.D11097.1.patch is the latest patch for the trunk. I have heavily refactored my code. Here are the major changes.

          1. If multiple operation paths share the same input table, I use a single TableScanOperator and add the bottom operators of these paths as children of this common TableScanOperator. I do not deduplicate common columns, because deduplication would make the code significantly more complicated and may introduce more problems. If we want to do deduplication, I suggest tackling it later as follow-up work.
          2. Without deduplicating columns, the dispatcher on the reduce side has less work to do, and some queries involving self joins can be optimized in the current version.
          3. The fake ReduceSinkOperator (CorrelationLocalSimulativeReduceSinkOperator; I will change the name later) no longer does the serialization and deserialization that the previous version did.
          4. New test cases are added.
          5. I also refactored ReduceSinkDeDuplication, since CorrelationOptimizer can reuse some methods introduced by ReduceSinkDeDuplication. Navis, can you take a look and see if my changes make sense?
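
          To illustrate change 1, here is a minimal Python sketch of sharing one table scan among several operation paths. It is not the Hive implementation: the `Op` class, the `share_table_scans` function, and the operator names are hypothetical stand-ins for TableScanOperator and the bottom operators of each path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Op:
    """Hypothetical stand-in for a Hive operator in the plan tree."""
    name: str
    children: List["Op"] = field(default_factory=list)


def share_table_scans(paths: List[Tuple[str, Op]]) -> List[Op]:
    """Create one scan node per input table and hang every path's
    bottom operator under the scan of the table it reads.

    `paths` is a list of (table_name, bottom_operator) pairs.
    No column deduplication is attempted: every child sees the full row.
    """
    scans: Dict[str, Op] = {}
    for table, bottom in paths:
        if table not in scans:
            scans[table] = Op(f"TS[{table}]")
        scans[table].children.append(bottom)
    # dicts keep insertion order, so scans come back in first-seen order
    return list(scans.values())


if __name__ == "__main__":
    plan = share_table_scans([
        ("src", Op("GBY_1")),
        ("src1", Op("GBY_2")),
        ("src", Op("JOIN_1")),
    ])
    for scan in plan:
        print(scan.name, [child.name for child in scan.children])
```

          In this sketch the two paths that read `src` end up as children of the same scan node, which is also why a self join no longer needs two separate scans of the same table.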

          I will run all unit tests soon and will also add more comments.

          By the way, there is an issue in correlationoptimizer2.q: for outer joins, optimized plans cannot generate rows in which both join keys (from the left table and the right table) are null. I am looking into it.

          Phabricator added a comment -

          yhuai requested code review of "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

          Reviewers: JIRA

          update test results

          This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

          Since Hive translates queries in a sentence by sentence fashion, for every operation which may need to shuffle the data (e.g. join and aggregation operations), Hive will generate a MapReduce job for that operation. However, for those operations which may need to shuffle the data, they may involve correlations explained below and thus can be executed in a single MR job.

          Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
          Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
          Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
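
          The three definitions above can be written as simple predicates. The following Python sketch is only an illustration under stated assumptions: `MRJob` is a hypothetical record, partition keys are compared as plain tuples of already-canonicalized column names, and the caller is responsible for passing an actual child node to the JFC check.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MRJob:
    """Hypothetical description of one MapReduce job in a query plan."""
    name: str
    inputs: FrozenSet[str]          # relations the job reads
    partition_key: Tuple[str, ...]  # columns used to shuffle map output


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job and its child node use the same partition key
    return job.partition_key == child.partition_key


if __name__ == "__main__":
    agg1 = MRJob("agg1", frozenset({"src"}), ("key",))
    agg2 = MRJob("agg2", frozenset({"src", "src1"}), ("key",))
    join = MRJob("join", frozenset({"agg1", "agg2"}), ("key",))
    print(has_input_correlation(agg1, agg2))    # both read src
    print(has_transit_correlation(agg1, agg2))  # and shuffle on the same key
    print(has_job_flow_correlation(join, agg1))
```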

          The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.

          There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
          All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
          No self join is involved in those correlated MR jobs.

          Correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.

          The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.

          Several things can be done in the future to improve this optimizer. Here are three examples.

          Support queries that only involve TC;
          Support queries in which the input tables of correlated MR jobs include intermediate tables; and
          Optimize queries involving self joins.

          References:
          Paper and presentation of YSmart.
          Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
          Slides: http://sdrv.ms/UpwJJc

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D11097

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/if/queryplan.thrift
          ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/correlationoptimizer1.q
          ql/src/test/queries/clientpositive/correlationoptimizer2.q
          ql/src/test/queries/clientpositive/correlationoptimizer3.q
          ql/src/test/queries/clientpositive/correlationoptimizer4.q
          ql/src/test/queries/clientpositive/correlationoptimizer5.q
          ql/src/test/queries/clientpositive/correlationoptimizer6.q
          ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          ql/src/test/results/clientpositive/correlationoptimizer6.q.out

          To: JIRA, yhuai

          Yin Huai added a comment -

          Ashutosh Chauhan, have you had time to look at the patch?

          Ashutosh Chauhan added a comment -

          Oh.. I see that you have already updated RB and jira. I will take a look at it soon.

          Ashutosh Chauhan added a comment -

          I am having second thoughts on cloning. Cloning graphs (like the query plan) or dense structures (like ParseContext) is fraught with peril. It's likely that cloning will require new code that arguably has hard-to-detect bugs, since we need to track down every single pointer and clone all the way through. To avoid such issues, and for simplicity, I think we can drop the cloning idea. The feature is behind a config option that is off by default anyway, so the query plan will be modified only for users who turn the flag on.
          Yin, if you have addressed my other comments, can you update the patch on RB and upload it here on JIRA? I will take another look at it.

          Ashutosh Chauhan added a comment -

          I ran my tests with HIVE-3784, and my query (i.e., a mapjoin followed by a group-by on different keys) now gets a single MR job. That's cool.

          Yin Huai added a comment -

          Ashutosh Chauhan, the only thing I have not addressed is cloning the plan at the beginning of the optimization work. It seems that we can only clone a part of the query plan tree. If we back up only a part of the query plan tree, I think it is the same as the current patch (described below), since we would still need to link the backed-up part to the other parts if the correlation optimizer cannot optimize the given query.

          In my patch, the original plan tree is first changed to a version without map-side aggregations. This version of the plan tree is changed further only if a correlation is detected and the new merged map-side plan and reduce-side dispatcher have been generated successfully (that is, the query can be optimized). If the optimizer cannot optimize this plan, it changes the plan back to the original one.
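
          The flow described in the previous paragraph (strip map-side aggregation, try to merge, put everything back on failure) can be sketched as follows. This is a toy model, not the patch: a plan is a flat list of operator names, `MAP_GBY` is a hypothetical marker for a map-side aggregation, and `try_merge` is a hypothetical callback that returns a merged plan or `None` when no correlation can be exploited.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]


def strip_map_side_aggregation(plan: Plan) -> Tuple[Plan, List[Tuple[int, str]]]:
    """Remove map-side aggregations and remember where they were."""
    removed = [(i, op) for i, op in enumerate(plan) if op == "MAP_GBY"]
    stripped = [op for op in plan if op != "MAP_GBY"]
    return stripped, removed


def restore(plan: Plan, removed: List[Tuple[int, str]]) -> Plan:
    """Re-insert removed operators at their original positions."""
    restored = list(plan)
    for index, op in removed:  # ascending original indexes
        restored.insert(index, op)
    return restored


def optimize(plan: Plan, try_merge: Callable[[Plan], Optional[Plan]]) -> Plan:
    stripped, removed = strip_map_side_aggregation(plan)
    merged = try_merge(stripped)
    if merged is None:
        # nothing could be merged: hand back the original plan
        return restore(stripped, removed)
    return merged


if __name__ == "__main__":
    original = ["TS", "MAP_GBY", "RS", "GBY", "FS"]
    print(optimize(original, lambda p: None))
    print(optimize(original, lambda p: ["MERGED"] + p))
```

          The point of the sketch is that the undo information is recorded locally, so no full clone of the plan is needed to get back to the original.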

          Any suggestions on how to address this issue? Or did I miss anything that could facilitate the cloning work? I think it would be great if we could clone a ParseContext, so that if an optimization fails, we still have a copy of the original plan.

          Yin Huai added a comment -

          So, if a map join is involved in a plan and its output will be consumed by a subsequent operator, we should leave the map join in the map phase of the subsequent operator instead of cutting the map join into a separate MR job. Is my understanding correct? If so, I think we should change the process of generating map join operators: after generating a map join operator, we should not insert a FileSinkOperator after it. It seems that Rule 11 in org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genMapRedTasks is the rule that generates a separate MR job for a map join.

          Ashutosh Chauhan added a comment -

          I did some testing of this on our use-cases. Let's say you have two tables:

          create table t1 (a int, b int);
          create table t2 (c int, d int);
          select a from ( select * from t1 join t2 on (t1.a = t2.c) )e group by a;
          select a from ( select/*+ MAPJOIN(t2) */ * from t1 join t2 on (t1.a = t2.c) )e group by a;
          

          Now, ysmart is able to optimize the first query fine: it fuses 2 MR jobs into 1 MR job, since the join and the group-by have the same key.
          However, this doesn't work with the 2nd query, which has a mapjoin; it results in 2 MR jobs. This matters because in the map-join case you don't need the join key to be the same as the group-by key. In our use-cases, we have observed that the join and the group-by are rarely on the same key. But in most cases we are able to use a map-join, since the data we are joining with is small enough. The subsequent group-by, which is on a different key, can then happen on the reduce side of this single MR job.
          Any thoughts on how this could be achieved?
          Though, it looks like we don't need ysmart for this use-case, since we don't need to detect any correlation. We can walk the query plan and always fuse a map-side join followed by a group-by (i.e., no need to detect any correlation among inputs or to check whether the join and group-by keys are the same).
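
          A toy Python model of the job counts discussed above may help. It is a deliberate simplification and not how Hive generates tasks: a query is a linear list of `(kind, key)` pairs, where `kind` is `"mapjoin"`, `"join"` (reduce-side), or `"groupby"`, and the function name and its two flags are hypothetical.

```python
from typing import List, Optional, Tuple

Operator = Tuple[str, str]  # (kind, key); kind is "mapjoin", "join" or "groupby"


def count_mr_jobs(ops: List[Operator], fuse_mapjoin: bool, merge_same_key: bool) -> int:
    """Count MR jobs for a linear operator pipeline in a simplified model.

    fuse_mapjoin:   run a map join in the map phase of the next shuffle job
                    instead of giving it a job of its own.
    merge_same_key: let a reduce-side operator reuse the shuffle of the
                    previous reduce-side operator when the keys match.
    """
    jobs = 0
    pending_mapjoin = False               # map join still waiting for a job
    last_shuffle_key: Optional[str] = None
    for kind, key in ops:
        if kind == "mapjoin":
            if fuse_mapjoin:
                pending_mapjoin = True
            else:
                jobs += 1                 # separate map-only job
            last_shuffle_key = None
        else:
            if not (merge_same_key and key == last_shuffle_key):
                jobs += 1                 # a new shuffle is needed
            pending_mapjoin = False       # any waiting map join runs in this job
            last_shuffle_key = key
    if pending_mapjoin:
        jobs += 1                         # trailing map join: one map-only job
    return jobs


if __name__ == "__main__":
    first = [("join", "a"), ("groupby", "a")]
    second = [("mapjoin", "a"), ("groupby", "b")]
    print(count_mr_jobs(first, fuse_mapjoin=False, merge_same_key=False))
    print(count_mr_jobs(first, fuse_mapjoin=False, merge_same_key=True))
    print(count_mr_jobs(second, fuse_mapjoin=False, merge_same_key=True))
    print(count_mr_jobs(second, fuse_mapjoin=True, merge_same_key=True))
```

          In this model, merging on equal keys only helps the first query, while fusing the map join helps the second one regardless of which key the group-by uses.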

          Ashutosh Chauhan added a comment -

          If Yin wants to provide a patch against a stable (or any) branch, that's his choice. But for the patch to get committed, it needs to be committed on trunk first.

          Liu Zongquan added a comment -

          Yin Huai I have a question: why not release a patch against a stable Hive release, e.g., branch hive-0.8-r2? I found that r1410581 is not a stable revision; I can't even get "ant test -Dtestcase=TestCliDriver -Dqfile=show_functions.q -Doverwrite=true" to run through on this revision. If this patch were based on a stable version, especially a stable branch, your work would benefit more people. Just a suggestion.

          Yin Huai added a comment -

          Ashutosh Chauhan Yes. We can extend correlation optimizer to optimize shared scans. Since shared scans can be considered as a kind of "correlation", I think it will be good to optimize those cases in correlation optimizer. Also, thanks for your comments. I will address those asap.

          Ashutosh Chauhan added a comment -

          Also, can this work enable or facilitate implementing the optimization being discussed in HIVE-3773?

          Ashutosh Chauhan added a comment -

          This patch looks useful, especially since having it in will open up other optimization possibilities. I have left some comments on https://reviews.apache.org/r/7126/

          Liu Zongquan added a comment -

          Yin Huai Thanks so much!

          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/)
          HIVE-2206:add a new optimizer for query correlation discovery and optimization (Yin Huai via He Yongqiang) (Revision 1392105)

          Result = ABORTED
          heyongqiang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1392105
          Files :

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
          • /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
          • /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml
          Yin Huai added a comment -

          Liu Zongquan The latest patch was developed based on hive trunk revision 1410581.

          Liu Zongquan added a comment -

          If I plan to merge HIVE-2206 into the hive source code, which branch should I use? Can someone tell me?

          Yin Huai added a comment -

          Carl Steinbach I am not sure whether the unit tests in Hive are comprehensive enough. If they are not, it might be better to turn this optimizer on by default later, after we have tested it with more queries.

          I just ran all unit tests with the correlation optimizer enabled. Even when map-side aggregation is on, the correlation optimizer requires the regular reduce-side aggregation to be generated, so if "cube" or "rollup" is used in the query, error message 10209 (org.apache.hadoop.hive.ql.ErrorMsg.HIVE_GROUPING_SETS_AGGR_NOMAPAGGR) is thrown. It seems HIVE-3508 can solve this issue. Apart from that, a few query plans need to be regenerated because operator ids changed.

          This jira has taken a long time. Can we wrap it up, so that I can start working on the follow-up jiras?

          Yin Huai added a comment -

          I have not looked at the test queries in detail, but as far as I can tell, most queries in the test cases are simple, e.g. they have no sub-queries. There are also a few queries which involve correlations that my current implementation does not cover. For example, the current optimizer will not try to optimize cases involving MapJoin or self-join. We can make this optimizer support more cases in the future.

          Carl Steinbach added a comment -

          I'm surprised that auto_join26 is the only test that fails due to different EXPLAIN output. Is that because this optimization doesn't affect the queries in most tests, or because we don't consistently call EXPLAIN in the tests?

          What is preventing us from enabling this by default right now?

          Yin Huai added a comment -

          I just integrated HIVE-3671 into this patch. At the start, the correlation optimizer predicts whether each join operator will be converted by CommonJoinResolver. If so, it annotates that join operator and ignores it during the rest of the optimization. The prediction can only be made for join operators whose input tables are not intermediate tables. The prediction method is ported from CommonJoinResolver. A test has also been added to correlationoptimizer1.q.
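
          The prediction step described above can be sketched as follows. This is an illustration in Python, not the patch's Java code. JoinOp, will_become_map_join, annotate and small_table_limit are hypothetical names, and the size rule (some input can be streamed while all the other inputs together fit under a size limit) is an assumption about how such a check might work, not a copy of CommonJoinResolver.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JoinOp:
    """Toy join operator: alias -> input size in bytes, None for an intermediate table."""
    name: str
    inputs: Dict[str, Optional[int]]
    skip_correlation: bool = False


def will_become_map_join(join: JoinOp, small_table_limit: int) -> bool:
    """Guess whether the join would later be converted to a map join."""
    sizes = list(join.inputs.values())
    if any(size is None for size in sizes):
        # Sizes of intermediate tables are unknown at this point: no prediction.
        return False
    total = sum(sizes)
    # Assumed rule: one input is streamed, all the others together must be small.
    return any(total - size <= small_table_limit for size in sizes)


def annotate(joins: List[JoinOp], small_table_limit: int) -> List[str]:
    """Mark predicted map joins and return the names of joins left to optimize."""
    for join in joins:
        join.skip_correlation = will_become_map_join(join, small_table_limit)
    return [join.name for join in joins if not join.skip_correlation]


joins = [
    JoinOp("j1", {"t1": 10_000_000_000, "t2": 1_000_000}),      # t2 is small
    JoinOp("j2", {"t1": 10_000_000_000, "t3": 5_000_000_000}),  # both inputs large
    JoinOp("j3", {"tmp": None, "t2": 1_000_000}),               # intermediate input
]
remaining = annotate(joins, small_table_limit=25_000_000)
```

          With these made-up sizes, only j1 is annotated and skipped; j2 and j3 stay candidates for correlation optimization, j3 because no prediction is attempted for intermediate inputs.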

          Namit Jain
          Please take a look at this patch. Let me know if you have any comments.

          Yin Huai added a comment -

          Carl Steinbach
          If the optimizer is enabled by default, then based on my last tests only auto_join26.q is expected to fail, because it will be optimized by the correlation optimizer. Only the query plan differs; the query result of auto_join26.q is still correct. Also, once I finish HIVE-3671 (I am working on it right now), the failure of auto_join26.q should go away.

          Carl Steinbach added a comment -

          @Yin: The correlation optimizer is only enabled for a small set of new CliDriver tests. If I enable the correlation optimizer by default, which of the existing CliDriver tests are expected to fail?

          Yin Huai added a comment -

          Namit Jain
          Sure. I just took a look at the code. It seems that once I have the content summaries of all input tables, I can guess whether the join auto resolver will apply to the join operators on those tables. As far as I know, the existing util functions for retrieving content summaries (called after logical optimization) cannot be used directly here, so I need to write some util functions to get the sizes of input tables. I will start working on this asap. Also, although HIVE-3671 does not seem hard to do, it is not a quick fix. I suggest we track this work in a separate jira.

          Carl Steinbach
          Have you had time to look at the current patch? Any comments?

          Namit Jain added a comment -

          It would be a good idea to get HIVE-3671 into this patch.
          With HIVE-3671, the functionality will be much more useful to the whole community.
          Yin Huai, can you investigate including HIVE-3671 in this patch, and see how much
          work it is? Based on that, we can proceed.

          Carl Steinbach added a comment -

          Thanks!

          He Yongqiang added a comment -

          Okay, I will aim to commit it this weekend or early next week.

          Carl Steinbach added a comment -

          @Yongqiang: Please hold off on committing this for a day. Thanks.

          He Yongqiang added a comment -

          @Carl, keep in mind that you have already had months to comment. So addressing your comments in new jiras may make more sense.

          He Yongqiang added a comment -

          @Carl, you can go ahead and comment; Huai will address your comments in a separate diff.

          Carl Steinbach added a comment -

          @Yongqiang: Can you please hold off on committing while I take another look? Thanks.

          He Yongqiang added a comment -

          +1, I will commit after tests pass.

          Yin Huai added a comment -

          Namit Jain
          Sure. I created the umbrella jira (HIVE-3667) for all work related to correlation optimizer and also created several follow-up jiras as sub-tasks. You can also add other sub-tasks into that jira.

          Namit Jain added a comment -

          Yin Huai, can you file follow-up jiras for the cases that don't work with this optimization?
          It would be good to link them to this jira. Adding them to the wiki would be useful for tracking too.

          Yin Huai added a comment -

          Uploaded a new patch which can be applied to r1404933. Also added the description of this issue.

          However, I do not have permission to add a page to the wiki. Where should I request it?

          Yin Huai added a comment -

          alex gemini
          I do not have a short description right now. Let me write one and create a wiki page.

          alex gemini added a comment -

          Does this jira have a short description? I know that a join followed by a group-by is optimized like a pipeline; what else might we want to add to the wiki?

          Yin Huai added a comment -

          @Namit:
          You can review the latest patch. I removed the first phase and other unnecessary content.

          Yin Huai added a comment -

          I just found that I can remove the first phase of this optimizer. Apparently there were changes in the trunk, so I no longer need to save the original ColumnExprMap and OpParseCtx. I have removed the unnecessary code and am running tests. I will update the patch later.

          He Yongqiang added a comment -

          I will be on vacation this whole week. Given that this is a very big diff, I will keep this open for another week or two for more comments.

          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1711 (See https://builds.apache.org/job/Hive-trunk-h0.21/1711/)
          HIVE-2206:add a new optimizer for query correlation discovery and optimization (Yin Huai via He Yongqiang) (Revision 1392105)

          Result = FAILURE
          heyongqiang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1392105
          Files : identical to the list in the Hive-trunk-hadoop2 #54 notification above.
          Carl Steinbach added a comment -

          I did not see a 24 hours waiting on the bylaw page?

          This is specified in the "minimum length" column in the table that appears in the "Actions" section of the bylaws document. We could definitely make this easier to understand, but all of the other committers already follow the convention that you +1 a patch before committing it, and allow some time to elapse in between those two actions in order to give other people a chance to weigh in.

          Namit Jain added a comment -

          Sorry for jumping in late on this. This is a pretty big feature - can you give me some time to review this as well?

          He Yongqiang added a comment -

          @Carl, I just reverted. I will commit again tomorrow.

          He Yongqiang added a comment -

          I did not see a 24 hours waiting on the bylaw page?

          Carl Steinbach added a comment -

          @Yongqiang: Sorry, but that's not the way it works. You vote +1 first, wait 24 hours, and then commit the patch. This is all covered in the project bylaws. Please revert this patch. Thanks.

          He Yongqiang added a comment -

          @Carl, btw, I did mention a few times in the comments that I am planning to commit this one.

          He Yongqiang added a comment -

          I commented that all tests passed.

          ok, +1.

          Carl Steinbach added a comment -

          @Yongqiang: I don't see a +1 vote in this JIRA. According to the project bylaws (https://cwiki.apache.org/confluence/display/Hive/Bylaws) this patch should not have been committed. Please back this patch out. Thanks.

          He Yongqiang added a comment -

          I just committed. Thanks for the hard work, Yin Huai!

          He Yongqiang added a comment -

          All tests passed for me.

          Yin Huai added a comment -

          I corrected my local configurations related to HBase and checked out HIVE-3507, now all tests pass.

          Yin Huai added a comment -

          updated patch at reviewboard.

          @Carl: Please also see my comments under yours. Thanks.

          Carl Steinbach added a comment -

          @Yin: Please see my comments on reviewboard. Thanks.

          Yin Huai added a comment -

          two new tests + bug fix. This patch is ready to review. Diff r4 in https://reviews.apache.org/r/7126/ is the latest patch.

          Yin Huai added a comment -

          Patch updated: bug fix + 3 test cases.

          He Yongqiang added a comment -

          The current patch looks ok.
          @Carl, please give more specific comments.

          We should agree that big new features should not be enabled by default. That's too risky.

          Yin Huai added a comment -

          Carl:
          The main reason that Yongqiang and I decided to disable this feature by default at first is that we have not yet had a chance to test this optimizer heavily.

          Carl Steinbach added a comment -

          Please explain what is preventing us from enabling this feature by default, e.g. in which cases is it expected not to work, and what are the failure scenarios?

          Based on the current test coverage (not much) I can't tell if it's actually possible to use this feature in its current state.

          Yin Huai added a comment -

          Opened a new review request at https://reviews.apache.org/r/7126/, since I have been working on hive-git.

          Yin Huai added a comment -

          new patch for trunk (revision 1385084). Disabled the optimizer by default and updated test results.

          He Yongqiang:
          can you help me test a few cases which the trunk on my machine cannot pass?
          Those are TestHBaseMinimrCliDriver, TestHBaseCliDriver, TestHBaseNegativeCliDriver, testSynchronized in TestEmbeddedHiveMetaStore, testSynchronized in TestRemoteHiveMetaStore, testSynchronized in TestSetUGIOnBothClientServer, testSynchronized in TestSetUGIOnOnlyClient, testSynchronized in TestSetUGIOnOnlyServer, and testNegativeCliDriver_local_mapred_error_cache in TestNegativeCliDriver. Thanks!

          Yin Huai added a comment -

          The patch is ported to the latest trunk (revision 1384442). I tested this patch with an enabled CorrelationOptimizer (hive.optimize.correlation=true). During the testing, I fixed several bugs and all tests should be ok except those I explained below.

          In TestParse, 42 queries failed because I made several minor changes in SemanticAnalyzer. It seems those results should be updated.

          In TestCliDriver, auto_join26.q fails because it is optimized by the optimizer. Since I will disable the optimizer by default, I will not change this query or its result.

          In TestCliDriver, create_view.q and udaf_percentile_approx.q are two weird queries. If hive.map.aggr=false, the original trunk will also fail. It seems a bug in trunk is involved. I have sent an email to the dev mailing list regarding create_view.q. For udaf_percentile_approx.q, I have got time to look at it in detail.

          In TestCliDriver, join31.q fails. For this case, the query should be updated to have "set hive.optimize.correlation=true". But, since the optimizer is disabled by default, I will not update this query.

          Also, there are some queries which trunk itself cannot pass. These are cascade_dbdrop_hadoop20.q, hbase_binary_external_table_queries.q, hbase_binary_map_queries.q, hbase_binary_storage_queries.q, hbase_joins.q, hbase_ppd_key_range.q, hbase_pushdown.q, hbase_queries.q, local_mapred_error_cache.q, and TestCase TestHBaseMinimrCliDriver.

          I will run all tests again and will fix any bug related to the patch.

          He Yongqiang added a comment -

          For the last few months (almost one year), Yin has been actively maintaining this patch, and I think it is in a very good state to check into trunk.

          So I will do some final review, and hope to commit it sometime next month. Please feel free to jump in to review the patch and put any comments here before the commit.

          In the last review, I will make sure this patch will not have big changes to existing execution path, so it can be simply disabled like other optimizations in Hive. And Yin will still be actively maintaining this patch (help fix bugs etc) after the commit.

          Yin Huai added a comment -

          @anders,
          I will try to make this patch more concise and easier to read than the current version. If I have any thought on the optimization framework, I will comment under HIVE-3027.

          anders added a comment -

          I read the patch and I think Hive's optimizer framework is bad.
          Detail: https://issues.apache.org/jira/browse/HIVE-3027
          Can you improve it?

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/
          -----------------------------------------------------------

          (Updated 2012-02-10 20:49:01.177796)

          Review request for hive.

          Changes
          -------

          Updated patch on revision 1237253. Will generate the patch based on the latest trunk later.

          Summary
          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.
          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs (updated)


          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326
          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326
          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326

          Diff: https://reviews.apache.org/r/2001/diff

          Testing
          -------

          Thanks,

          Yin

          jiraposter@reviews.apache.org added a comment -

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > I've started reviewing this, here's my comments so far. I'll continue to look over it.

          I will update this patch soon.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 453

          > <https://reviews.apache.org/r/2001/diff/4/?file=71297#file71297line453>

          >

          > Does this have to default to false, does anything break if it's true?

          >

          > Similarly, have you tried running the tests with this set to true?

          I have not tried running the tests with this set to true. I will do it when I find a revision which can pass all unit tests (btw, any suggestion on which revision I should use?). In my opinion, since this optimizer is fairly complicated and still being developed, it is safer to default it to false and let users decide when to use it than to default it to true.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, line 101

          > <https://reviews.apache.org/r/2001/diff/4/?file=71299#file71299line101>

          >

          > It's not clear to me why we need both setRowNumber and processOp.

          Since a CorrelationCompositeOperator may have multiple parents, I used a buffer to store the output of the parents of the CorrelationCompositeOperator (see the processOp method). The TableScanOperator triggers the setRowNumber method; the CorrelationCompositeOperator then decides the operationPathTags of the buffered row based on the contents of the buffer and forwards that row to its child. So setRowNumber is used here to evaluate the operationPathTags of the row in the buffer before the CorrelationCompositeOperator gets the new row.
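
          The buffering described in this reply can be illustrated with a minimal sketch. It is not the code in the patch: the class CompositeBufferSketch, its fields, and the string it forwards are all hypothetical, and the real operator works on Hive rows and operator trees rather than plain objects. It only shows the ordering: parents fill the buffer through processOp, and the next setRowNumber call (or close) tags and forwards the buffered row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the buffering idea; not the patch's CorrelationCompositeOperator. */
final class CompositeBufferSketch {
    private Object bufferedRow;          // last row delivered by any parent
    private final boolean[] pathTags;    // pathTags[i] is true if parent i delivered the buffered row
    private long currentRowNumber = -1;  // row number last announced by the table scan
    final List<String> forwarded = new ArrayList<>();

    CompositeBufferSketch(int numParents) {
        this.pathTags = new boolean[numParents];
    }

    /** Called by the table scan before it pushes a new input row down the tree. */
    void setRowNumber(long rowNumber) {
        if (rowNumber != currentRowNumber) {
            flush();
            currentRowNumber = rowNumber;
        }
    }

    /** Called by each parent whose branch lets the current row through. */
    void processOp(Object row, int parentTag) {
        bufferedRow = row;
        pathTags[parentTag] = true;
    }

    /** Forwards the final buffered row once the input is exhausted. */
    void close() {
        flush();
    }

    // Shared helper so that setRowNumber and close do not duplicate the forwarding logic.
    private void flush() {
        if (bufferedRow == null) {
            return; // no parent delivered the previous row, so nothing to forward
        }
        forwarded.add(bufferedRow + " " + Arrays.toString(pathTags));
        bufferedRow = null;
        Arrays.fill(pathTags, false);
    }

    /** Row r0 passes both parents; row r1 passes only parent 1. */
    static List<String> demo() {
        CompositeBufferSketch op = new CompositeBufferSketch(2);
        op.setRowNumber(0);
        op.processOp("r0", 0);
        op.processOp("r0", 1);
        op.setRowNumber(1);
        op.processOp("r1", 1);
        op.close();
        return op.forwarded;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

          In this sketch each row is forwarded once, tagged with every parent that produced it, which is why the tags can only be decided when the next row number arrives.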

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, lines 150-177

          > <https://reviews.apache.org/r/2001/diff/4/?file=71299#file71299line150>

          >

          > Putting this code in a helper method would be better than having it both here and in setRowNumber.

          I will do it.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 274

          > <https://reviews.apache.org/r/2001/diff/4/?file=71300#file71300line274>

          >

          > Does this commented out code need to be kept?

          This commented out code is not needed. I will delete it.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java, line 1337

          > <https://reviews.apache.org/r/2001/diff/4/?file=71303#file71303line1337>

          >

          > I couldn't find a CorrelationFakeReduceSinkOperator class.

          CorrelationLocalSimulativeReduceSinkOperator was previously named CorrelationFakeReduceSinkOperator. I will use the right name in the comment.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java, line 273

          > <https://reviews.apache.org/r/2001/diff/4/?file=71305#file71305line273>

          >

          > Tabs are bad, could you change them to spaces, at least in the new code you're introducing.

          I will change the format of my code. Thanks for letting me know.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 239

          > <https://reviews.apache.org/r/2001/diff/4/?file=71308#file71308line239>

          >

          > I take it from this line it's a requirement that in order for this correlation optimization to be attempted every reduce sink has to be followed only by children with a single child.

          >

          > Could this be relaxed? Could the optimization simply not be applied if there is an operator between two ReduceSinks that has more than one child?

          >

          > Also, if there is a ReduceSink which is not followed by another ReduceSink, but is followed by an operator with more than one child, this prevents the optimization from being used, even though it shouldn't have an effect.

          >

          > Also, regarding checking if the size <=1, if the size <1 the next line will throw an exception.

          Only "assert op.getChildOperators().size() > 0;" is needed here. Thank you for letting me know.

          On 2012-02-10 17:38:09, Kevin Wilfong wrote:

          > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 335

          > <https://reviews.apache.org/r/2001/diff/4/?file=71308#file71308line335>

          >

          > findNextChildReduceSinkOperator can return null, do you need to check for this?

          findNextChildReduceSinkOperator will not return null since its input will not be the last ReduceSinkOperator before the FileSinkOperator. For example, suppose that we have a plan tree like (some operators)->RS1->(some operators)->RS2->(some operators)->FS. The input of findNextChildReduceSinkOperator will not be RS2. I will add an assertion and a comment after this line.
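
          The RS1/RS2 example can be sketched on a toy plan. This is only an illustration under assumptions: the Op class and findNextChildReduceSink below are hypothetical stand-ins, not Hive's Operator class or the method in the patch, and the walk assumes every operator on the path has a single child.

```java
import java.util.ArrayList;
import java.util.List;

final class PlanWalkSketch {
    /** Toy stand-in for a Hive operator; not the real Operator class. */
    static final class Op {
        final String name;
        final boolean isReduceSink;
        final List<Op> children = new ArrayList<>();

        Op(String name, boolean isReduceSink) {
            this.name = name;
            this.isReduceSink = isReduceSink;
        }

        /** Appends a child and returns it so that a chain can be built fluently. */
        Op then(Op child) {
            children.add(child);
            return child;
        }
    }

    /** Returns the first ReduceSink strictly below start, or null if there is none. */
    static Op findNextChildReduceSink(Op start) {
        Op current = start;
        while (!current.children.isEmpty()) {
            current = current.children.get(0); // sketch assumes a single-child chain
            if (current.isReduceSink) {
                return current;
            }
        }
        return null;
    }

    /** Builds TS->RS1->GBY->RS2->SEL->FS and returns {RS1, RS2}. */
    static Op[] buildPlan() {
        Op ts = new Op("TS", false);
        Op rs1 = new Op("RS1", true);
        Op gby = new Op("GBY", false);
        Op rs2 = new Op("RS2", true);
        Op sel = new Op("SEL", false);
        Op fs = new Op("FS", false);
        ts.then(rs1).then(gby).then(rs2).then(sel).then(fs);
        return new Op[] {rs1, rs2};
    }

    public static void main(String[] args) {
        Op[] plan = buildPlan();
        System.out.println(findNextChildReduceSink(plan[0]).name);   // RS2
        System.out.println(findNextChildReduceSink(plan[1]) == null); // true
    }
}
```

          In the toy version, starting from RS2 does yield null, so the non-null guarantee rests entirely on the caller never passing the last ReduceSinkOperator; an assertion at the call site makes that precondition explicit.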

          • Yin

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/#review4912
          -----------------------------------------------------------

          On 2012-01-29 17:56:48, Yin Huai wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/2001/

          -----------------------------------------------------------

          (Updated 2012-01-29 17:56:48)

          Review request for hive.

          Summary

          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.

          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs

          -----

          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326

          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326

          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326

          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326

          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326

          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326

          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326

          Diff: https://reviews.apache.org/r/2001/diff

          Testing

          -------

          Thanks,

          Yin

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/#review4912
          -----------------------------------------------------------

          I've started reviewing this; here are my comments so far. I'll continue to look over it.

          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          <https://reviews.apache.org/r/2001/#comment11010>

          Does this have to default to false? Does anything break if it's true?

          Similarly, have you tried running the tests with this set to true?

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
          <https://reviews.apache.org/r/2001/#comment10818>

          It's not clear to me why we need both setRowNumber and processOp.

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
          <https://reviews.apache.org/r/2001/#comment10817>

          Putting this code in a helper method would be better than having it both here and in setRowNumber.

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
          <https://reviews.apache.org/r/2001/#comment10819>

          Does this commented out code need to be kept?

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          <https://reviews.apache.org/r/2001/#comment10820>

          I couldn't find a CorrelationFakeReduceSinkOperator class.

          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          <https://reviews.apache.org/r/2001/#comment10821>

          Tabs are bad; could you change them to spaces, at least in the new code you're introducing?

          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
          <https://reviews.apache.org/r/2001/#comment10850>

          I take it from this line it's a requirement that in order for this correlation optimization to be attempted every reduce sink has to be followed only by children with a single child.

          Could this be relaxed? Could the optimization simply not be applied if there is an operator between two ReduceSinks that has more than one child?

          Also, if there is a ReduceSink which is not followed by another ReduceSink, but is followed by an operator with more than one child, this prevents the optimization from being used, even though it shouldn't have an effect.

          Also, regarding the check that the size is <= 1: if the size is < 1, the next line will throw an exception.

          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
          <https://reviews.apache.org/r/2001/#comment10851>

          findNextChildReduceSinkOperator can return null; do you need to check for this?
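The two CorrelationOptimizer.java comments above (the single-child requirement and the possible null return) can be illustrated with a small sketch. It is not the patch's code: `Op`, `addChild`, and `findNextChildReduceSink` are hypothetical stand-ins for Hive's operator tree, assumed here only to show a traversal that checks the child count before calling `get(0)` and that reports "no ReduceSink below" as null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for an operator tree; not Hive's Operator API.
public class NextReduceSinkSketch {

    static class Op {
        final String name;
        final boolean isReduceSink;
        final List<Op> children = new ArrayList<>();

        Op(String name, boolean isReduceSink) {
            this.name = name;
            this.isReduceSink = isReduceSink;
        }

        // Attaches a child and returns it, so chains can be built fluently.
        Op addChild(Op child) {
            children.add(child);
            return child;
        }
    }

    /**
     * Walks down from start through single-child operators and returns the
     * first ReduceSink found below it. Returns null when the chain ends
     * (no children) or branches (more than one child) before one is found,
     * so get(0) is never called on an empty child list.
     */
    static Op findNextChildReduceSink(Op start) {
        Op current = start;
        while (current.children.size() == 1) {
            current = current.children.get(0);
            if (current.isReduceSink) {
                return current;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // RS1 -> SEL -> RS2 -> SEL2 -> FS
        Op rs1 = new Op("RS1", true);
        Op rs2 = rs1.addChild(new Op("SEL", false)).addChild(new Op("RS2", true));
        rs2.addChild(new Op("SEL2", false)).addChild(new Op("FS", false));

        Op next = findNextChildReduceSink(rs1);
        System.out.println(next == null ? "none" : next.name);           // prints RS2
        Op afterLast = findNextChildReduceSink(rs2);
        System.out.println(afterLast == null ? "none" : afterLast.name); // prints none
    }
}
```

In this sketch a caller has to handle the null case explicitly; the alternative is to guarantee by construction that the input is never the last ReduceSink before the file sink and to assert that.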

          • Kevin

          On 2012-01-29 17:56:48, Yin Huai wrote:

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/
          -----------------------------------------------------------

          (Updated 2012-01-29 17:56:48)

          Review request for hive.

          Summary
          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.
          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs
          -----

          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326
          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326
          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326

          Diff: https://reviews.apache.org/r/2001/diff

          Testing
          -------

          Thanks,

          Yin

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/
          -----------------------------------------------------------

          (Updated 2012-01-29 17:56:48.704757)

          Review request for hive.

          Changes
          -------

          Make the patch compatible with the latest trunk (revision 1237253).

          Summary
          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.
          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs (updated)
          -----

          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326
          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326
          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326
          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326

          Diff: https://reviews.apache.org/r/2001/diff

          Testing
          -------

          Thanks,

          Yin

          Yin Huai added a comment -

          @Kevin,
          I wrongly assumed that all output names of the ReduceSinkOperator have the structure "KEY/VALUE.internalName". I have solved this issue.
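The patch code behind this fix is not shown in the thread, so the following is only a sketch of how such a naming assumption can fail and how it can be guarded. It assumes, hypothetically, that the name is split on `.`; `internalNameUnchecked` and `internalName` are illustrative helpers, not methods from Hive.

```java
public class ColumnNameSketch {

    /**
     * Fragile variant: assumes every name looks like "KEY.x" or "VALUE.x".
     * For a name without a dot, split() yields a one-element array and
     * index 1 is out of bounds.
     */
    static String internalNameUnchecked(String outputColumnName) {
        return outputColumnName.split("\\.")[1];
    }

    /**
     * Defensive variant: strips a leading "KEY." or "VALUE." when present
     * and otherwise returns the name unchanged.
     */
    static String internalName(String outputColumnName) {
        int dot = outputColumnName.indexOf('.');
        if (dot < 0) {
            return outputColumnName;
        }
        String prefix = outputColumnName.substring(0, dot);
        if (prefix.equals("KEY") || prefix.equals("VALUE")) {
            return outputColumnName.substring(dot + 1);
        }
        return outputColumnName;
    }

    public static void main(String[] args) {
        System.out.println(internalName("VALUE._col0")); // prints _col0
        System.out.println(internalName("_col0"));       // prints _col0
        try {
            internalNameUnchecked("_col0");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked variant failed on a name without a prefix");
        }
    }
}
```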

          However, the current optimizer cannot handle the case where a table is directly connected to a post-computation operator (in this case, table b connects directly to the join operator). I am planning to solve this issue after this patch. As a workaround, you can use:
          SET hive.optimize.reducededuplication=false;
          SET hive.optimize.correlation=true;
          SELECT * FROM (SELECT * FROM src DISTRIBUTE BY key SORT BY key) a JOIN (SELECT * FROM src DISTRIBUTE BY key SORT BY key) b ON a.key = b.key;
          This query will be optimized and executed in a single MapReduce job.
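Why this query qualifies can be sketched with a toy model of the job flow correlation condition from the issue description: a job is a merge candidate when it uses the same partition key as every job that feeds it. The `Job` class and `sharesPartitionKeyWithAllInputs` below are hypothetical, and comparing keys by plain column name (ignoring aliases such as `a.key`) is a simplifying assumption; this is not how the optimizer itself is implemented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class JobFlowCorrelationSketch {

    // Hypothetical model of one MapReduce job: its shuffle keys and the jobs feeding it.
    static class Job {
        final String name;
        final List<String> partitionKeys;
        final List<Job> inputs;

        Job(String name, List<String> partitionKeys, Job... inputs) {
            this.name = name;
            this.partitionKeys = partitionKeys;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /**
     * True when the job has at least one input job and every input job is
     * partitioned on exactly the same keys as the job itself.
     */
    static boolean sharesPartitionKeyWithAllInputs(Job job) {
        if (job.inputs.isEmpty()) {
            return false;
        }
        for (Job input : job.inputs) {
            if (!input.partitionKeys.equals(job.partitionKeys)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> key = Collections.singletonList("key");
        // Both subqueries DISTRIBUTE BY key, and the join is ON a.key = b.key.
        Job a = new Job("subquery a", key);
        Job b = new Job("subquery b", key);
        Job join = new Job("join", key, a, b);
        System.out.println(sharesPartitionKeyWithAllInputs(join)); // prints true

        // If one subquery were distributed by another column, the condition would fail.
        Job c = new Job("subquery c", Collections.singletonList("value"));
        Job join2 = new Job("join", key, a, c);
        System.out.println(sharesPartitionKeyWithAllInputs(join2)); // prints false
    }
}
```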

          Also, I have updated the patch and it is compatible with revision 1237253.

          Yin Huai added a comment -

          @Kevin,
          I will take a look at it.

          Kevin Wilfong added a comment -

          The above bug is a pre-existing issue with reduce sink deduplication.

          The following new exception is produced by the query:

          set hive.optimize.reducededuplication=false;
          explain select * from (select * from src distribute by key sort by key) a join src b on a.key = b.key;

          FAILED: Hive Internal Error: java.lang.ArrayIndexOutOfBoundsException(1)
          java.lang.ArrayIndexOutOfBoundsException: 1
          at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizerUtils.createCorrelationCompositeReducesinkOperaotr(CorrelationOptimizerUtils.java:599)
          at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizerUtils.applyCorrelation(CorrelationOptimizerUtils.java:365)
          at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer.transform(CorrelationOptimizer.java:198)
          at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:100)
          at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7384)
          at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
          at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
          at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
          at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
          at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
          at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
          at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
          at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
          at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

          Kevin Wilfong added a comment -

          Never mind, sorry, it was the distribute by followed by sort by.

          Kevin Wilfong added a comment -

          I tried running

          explain select * from (select * from src distribute by key sort by key) a join src b on a.key = b.key;

          using HIVE-2206.8.r1224646.patch.txt and I get the following exception:

          FAILED: Hive Internal Error: java.lang.ClassCastException(org.apache.hadoop.hive.ql.exec.SelectOperator cannot be cast to org.apache.hadoop.hive.ql.exec.ReduceSinkOperator)
          java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.SelectOperator cannot be cast to org.apache.hadoop.hive.ql.exec.ReduceSinkOperator
          at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer$CorrelationNodeProc.findPeerReduceSinkOperators(CorrelationOptimizer.java:256)
          at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer$CorrelationNodeProc.process(CorrelationOptimizer.java:503)
          at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
          at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
          at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
          at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
          at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer.transform(CorrelationOptimizer.java:193)
          at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:100)
          at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7384)
          at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
          at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
          at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
          at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
          at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
          at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
          at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
          at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
          at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

          Yin Huai added a comment -

          testQueries.2.q contains three test queries.

          Yin Huai added a comment -

          The diff on Review Board has also been updated (https://reviews.apache.org/r/2001/).

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/
          -----------------------------------------------------------

          (Updated 2011-12-29 18:50:12.277210)

          Review request for hive.

          Summary
          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.
          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs (updated)


          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1224666
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1224666
          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1224666
          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1224666
          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1224666
          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1224666
          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1224666

          Diff: https://reviews.apache.org/r/2001/diff

          Testing (updated)
          -------

          Thanks,

          Yin

          Yin Huai added a comment -

          A new version of the patch and the test queries are available. I also updated the diff in the review request (link: https://reviews.apache.org/r/2001/).

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/
          -----------------------------------------------------------

          (Updated 2011-12-05 19:12:23.087778)

          Review request for hive.

          Changes
          -------

          CorrelationReduceSinkOperator has been merged into ReduceSinkOperator. Detailed comments have been added to the new operator.

          Summary
          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.
          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs (updated)


          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationFakeReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationManualForwardOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationFakeReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationManualForwardDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1210283
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1210283
          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1210283
          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1210283
          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1210283
          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1210283
          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1210283

          Diff: https://reviews.apache.org/r/2001/diff

          Testing (updated)
          -------

          The previous version of the diff passed all unit tests. Since the latest trunk (r1209696) cannot finish all of the unit tests, the latest version of the diff has not been tested.

          Thanks,

          Yin

          Yin Huai added a comment -

          This is a work-in-progress patch. There are two issues to address before the next version of the patch. First, I will look into whether I can remove the operator FakeReduceSinkOperator. Second, I will look into whether the correlation optimizer can be a one-phase optimizer instead of a two-phase one. The current implementation calls the correlation optimizer twice (at the beginning and at the end of optimization, respectively).

          Yin Huai added a comment -

          Submitted a review request. The link is https://reviews.apache.org/r/2001/.

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2001/
          -----------------------------------------------------------

          Review request for hive.

          Summary
          -------

          This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.

          This addresses bug HIVE-2206.
          https://issues.apache.org/jira/browse/HIVE-2206

          Diffs


          trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1173271
          trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationDispatchOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationFakeReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationManualForwardOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReduceSinkOperator.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationGenMRRedSink1.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationDispatchDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationFakeReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationManualForwardDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReduceSinkDesc.java PRE-CREATION
          trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1173271
          trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCountDistinct.java PRE-CREATION
          trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1173271
          trunk/ql/src/test/results/clientpositive/show_functions.q.out 1173271
          trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1173271
          trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1173271
          trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1173271
          trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1173271

          Diff: https://reviews.apache.org/r/2001/diff

          Testing
          -------

          Ran all unit tests

          Thanks,

          Yin

          Yin Huai added a comment -

          All unit test cases pass except TestParse_groupby1, TestParse_groupby2, TestParse_groupby3 and TestParse_groupby5. Because I changed the method "genGroupByPlanReduceSinkOperator" of class "SemanticAnalyzer", the expected results of these four cases should be updated. However, I found that when these four cases are run individually, the results differ from those produced when they are run together with all other cases (using "ant clean package test tar -logfile ant.log"). The Hive trunk I checked out from svn also has this issue. Is it a bug, or did I miss anything?

          This patch does not contain updated results for the cases TestParse_groupby1, TestParse_groupby2, TestParse_groupby3 and TestParse_groupby5.

          Yin Huai added a comment -

          @Ashutosh, Thanks

          Ashutosh Chauhan added a comment -

          @Yin,

          To overwrite current results you can do the following:

          ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q -Doverwrite=true
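
          The four failing cases in this thread are in TestParse, so the same command could be repeated for each of them. The following is only a sketch: it assumes TestParse accepts the same -Dqfile and -Doverwrite properties as TestCliDriver (not confirmed in this thread), and DRY_RUN and overwrite_results are hypothetical names introduced for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: regenerate expected results for the four TestParse cases
# named in this thread.
# ASSUMPTION: TestParse honors -Dqfile and -Doverwrite the same way
# TestCliDriver does. DRY_RUN is a switch of this sketch only; it is
# not an ant or Hive option.
DRY_RUN="${DRY_RUN:-1}"

# overwrite_results TESTCASE QFILE...
# Runs (or, in dry-run mode, prints) one ant invocation per query file.
overwrite_results() {
  testcase="$1"
  shift
  for qfile in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      # Print the command so it can be reviewed before anything is overwritten.
      echo "ant test -Dtestcase=$testcase -Dqfile=$qfile -Doverwrite=true"
    else
      # Stop at the first failing invocation.
      ant test -Dtestcase="$testcase" -Dqfile="$qfile" -Doverwrite=true || return 1
    fi
  done
}

overwrite_results TestParse groupby1.q groupby2.q groupby3.q groupby5.q
```

          Setting DRY_RUN=0 runs the commands for real from the Hive source root. Overwritten result files should be reviewed with svn diff before they are added to a patch, since -Doverwrite=true accepts whatever the current build produces.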
          
          Yin Huai added a comment -

          In the unit tests, there are four failures in TestParse (groupby1, groupby2, groupby3 and groupby5). These four failures are caused by changes I made to the method "genGroupByPlanReduceSinkOperator" in the class "SemanticAnalyzer". The expected results should be updated, but I am not sure how to update them. I need some suggestions. Thanks.

          Yin Huai added a comment -

          I used "ant clean package test tar -logfile ant.log" to test all cases again, and all the unknown errors are gone. Four failures are left (groupby1, groupby2, groupby3 and groupby5). These four failures are caused by changes I made to the method "genGroupByPlanReduceSinkOperator" in the class "SemanticAnalyzer", so the expected results should be updated. But I am not sure how to update them. Does overwriting the current results with the new results work? Or is there anything else I should do?

          Yin Huai added a comment -

          There are some failures in TestCliDriver. In some of them, the output seems to be missing lines when the keyword "explain" is used. Other failures are related to queries with indexes, e.g. index_quth.q. When I ran index_quth.q manually, there were two errors: "java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver" and "java.lang.NoClassDefFoundError: javaewah/EWAHCompressedBitmap".

          These errors seem unrelated to the changes I made, but something must be wrong...

          @Yongqiang: Can you have a look at my patch and give me some suggestions on how to fix it? Thanks.

          Yin Huai added a comment -

          Found a bug in HIVE-2206.1.patch.txt. I will upload an updated version later.

          Yin Huai added a comment -

          I tested three queries: TPC-H Q17, Q18 and the left-outer-join sub-tree in Q21. You can check the query plan trees in the paper http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

          Here are the results (execution time in seconds, with the optimizer disabled and enabled).
          Query         disabled(s)   enabled(s)
          Q17           1288.917      655.07
          Q18           1731.734      911.761
          Q21 subtree   1865.597      658.58
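
          To put the timings above in relative terms, the following minimal sketch computes the speedup for each query. The numbers are copied from the results; the names `timings` and `speedup` are only illustrative.

```python
# Execution times in seconds, copied from the results above:
# (optimizer disabled, optimizer enabled).
timings = {
    "Q17": (1288.917, 655.07),
    "Q18": (1731.734, 911.761),
    "Q21 subtree": (1865.597, 658.58),
}


def speedup(disabled_s, enabled_s):
    """Return how many times faster the run with the optimizer enabled is."""
    return disabled_s / enabled_s


speedups = {query: speedup(d, e) for query, (d, e) in timings.items()}
for query, factor in speedups.items():
    print(f"{query}: {factor:.2f}x")
```

          On these measurements the optimizer gives roughly a 1.9x to 2.8x reduction in execution time, with the largest gain on the Q21 subtree.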

          Yin Huai added a comment -

          The patch is ready. The queries used in testing are included in the file testQueries.q.

          Yin Huai added a comment -

          OK. I will start cleaning up my code and upload an updated patch soon.

          He Yongqiang added a comment -

          OK. How about just "correlation"?
          Also, can you check whether it is possible to do the optimization as part of the physical optimizer? The current patch needs a lot of code cleanup.

          Yin Huai added a comment -

          Yongqiang: I will change the name of the optimizer. But I'd prefer a name like "query correlation detector" or "multi-query optimizer", because I think the name "cooperative scan" limits the scope of this optimizer. Besides shared scans, if the ReduceSinkOperators of two chained Hive-generated MapReduce jobs share the same key(s), this optimizer can merge the second job into the reduce phase of the first job.

          I will upload a patch by this Sunday.

          He Yongqiang added a comment -

          Cool! Yin, please let us know when you are mostly done. One small thing: in the Hive code, let's call the new optimizer "cooperative scan" instead of YSmart. But we can add the paper reference in a comment.

          Yin Huai added a comment -

          The patch is almost finished. I did a preliminary test based on TPC-H Q17 and Q18. My machine has a quad-core Intel Xeon X3220 processor (2.4 GHz), 4GB of RAM, a 500GB hard disk and Ubuntu 11.04. With scale factor 10, the execution time of Q17 is 1216.94s without the patch versus 713.581s with the patch, and that of Q18 is 1737.18s without the patch versus 867.334s with the patch.

          I am facing an issue that I have not found a good way to solve. Suppose that we have the query "SELECT * FROM (SELECT L.c1 as c11, R.c2 as c12 FROM L JOIN R ON L.c1=R.C2) t1 JOIN (SELECT R.c1 as c21, count(distinct R.c2) as c22 FROM R GROUP BY R.c1) t2 ON t1.c11=t2.c21". In this query, only one MapReduce job is necessary. However, because Hive uses R.c1 and R.c2 as the key columns of the original ReduceSinkOperator for the sub-query involving the distinct count function, it is impossible to merge the MapReduce jobs of the two sub-queries into one. To optimize this kind of query, I wrote a new UDF, count_distinct(...), e.g. count_distinct(R.c2). This count_distinct function uses a HashSet to get the number of distinct records. Is there a better solution for optimizing this kind of query? Thanks.
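
          The idea behind count_distinct can be illustrated with a small sketch. This is not the UDF from the patch; it is a plain Python approximation in which a `set` plays the role of the HashSet, and the function name `count_distinct_per_key` and the sample rows are assumptions made for illustration. The point is that records are grouped only by R.c1, and distinct values of R.c2 are tracked in memory instead of being part of the shuffle key.

```python
from collections import defaultdict


def count_distinct_per_key(rows):
    """rows: iterable of (c1, c2) pairs from R.

    Returns {c1: number of distinct c2 values}, grouping on c1 only.
    """
    seen = defaultdict(set)  # one in-memory set of c2 values per c1
    for c1, c2 in rows:
        seen[c1].add(c2)
    return {c1: len(values) for c1, values in seen.items()}


# Hypothetical contents of R as (c1, c2) pairs.
R = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
counts = count_distinct_per_key(R)
```

          One trade-off of this approach is that the set for each group must fit in memory, so it may not suit groups with a very large number of distinct values.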

          Yin Huai added a comment -

          The current optimizer can identify correlations in query plan trees structured like that of TPC-H Q17 (in the attached file Queries). Using Q17 as an example, the sub-query "SELECT l_partkey as t_partkey, 0.2 * avg(l_quantity) AS t_avg_quantity FROM lineitem GROUP BY l_partkey" (denoted sub-Q1 and originally executed by MapReduce job J1) is correlated with the sub-query "SELECT l_quantity, l_partkey, l_extendedprice FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey AND p.p_brand = 'Brand#52' AND p.p_container = 'JUMBO CAN'" (denoted sub-Q2 and originally executed by MapReduce job J2), because (1) sub-Q1 and sub-Q2 share the same input, 'lineitem'; and (2) the ReduceSinkOperators in J1 and J2 share the same 'key', which is l_partkey (p_partkey). Also, because the intermediate tables generated by sub-Q1 and sub-Q2 are joined by a MapReduce job J3 whose ReduceSinkOperator 'key' is 'l_partkey', J3 is correlated with J1 and J2. Thus, J1, J2 and J3 can be merged into one MapReduce job J'. In the map function of J', a composite operator executes the FilterOperators (if any) for sub-Q1 and sub-Q2. Then, in the reduce function of J', a dispatch operator dispatches reduce-input records to the GroupByOperator of J1 and the JoinOperator of J2. Finally, the results of the GroupByOperator and the JoinOperator are fed to the JoinOperator of J3.
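
          The merged job J' described above can be simulated outside Hive with a short sketch. It is only an illustration of the data flow under simplifying assumptions: the function names `merged_reduce` and `run_merged_job`, the record tags, and the sample rows are all hypothetical, and none of this is Hive operator code.

```python
from collections import defaultdict


def merged_reduce(partkey, tagged_records):
    """One reduce call of the merged job J' for a single l_partkey value."""
    quantities = []     # input of the group-by (sub-Q1)
    part_rows = []      # inputs of the join (sub-Q2)
    lineitem_rows = []
    # Dispatch step: route each reduce-input record to the operators that need it.
    for tag, row in tagged_records:
        if tag == "lineitem":
            quantities.append(row["l_quantity"])
            lineitem_rows.append(row)
        elif tag == "part":
            part_rows.append(row)
    if not quantities:
        return []
    # sub-Q1: 0.2 * avg(l_quantity) for this key.
    t_avg_quantity = 0.2 * sum(quantities) / len(quantities)
    # sub-Q2 (part JOIN lineitem) followed by the join originally done in J3.
    return [
        (partkey, t_avg_quantity, l["l_quantity"], l["l_extendedprice"])
        for _ in part_rows
        for l in lineitem_rows
    ]


def run_merged_job(part, lineitem):
    shuffled = defaultdict(list)
    for p in part:  # map side: the filter that belongs to sub-Q2
        if p["p_brand"] == "Brand#52" and p["p_container"] == "JUMBO CAN":
            shuffled[p["p_partkey"]].append(("part", p))
    for l in lineitem:  # lineitem is scanned once for both sub-queries
        shuffled[l["l_partkey"]].append(("lineitem", l))
    out = []
    for key in sorted(shuffled):
        out.extend(merged_reduce(key, shuffled[key]))
    return out


part = [
    {"p_partkey": 1, "p_brand": "Brand#52", "p_container": "JUMBO CAN"},
    {"p_partkey": 2, "p_brand": "Brand#11", "p_container": "JUMBO CAN"},
]
lineitem = [
    {"l_partkey": 1, "l_quantity": 10, "l_extendedprice": 100.0},
    {"l_partkey": 1, "l_quantity": 30, "l_extendedprice": 300.0},
    {"l_partkey": 2, "l_quantity": 5, "l_extendedprice": 50.0},
]
result = run_merged_job(part, lineitem)
```

          Because all three original jobs partition on the same key, a single shuffle is enough: every record needed by the group-by and by both joins for a given l_partkey arrives in the same reduce call.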

          This optimizer has several open issues.

          1: In the MapReduce job that executes the correlated MapReduce jobs, intermediate key/value pairs are consumed by multiple operators, so map-side aggregation is disabled.
          2: In the MapReduce job that executes the correlated MapReduce jobs, if the execution paths in the reduce function do not all have the same depth (for example "SELECT * FROM lineitem l1 JOIN (SELECT l_partkey FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey) tmp ON l1.l_partkey = tmp.l_partkey"), one or more YSmartForwardOperators should be used. I have not completely solved this issue.
          3: For two independent MapReduce jobs J1 and J2, the current correlation identifier only searches for ReduceSinkOperators with the same 'key(s)'. In fact, if the set of 'key(s)' of the ReduceSinkOperator in J1 is a subset of that in J2, these two MapReduce jobs are also correlated. (Sub-queries with the distinct keyword associated with a Group By clause also fall under this issue, since the distinct keyword is handled by using all columns as 'keys' in the corresponding ReduceSinkOperator.)
          4: The current correlation identifier cannot identify correlations represented by columns involving "max(<column name>)" or "min(<column name>)".
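
          Issue 3 contrasts two ways of comparing ReduceSinkOperator keys. The sketch below states both checks side by side; it is not the correlation identifier from the patch, and the function names `same_keys` and `subset_keys` and the column lists are hypothetical.

```python
def same_keys(keys_j1, keys_j2):
    """Strict check: both jobs shuffle on exactly the same key columns."""
    return list(keys_j1) == list(keys_j2)


def subset_keys(keys_j1, keys_j2):
    """Relaxed check from issue 3: the keys of J1 are a subset of the keys of J2."""
    return set(keys_j1) <= set(keys_j2)


# A join keyed on c1, and a group-by with count(distinct c2) whose
# ReduceSinkOperator uses both c1 and c2 as keys.
join_keys = ["c1"]
distinct_group_by_keys = ["c1", "c2"]

strict = same_keys(join_keys, distinct_group_by_keys)
relaxed = subset_keys(join_keys, distinct_group_by_keys)
```

          With the strict check the two jobs are not reported as correlated, while the subset check accepts them, which is the case the distinct example in issue 3 describes.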

          I will start working on this optimizer in August and will first solve issues 2-4 mentioned above.

          Yin Huai added a comment -

          Two queries (TPC-H Q17 and TPC-H Q18) can be used for testing this optimizer. Q17 is the same as the query provided in https://issues.apache.org/jira/browse/HIVE-600, but Q18 is modified to expose the correlation. With this optimizer, Q17 and Q18 need 2 and 4 MapReduce jobs, respectively. Without this optimizer, these two queries need 4 and 8 MapReduce jobs, respectively.

          He Yongqiang added a comment -

          A draft patch. I will submit a revised version later.


            People

            • Assignee:
              Yin Huai
              Reporter:
              He Yongqiang
            • Votes:
              0
              Watchers:
              40

                Development