Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: Query Processor
    • Labels:

      Description

      Before implementing optimizer for JOIN-GBY, try to implement RS-GBY optimizer(cluster-by following group-by).

      1. HIVE-2340.1.patch.txt
        31 kB
        Navis
      2. ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.1.patch
        58 kB
        Phabricator
      3. ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.2.patch
        57 kB
        Phabricator
      4. ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.3.patch
        59 kB
        Phabricator
      5. ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.4.patch
        249 kB
        Phabricator
      6. ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.5.patch
        299 kB
        Phabricator
      7. HIVE-2340.D1209.6.patch
        307 kB
        Phabricator
      8. HIVE-2340.D1209.7.patch
        353 kB
        Phabricator
      9. HIVE-2340.D1209.8.patch
        191 kB
        Phabricator
      10. HIVE-2340.D1209.9.patch
        196 kB
        Phabricator
      11. testclidriver.txt
        27 kB
        Gunther Hagleitner
      12. HIVE-2340.D1209.10.patch
        208 kB
        Phabricator
      13. HIVE-2340.D1209.11.patch
        203 kB
        Phabricator
      14. HIVE-2340.12.patch
        228 kB
        Gunther Hagleitner
      15. HIVE-2340.13.patch
        242 kB
        Gunther Hagleitner
      16. HIVE-2340.D1209.12.patch
        286 kB
        Phabricator
      17. HIVE-2340.D1209.13.patch
        422 kB
        Phabricator
      18. HIVE-2340.14.patch
        239 kB
        Gunther Hagleitner
      19. HIVE-2340.D1209.14.patch
        212 kB
        Phabricator
      20. HIVE-2340.14.rebased_and_schema_clone.patch
        215 kB
        Gunther Hagleitner
      21. HIVE-2340.D1209.15.patch
        212 kB
        Phabricator
      22. HIVE-2340.15.patch
        216 kB
        Navis

        Issue Links

          Activity

          Hide
          Navis added a comment -

          I need the key.

          Show
          Navis added a comment - I need the key.
          Hide
          Navis added a comment -

          initial patch

          Show
          Navis added a comment - initial patch
          Hide
          Navis added a comment -

          extended to accept general key expressions including UDFs

          Show
          Navis added a comment - extended to accept general key expressions including UDFs
          Hide
          Navis added a comment -
          Show
          Navis added a comment - https://reviews.apache.org/r/2991/
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2991/
          -----------------------------------------------------------

          (Updated 2011-12-06 11:02:00.777597)

          Review request for hive and Carl Steinbach.

          Changes
          -------

          I've overrode existing optimizer by mistake. Fixed it.

          Summary
          -------

          Mostly copied from existing code. Not tested intensively yet, but it is seemed to be used frequently for us.

          This addresses bug HIVE-2340.
          https://issues.apache.org/jira/browse/HIVE-2340

          Diffs (updated)


          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 82a141d
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java e91b4d5
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q PRE-CREATION
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/2991/diff

          Testing
          -------

          new test cases added : reduce_deduplicate_extended.q

          Thanks,

          Navis

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2991/ ----------------------------------------------------------- (Updated 2011-12-06 11:02:00.777597) Review request for hive and Carl Steinbach. Changes ------- I've overrode existing optimizer by mistake. Fixed it. Summary ------- Mostly copied from existing code. Not tested intensively yet, but it is seemed to be used frequently for us. This addresses bug HIVE-2340 . https://issues.apache.org/jira/browse/HIVE-2340 Diffs (updated) ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 82a141d ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java e91b4d5 ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q PRE-CREATION ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out PRE-CREATION Diff: https://reviews.apache.org/r/2991/diff Testing ------- new test cases added : reduce_deduplicate_extended.q Thanks, Navis
          Hide
          Phabricator added a comment -

          navis requested code review of "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          DPAL-592 optimize orderby followed by a groupby

          Before implementing optimizer for JOIN-GBY, try to implement RS-GBY optimizer(cluster-by following group-by).

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExplainTask.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out

          MANAGE HERALD DIFFERENTIAL RULES
          https://reviews.facebook.net/herald/view/differential/

          WHY DID I GET THIS EMAIL?
          https://reviews.facebook.net/herald/transcript/2499/

          Tip: use the X-Herald-Rules header to filter Herald messages in your client.

          Show
          Phabricator added a comment - navis requested code review of " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA DPAL-592 optimize orderby followed by a groupby Before implementing optimizer for JOIN-GBY, try to implement RS-GBY optimizer(cluster-by following group-by). TEST PLAN EMPTY REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/exec/ExplainTask.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out MANAGE HERALD DIFFERENTIAL RULES https://reviews.facebook.net/herald/view/differential/ WHY DID I GET THIS EMAIL? https://reviews.facebook.net/herald/transcript/2499/ Tip: use the X-Herald-Rules header to filter Herald messages in your client.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          Fixed failure of join18.q (was RS-GBY-RS-JOIN case).
          Also implemented JOIN-GBY optimizer with map aggregation which improved performace for TPC-H Q13 slightly.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA Fixed failure of join18.q (was RS-GBY-RS-JOIN case). Also implemented JOIN-GBY optimizer with map aggregation which improved performace for TPC-H Q13 slightly. REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          fix test failures : auto_join26.q, ql_rewrite_gbtoidx.q, udtf_json_tuple.q, union24.q

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/queries/clientpositive/udtf_json_tuple.q
          ql/src/test/queries/clientpositive/union24.q
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA fix test failures : auto_join26.q, ql_rewrite_gbtoidx.q, udtf_json_tuple.q, union24.q REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/queries/clientpositive/udtf_json_tuple.q ql/src/test/queries/clientpositive/union24.q ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          Deduplicates RS-RS more aggressively.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join0.q.out
          ql/src/test/results/clientpositive/auto_join10.q.out
          ql/src/test/results/clientpositive/auto_join11.q.out
          ql/src/test/results/clientpositive/auto_join12.q.out
          ql/src/test/results/clientpositive/auto_join13.q.out
          ql/src/test/results/clientpositive/auto_join15.q.out
          ql/src/test/results/clientpositive/auto_join16.q.out
          ql/src/test/results/clientpositive/auto_join18.q.out
          ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out
          ql/src/test/results/clientpositive/auto_join20.q.out
          ql/src/test/results/clientpositive/auto_join22.q.out
          ql/src/test/results/clientpositive/auto_join24.q.out
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/auto_join27.q.out
          ql/src/test/results/clientpositive/auto_join30.q.out
          ql/src/test/results/clientpositive/auto_join31.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/udtf_json_tuple.q.out
          ql/src/test/results/clientpositive/union24.q.out

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA Deduplicates RS-RS more aggressively. REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join0.q.out ql/src/test/results/clientpositive/auto_join10.q.out ql/src/test/results/clientpositive/auto_join11.q.out ql/src/test/results/clientpositive/auto_join12.q.out ql/src/test/results/clientpositive/auto_join13.q.out ql/src/test/results/clientpositive/auto_join15.q.out ql/src/test/results/clientpositive/auto_join16.q.out ql/src/test/results/clientpositive/auto_join18.q.out ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out ql/src/test/results/clientpositive/auto_join20.q.out ql/src/test/results/clientpositive/auto_join22.q.out ql/src/test/results/clientpositive/auto_join24.q.out ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/auto_join27.q.out ql/src/test/results/clientpositive/auto_join30.q.out ql/src/test/results/clientpositive/auto_join31.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/udtf_json_tuple.q.out ql/src/test/results/clientpositive/union24.q.out
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          Optimizer for GroupBy-GroupBy case added and several bug fixes.

          Updated test results are affected by this optimizer, reducing MR stages. Confirmed query result values are same.
          Except that, all tests are passes, except ppr_pudown.q (I'm working on it, too)

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join0.q.out
          ql/src/test/results/clientpositive/auto_join10.q.out
          ql/src/test/results/clientpositive/auto_join11.q.out
          ql/src/test/results/clientpositive/auto_join12.q.out
          ql/src/test/results/clientpositive/auto_join13.q.out
          ql/src/test/results/clientpositive/auto_join15.q.out
          ql/src/test/results/clientpositive/auto_join16.q.out
          ql/src/test/results/clientpositive/auto_join18.q.out
          ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out
          ql/src/test/results/clientpositive/auto_join20.q.out
          ql/src/test/results/clientpositive/auto_join22.q.out
          ql/src/test/results/clientpositive/auto_join24.q.out
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/auto_join27.q.out
          ql/src/test/results/clientpositive/auto_join30.q.out
          ql/src/test/results/clientpositive/auto_join31.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/input39.q.out
          ql/src/test/results/clientpositive/join40.q.out
          ql/src/test/results/clientpositive/metadataonly1.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/udtf_json_tuple.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA Optimizer for GroupBy-GroupBy case added and several bug fixes. Updated test results are affected by this optimizer, reducing MR stages. Confirmed query result values are same. Except that, all tests are passes, except ppr_pudown.q (I'm working on it, too) REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join0.q.out ql/src/test/results/clientpositive/auto_join10.q.out ql/src/test/results/clientpositive/auto_join11.q.out ql/src/test/results/clientpositive/auto_join12.q.out ql/src/test/results/clientpositive/auto_join13.q.out ql/src/test/results/clientpositive/auto_join15.q.out ql/src/test/results/clientpositive/auto_join16.q.out ql/src/test/results/clientpositive/auto_join18.q.out ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out ql/src/test/results/clientpositive/auto_join20.q.out ql/src/test/results/clientpositive/auto_join22.q.out ql/src/test/results/clientpositive/auto_join24.q.out ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/auto_join27.q.out ql/src/test/results/clientpositive/auto_join30.q.out ql/src/test/results/clientpositive/auto_join31.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/input39.q.out ql/src/test/results/clientpositive/join40.q.out ql/src/test/results/clientpositive/metadataonly1.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/udtf_json_tuple.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml
          Hide
          Ashutosh Chauhan added a comment -

          Navis,
          This is a useful optimization. Patch doesn't apply cleanly. If you update the patch, I will take a look.

          Show
          Ashutosh Chauhan added a comment - Navis, This is a useful optimization. Patch doesn't apply cleanly. If you update the patch, I will take a look.
          Hide
          Navis added a comment -

          @Ashutosh
          I've abandoned working on this cause HIVE-2206 is superset of it. Would it be better to do this as a brief mending before that?

          Show
          Navis added a comment - @Ashutosh I've abandoned working on this cause HIVE-2206 is superset of it. Would it be better to do this as a brief mending before that?
          Hide
          Ashutosh Chauhan added a comment -

          Navis I played with HIVE-2206 over last couple of days and I think your work is complimentary and useful because there are several cases which this patch is already optimizing which HIVE-2206 doesn't currently. For e.g., order-by followed by groupby, RS followed by GBY, JOIN-GROUPBY, RS-RS, GBY-GBY. This covers a lot of ground, so I suggest we move forward with this.

          Show
          Ashutosh Chauhan added a comment - Navis I played with HIVE-2206 over last couple of days and I think your work is complimentary and useful because there are several cases which this patch is already optimizing which HIVE-2206 doesn't currently. For e.g., order-by followed by groupby, RS followed by GBY, JOIN-GROUPBY, RS-RS, GBY-GBY. This covers a lot of ground, so I suggest we move forward with this.
          Hide
          Yin Huai added a comment -

          Ashutosh Chauhan Can you let me know cases of JOIN-GROUPBY and GBY-GBY which 2206 does not optimize? Also, does RS refer to those queries with "cluster by", "sort by" or "distributed by"? Thanks.

          Show
          Yin Huai added a comment - Ashutosh Chauhan Can you let me know cases of JOIN-GROUPBY and GBY-GBY which 2206 does not optimize? Also, does RS refer to those queries with "cluster by", "sort by" or "distributed by"? Thanks.
          Hide
          Ashutosh Chauhan added a comment -

          Yeah, correct JOIN-GBY and GBY-GBY are taken care of in ysmart also. Its the group-by followed by order-by case which is also of interest to me, which this already covers.

          Besides the scenario covered by these two patches, I am also comparing the approaches taken in these two. I have just briefly looked at this patch, but fundamental difference which I can make out in this approach Vs ysmart approach is that here RS is deduplicated that is completely removed from operator pipeline, wherever it could be (i.e. when keys of subsequent RS is superset of the earlier one) thus fusing multiple MR jobs. Ysmart on the other hand instead replaces the second RS with a new operator its introducing (LocalSimulatedReduceSink?) which fakes the RS but doesn't let the plan split in 2 MR jobs and thus generating one MR job. I haven't thought through completely on this, but on initial pass it seems like approach of this patch is better than ysmart because:

          • Here you don't need a new operator.
          • Here you are simplifying the plan by eliminating the operators as oppose to ysmart which is replacing the operator thereby increasing the complexity of plan (by having a new type of operator)
          • In that new operator ysmart currently serializes and deserializes the data through that operator, thereby unnecessarily introducing performance penalty. Granted this could be improved, but problem doesn't exist in patch proposed on this jira to begin with.

          Though there are certainly other scenarios which ysmart can cover (Yin, can you list those) which this patch is not covering, but for the scenarios that are common this approach seems to be better.

          There might be other differences in the approach, please feel free to raise those.

          Show
          Ashutosh Chauhan added a comment - Yeah, correct JOIN-GBY and GBY-GBY are taken care of in ysmart also. Its the group-by followed by order-by case which is also of interest to me, which this already covers. Besides the scenario covered by these two patches, I am also comparing the approaches taken in these two. I have just briefly looked at this patch, but fundamental difference which I can make out in this approach Vs ysmart approach is that here RS is deduplicated that is completely removed from operator pipeline, wherever it could be (i.e. when keys of subsequent RS is superset of the earlier one) thus fusing multiple MR jobs. Ysmart on the other hand instead replaces the second RS with a new operator its introducing (LocalSimulatedReduceSink?) which fakes the RS but doesn't let the plan split in 2 MR jobs and thus generating one MR job. I haven't thought through completely on this, but on initial pass it seems like approach of this patch is better than ysmart because: Here you don't need a new operator. Here you are simplifying the plan by eliminating the operators as oppose to ysmart which is replacing the operator thereby increasing the complexity of plan (by having a new type of operator) In that new operator ysmart currently serializes and deserializes the data through that operator, thereby unnecessarily introducing performance penalty. Granted this could be improved, but problem doesn't exist in patch proposed on this jira to begin with. Though there are certainly other scenarios which ysmart can cover (Yin, can you list those) which this patch is not covering, but for the scenarios that are common this approach seems to be better. There might be other differences in the approach, please feel free to raise those.
          Hide
          Yin Huai added a comment -

          Let me explain the reason that I introduced the fake RS operator instead of just removing the original RS. When I was developing the patch for 2206, I found that the aggregation operator (GBY) and the join operator (JOIN) use different logic on processing rows forwarded to it. Although they both buffer rows, a GBY determines if it need to forward results to its children in processOp. While, a JOIN replies on endGroup to know when it should forward results. When we have plans like GBY-GBY or JOIN-GBY, that difference on processing logic is fine. However, when we have plan like

          GBY----                    GBY----
                 \                          \
                  ----JOIN    or             ----JOIN
                 /                          /
          GBY----                    JOIN---
          

          We need operators between the child JOIN and parent GBYs and JOINs to make sure JOIN process rows in a correct way. This is also the reason that in CorrelationLocalSimulativeReduceSinkOperator, it determines when to start the group of its children in processOp and leave a empty startGroup and endGroup.

          Also, by replacing RSs with those fake RSs, I do not need to touch those GBYs and JOINs which will be merged into the same Reduce phase. Since the input of the first operator in the Reduce side is in the format of [key, value, tag], so I use those fake RSs to generate rows in the same format.

          But this part of work was implemented about almost 2 years ago. Definitely let me know if anything has been changed and this fake RS is no longer needed.

          Show
          Yin Huai added a comment - Let me explain the reason that I introduced the fake RS operator instead of just removing the original RS. When I was developing the patch for 2206, I found that the aggregation operator (GBY) and the join operator (JOIN) use different logic on processing rows forwarded to it. Although they both buffer rows, a GBY determines if it need to forward results to its children in processOp. While, a JOIN replies on endGroup to know when it should forward results. When we have plans like GBY-GBY or JOIN-GBY, that difference on processing logic is fine. However, when we have plan like GBY---- GBY---- \ \ ----JOIN or ----JOIN / / GBY---- JOIN--- We need operators between the child JOIN and parent GBYs and JOINs to make sure JOIN process rows in a correct way. This is also the reason that in CorrelationLocalSimulativeReduceSinkOperator, it determines when to start the group of its children in processOp and leave a empty startGroup and endGroup. Also, by replacing RSs with those fake RSs, I do not need to touch those GBYs and JOINs which will be merged into the same Reduce phase. Since the input of the first operator in the Reduce side is in the format of [key, value, tag] , so I use those fake RSs to generate rows in the same format. But this part of work was implemented about almost 2 years ago. Definitely let me know if anything has been changed and this fake RS is no longer needed.
          Hide
          Yin Huai added a comment -

          The current implementation of the patch of YSmart covers scenarios when a join or aggregation operator share the same partition keys with its all parents (join or aggregation operators).
          For example, a single MR job will be generated if all operators in the following plan share the same partition keys.

          JOIN----                    
                 \                          
                  ----JOIN----  
                 /            \              
          GBY----              \
                                ----JOIN
                               /
          GBY--- -------------/
          

          Also, it requires that the bottom join or aggregation operators which will be processed in the same MR job take input tables instead of intermediate tables. In future, it should be extended to cover scenarios that involve intermediate tables, that correlated operators share common partition keys (not exactly the same keys), and that a join or aggregation operator share common keys with some of its parents.

          Show
          Yin Huai added a comment - The current implementation of the patch of YSmart covers scenarios when a join or aggregation operator share the same partition keys with its all parents (join or aggregation operators). For example, a single MR job will be generated if all operators in the following plan share the same partition keys. JOIN---- \ ----JOIN---- / \ GBY---- \ ----JOIN / GBY--- -------------/ Also, it requires that the bottom join or aggregation operators which will be processed in the same MR job take input tables instead of intermediate tables. In future, it should be extended to cover scenarios that involve intermediate tables, that correlated operators share common partition keys (not exactly the same keys), and that a join or aggregation operator share common keys with some of its parents.
          Hide
          Ashutosh Chauhan added a comment -

          Thanks Yin for explaining. Your ASCII art helped in understanding the differences : ) I better understand the reason for the fake new operator now. I think in cases you have pointed out when there is such kind of trees, this reduce deduplication approach won't help, since it looks at linear chain of RS and eliminates the one where it could. You would need a fake operator in such case because you don't want to modify the GBY or Join operators which make sense. I see the merits of Ysmart better now.

          Though, on the other hand patch on this jira is still useful and complementary to ysmart. Since, it will collapse linear RS, instead of adding fake ones. In addition to collapsing of those operators, it will also make the life of ysmart easier because than ysmart will be dealing with simpler plans with reduce sinks already deduplicated. We need to make sure reducededup rule fires before ysmart for both optimizations to play nicely. So, I think we should make progress on both these patches.

          Navis Will you like to refresh this patch?

          Show
          Ashutosh Chauhan added a comment - Thanks Yin for explaining. Your ASCII art helped in understanding the differences : ) I better understand the reason for the fake new operator now. I think in cases you have pointed out when there is such kind of trees, this reduce deduplication approach won't help, since it looks at linear chain of RS and eliminates the one where it could. You would need a fake operator in such case because you don't want to modify the GBY or Join operators which make sense. I see the merits of Ysmart better now. Though, on the other hand patch on this jira is still useful and complementary to ysmart. Since, it will collapse linear RS, instead of adding fake ones. In addition to collapsing of those operators, it will also make the life of ysmart easier because than ysmart will be dealing with simpler plans with reduce sinks already deduplicated. We need to make sure reducededup rule fires before ysmart for both optimizations to play nicely. So, I think we should make progress on both these patches. Navis Will you like to refresh this patch?
          Hide
          Yin Huai added a comment -

          Yes, I agree.

          Show
          Yin Huai added a comment - Yes, I agree.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          Rebased to trunk but I can't sure of this

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join0.q.out
          ql/src/test/results/clientpositive/auto_join10.q.out
          ql/src/test/results/clientpositive/auto_join11.q.out
          ql/src/test/results/clientpositive/auto_join12.q.out
          ql/src/test/results/clientpositive/auto_join13.q.out
          ql/src/test/results/clientpositive/auto_join15.q.out
          ql/src/test/results/clientpositive/auto_join16.q.out
          ql/src/test/results/clientpositive/auto_join18.q.out
          ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out
          ql/src/test/results/clientpositive/auto_join20.q.out
          ql/src/test/results/clientpositive/auto_join22.q.out
          ql/src/test/results/clientpositive/auto_join24.q.out
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/auto_join27.q.out
          ql/src/test/results/clientpositive/auto_join30.q.out
          ql/src/test/results/clientpositive/auto_join31.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/join40.q.out
          ql/src/test/results/clientpositive/metadataonly1.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/udtf_json_tuple.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          To: JIRA, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA Rebased to trunk but I can't sure of this REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join0.q.out ql/src/test/results/clientpositive/auto_join10.q.out ql/src/test/results/clientpositive/auto_join11.q.out ql/src/test/results/clientpositive/auto_join12.q.out ql/src/test/results/clientpositive/auto_join13.q.out ql/src/test/results/clientpositive/auto_join15.q.out ql/src/test/results/clientpositive/auto_join16.q.out ql/src/test/results/clientpositive/auto_join18.q.out ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out ql/src/test/results/clientpositive/auto_join20.q.out ql/src/test/results/clientpositive/auto_join22.q.out ql/src/test/results/clientpositive/auto_join24.q.out ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/auto_join27.q.out ql/src/test/results/clientpositive/auto_join30.q.out ql/src/test/results/clientpositive/auto_join31.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/join40.q.out ql/src/test/results/clientpositive/metadataonly1.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/udtf_json_tuple.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml To: JIRA, navis
          Hide
          Ashutosh Chauhan added a comment -

          Thanks, Navis for updating the patch. I haven't looked the patch in detail, but some comments:

          • In the latest patch, you have removed rule RS%.RS%GBY% and RS%GBY%.*RS%GBY% and have modified rule JOIN%.*RS%GBY% to JOIN%.%RS% Can you shed some light on thinking behind picking these rules? Were those rules not stable or you think they are not useful?
          • auto_join_26.q.out is incorrect, its generating wrong results. Looks like aggregation is not happening correctly.
          • I haven't ran full suite, but queries groupby_grouping_sets5.q and smb_mapjoin_14.q are failing after applying this patch.
          Show
          Ashutosh Chauhan added a comment - Thanks, Navis for updating the patch. I haven't looked the patch in detail, but some comments: In the latest patch, you have removed rule RS%. RS%GBY% and RS%GBY%.*RS%GBY% and have modified rule JOIN%.*RS%GBY% to JOIN%. %RS% Can you shed some light on thinking behind picking these rules? Were those rules not stable or you think they are not useful? auto_join_26.q.out is incorrect, its generating wrong results. Looks like aggregation is not happening correctly. I haven't ran full suite, but queries groupby_grouping_sets5.q and smb_mapjoin_14.q are failing after applying this patch.
          Hide
          Navis added a comment -

          These are not removed but just merged with other rules. For example RS-RS/GBY is currently handled by RS-RS which is now diverged by type of child of child RS. It's more simple and seemed safer because this optimizer removes or replaces child RS and following GBY if possible. But this means child RS can be removed/replaced before accessed by graph walker.

          I've ran about twenty tests before submitting. I wish I had some time to run all pending tests but a little busy recently, sorry.

          Show
          Navis added a comment - These are not removed but just merged with other rules. For example RS- RS/GBY is currently handled by RS -RS which is now diverged by type of child of child RS. It's more simple and seemed safer because this optimizer removes or replaces child RS and following GBY if possible. But this means child RS can be removed/replaced before accessed by graph walker. I've ran about twenty tests before submitting. I wish I had some time to run all pending tests but a little busy recently, sorry.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          1. Does not try RS dedup when child GBY is for grouping set.
          2. Prevent converting deduped parent JOIN to MAPJOIN (by hive.auto.convert.join=true)

          For case 2, I don't knww what is better to choose deduped-JOIN or MAPJOIN+RS
          With auto_join31.q, deduped-JOIN took 50sec and MAPJOIN+RS took 57sec. But it would be dependent to situation.

          Running test.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/test/queries/clientpositive/auto_join16.q
          ql/src/test/queries/clientpositive/auto_join22.q
          ql/src/test/queries/clientpositive/auto_join24.q
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/auto_join30.q
          ql/src/test/queries/clientpositive/auto_join31.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join0.q.out
          ql/src/test/results/clientpositive/auto_join10.q.out
          ql/src/test/results/clientpositive/auto_join11.q.out
          ql/src/test/results/clientpositive/auto_join12.q.out
          ql/src/test/results/clientpositive/auto_join13.q.out
          ql/src/test/results/clientpositive/auto_join15.q.out
          ql/src/test/results/clientpositive/auto_join16.q.out
          ql/src/test/results/clientpositive/auto_join18.q.out
          ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out
          ql/src/test/results/clientpositive/auto_join20.q.out
          ql/src/test/results/clientpositive/auto_join22.q.out
          ql/src/test/results/clientpositive/auto_join24.q.out
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/auto_join27.q.out
          ql/src/test/results/clientpositive/auto_join30.q.out
          ql/src/test/results/clientpositive/auto_join31.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/join40.q.out
          ql/src/test/results/clientpositive/metadataonly1.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/smb_mapjoin_14.q.out
          ql/src/test/results/clientpositive/udtf_json_tuple.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          To: JIRA, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA 1. Does not try RS dedup when child GBY is for grouping set. 2. Prevent converting deduped parent JOIN to MAPJOIN (by hive.auto.convert.join=true) For case 2, I don't knww what is better to choose deduped-JOIN or MAPJOIN+RS With auto_join31.q, deduped-JOIN took 50sec and MAPJOIN+RS took 57sec. But it would be dependent to situation. Running test. REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/test/queries/clientpositive/auto_join16.q ql/src/test/queries/clientpositive/auto_join22.q ql/src/test/queries/clientpositive/auto_join24.q ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/auto_join30.q ql/src/test/queries/clientpositive/auto_join31.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join0.q.out ql/src/test/results/clientpositive/auto_join10.q.out ql/src/test/results/clientpositive/auto_join11.q.out ql/src/test/results/clientpositive/auto_join12.q.out ql/src/test/results/clientpositive/auto_join13.q.out ql/src/test/results/clientpositive/auto_join15.q.out ql/src/test/results/clientpositive/auto_join16.q.out ql/src/test/results/clientpositive/auto_join18.q.out ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out ql/src/test/results/clientpositive/auto_join20.q.out ql/src/test/results/clientpositive/auto_join22.q.out ql/src/test/results/clientpositive/auto_join24.q.out ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/auto_join27.q.out ql/src/test/results/clientpositive/auto_join30.q.out ql/src/test/results/clientpositive/auto_join31.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/join40.q.out ql/src/test/results/clientpositive/metadataonly1.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/smb_mapjoin_14.q.out ql/src/test/results/clientpositive/udtf_json_tuple.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml To: JIRA, navis
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          1. Fixed big of JOIN-RS cases and prevent deduped join converted to skew-join
          2. Do not try merging RS with defined number of reducers(ORDER BY, etc)
          3. Fixed test results (rolled back some tests)

          Running test

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          To: JIRA, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA 1. Fixed big of JOIN-RS cases and prevent deduped join converted to skew-join 2. Do not try merging RS with defined number of reducers(ORDER BY, etc) 3. Fixed test results (rolled back some tests) Running test REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml To: JIRA, navis
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".
          Reviewers: JIRA

          1. Fixed test result (miniMR)
          2. Added config setting min number of reducer for merged RS

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          To: JIRA, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Reviewers: JIRA 1. Fixed test result (miniMR) 2. Added config setting min number of reducer for merged RS REVISION DETAIL https://reviews.facebook.net/D1209 AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml To: JIRA, navis
          Hide
          Gunther Hagleitner added a comment -

          FYI: Ran all unit tests on patch .9. Failing tests are: groupby_distinct_samekey.q,join31.q,reduce_deduplicate_extended.q (TestCliDriver). Failures look like outdated golden files (explain output changed). Uploaded testclidriver.txt for reference.

          Show
          Gunther Hagleitner added a comment - FYI: Ran all unit tests on patch .9. Failing tests are: groupby_distinct_samekey.q,join31.q,reduce_deduplicate_extended.q (TestCliDriver). Failures look like outdated golden files (explain output changed). Uploaded testclidriver.txt for reference.
          Hide
          Gunther Hagleitner added a comment -

          just the diff of the latest unit test run.

          Show
          Gunther Hagleitner added a comment - just the diff of the latest unit test run.
          Hide
          Phabricator added a comment -

          hagleitn has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Partial review

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:521 Not sure why this is needed or why this defaults to 4. From comment below it seems this is just to avoid the single reducer order-by case for performance reasons, is that correct?
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 Is this required or extra protection? Comment at the top of the file says mapjoin optimization happens before this (and probably should for performance reasons). Also, if I understand it correctly "joinAndSort" might be a better name than "fixed". You're basically saying that if an optimization wants to change the join after this they need to make sure the ordering of the keys is preserved, right?
          ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java:136 seems orthogonal to this patch.
          ql/src/test/queries/clientpositive/reduce_deduplicate.q:7 There are not a lot of tests, for min.reducer=1. No order by case for instance. Maybe the reduce_deduplicate_extended.q should run with both default and min.reducer=1.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          To: JIRA, navis
          Cc: hagleitn

          Show
          Phabricator added a comment - hagleitn has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Partial review INLINE COMMENTS common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:521 Not sure why this is needed or why this defaults to 4. From comment below it seems this is just to avoid the single reducer order-by case for performance reasons, is that correct? ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 Is this required or extra protection? Comment at the top of the file says mapjoin optimization happens before this (and probably should for performance reasons). Also, if I understand it correctly "joinAndSort" might be a better name than "fixed". You're basically saying that if an optimization wants to change the join after this they need to make sure the ordering of the keys is preserved, right? ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java:136 seems orthogonal to this patch. ql/src/test/queries/clientpositive/reduce_deduplicate.q:7 There are not a lot of tests, for min.reducer=1. No order by case for instance. Maybe the reduce_deduplicate_extended.q should run with both default and min.reducer=1. REVISION DETAIL https://reviews.facebook.net/D1209 To: JIRA, navis Cc: hagleitn
          Hide
          Gunther Hagleitner added a comment -

          Partial review on phabricator. Biggest question is around "hive.optimize.reducededuplication.min.reducer". That basically disables the "orderby followed by groupby" optimization which was the original motivation for the jira. Navis, can you explain this some more?

          Might be another ticket, but would it be possible to optimize group by/sort by as well with this?

          Show
          Gunther Hagleitner added a comment - Partial review on phabricator. Biggest question is around "hive.optimize.reducededuplication.min.reducer". That basically disables the "orderby followed by groupby" optimization which was the original motivation for the jira. Navis, can you explain this some more? Might be another ticket, but would it be possible to optimize group by/sort by as well with this?
          Hide
          Navis added a comment -

          @Gunther Hagleitner,
          This optimization copies configuration of childRS to parentRS, which can lead to performance problems if childRS is run on single or small number of RSs (bucketing, etc), especially when parentRS is for JOIN or GBY. Thresholding it by "min.reducer" conf seemed not good enough. Do you have some better idea?

          Show
          Navis added a comment - @Gunther Hagleitner, This optimization copies configuration of childRS to parentRS, which can lead to performance problems if childRS is run on single or small number of RSs (bucketing, etc), especially when parentRS is for JOIN or GBY. Thresholding it by "min.reducer" conf seemed not good enough. Do you have some better idea?
          Hide
          Phabricator added a comment -

          navis has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:521 Yes, it is. There should be a better configuration if possible.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 Yes, should be 'fixed' as sorted state. I'll rename it.
          I think there should be a configuration preventing MAPJOINable JOIN from being merged with next RS. But seemed hard to know if it's mapjoinable. Any idea?
          ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java:136 Yes, typos. I'll remove this from patch.
          ql/src/test/queries/clientpositive/reduce_deduplicate.q:7 ok.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          To: JIRA, navis
          Cc: hagleitn

          Show
          Phabricator added a comment - navis has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:521 Yes, it is. There should be a better configuration if possible. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 Yes, should be 'fixed' as sorted state. I'll rename it. I think there should be a configuration preventing MAPJOINable JOIN from being merged with next RS. But seemed hard to know if it's mapjoinable. Any idea? ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java:136 Yes, typos. I'll remove this from patch. ql/src/test/queries/clientpositive/reduce_deduplicate.q:7 ok. REVISION DETAIL https://reviews.facebook.net/D1209 To: JIRA, navis Cc: hagleitn
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Addressed comments

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1209?vs=26343&id=27027#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          ql/src/test/queries/clientpositive/join31.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join26.q.out
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/join31.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          To: JIRA, navis
          Cc: hagleitn

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Addressed comments Reviewers: JIRA REVISION DETAIL https://reviews.facebook.net/D1209 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1209?vs=26343&id=27027#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/groupby_distinct_samekey.q ql/src/test/queries/clientpositive/join31.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join26.q.out ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/join31.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml To: JIRA, navis Cc: hagleitn
          Hide
          Gunther Hagleitner added a comment -

          Navis: I think in general the logic should be to copy numReducers from parent to child not the other way around. If hive makes a decent estimate of reducers for the parent, that's probably the number you want to carry into the combined reduce stage, because that means each reducer is doing the desired amount of work. Buckets and order by are the only special cases I can think of, where the number needs to be fixed.

          For those special cases without knowing the cardinalities of join/group by/tables, it's indeed difficult to guess if the optimization should be on or off. However, what do you think of using a max ratio of parent reducers/child reducers instead of a fixed minimum number of reducers for the child? With a default of 4 maybe. I.e.: If there are less than 4 times as many reducers in the parent than in the child collapse (assuming another job will be more expensive than the lower number of reducers), else leave it alone. The optimization is only good if the input sizes of the child and parent reducers are similar and expressing this as a ratio of number of reducers is probably the closest we can get right now.

          This would enable the optimization for a larger body of queries (small tables, single input split, empty group by expr, etc).

          Show
          Gunther Hagleitner added a comment - Navis : I think in general the logic should be to copy numReducers from parent to child not the other way around. If hive makes a decent estimate of reducers for the parent, that's probably the number you want to carry into the combined reduce stage, because that means each reducer is doing the desired amount of work. Buckets and order by are the only special cases I can think of, where the number needs to be fixed. For those special cases without knowing the cardinalities of join/group by/tables, it's indeed difficult to guess if the optimization should be on or off. However, what do you think of using a max ratio of parent reducers/child reducers instead of a fixed minimum number of reducers for the child? With a default of 4 maybe. I.e.: If there are less than 4 times as many reducers in the parent than in the child collapse (assuming another job will be more expensive than the lower number of reducers), else leave it alone. The optimization is only good if the input sizes of the child and parent reducers are similar and expressing this as a ratio of number of reducers is probably the closest we can get right now. This would enable the optimization for a larger body of queries (small tables, single input split, empty group by expr, etc).
          Hide
          Navis added a comment -

          @Gunther Hagleitner: I also considered ratio thing, but number of reducers is calculated based on input size just before submitted to hadoop and cannot be known in optimizer layer.
          Except those special cases with order by and bucketing, number of reducers for both RS is -1. So generally speaking, it's safe.

          Show
          Navis added a comment - @Gunther Hagleitner: I also considered ratio thing, but number of reducers is calculated based on input size just before submitted to hadoop and cannot be known in optimizer layer. Except those special cases with order by and bucketing, number of reducers for both RS is -1. So generally speaking, it's safe.
          Hide
          Gunther Hagleitner added a comment -

          Ah, that's good info. Makes sense now. The patch is useful as is, but is the only way to actually optimize the groupby/orderby case to do the ratio thing as a conditional task? And if so would that be this or a follow up jira?

          Show
          Gunther Hagleitner added a comment - Ah, that's good info. Makes sense now. The patch is useful as is, but is the only way to actually optimize the groupby/orderby case to do the ratio thing as a conditional task? And if so would that be this or a follow up jira?
          Hide
          Phabricator added a comment -

          njain has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          A general question ???

          How does it work with hive.optimize.reducededuplication ?

          INLINE COMMENTS
          conf/hive-default.xml.template:1034 Sorry for joining late: Can you explain this more clearly ?

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - njain has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". A general question ??? How does it work with hive.optimize.reducededuplication ? INLINE COMMENTS conf/hive-default.xml.template:1034 Sorry for joining late: Can you explain this more clearly ? REVISION DETAIL https://reviews.facebook.net/D1209 To: JIRA, navis Cc: hagleitn, njain
          Hide
          Namit Jain added a comment -

          comments

          Show
          Namit Jain added a comment - comments
          Hide
          Phabricator added a comment -

          njain has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Do you think it might be a good idea to get HIVE-3972 first ?

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:99 Isn't it true that R1 and R2 will have the same cost for

          RS -> GBY --> anything --> RS ?

          If yes, how do you know which rule will be fired ?

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - njain has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Do you think it might be a good idea to get HIVE-3972 first ? INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:99 Isn't it true that R1 and R2 will have the same cost for RS -> GBY --> anything --> RS ? If yes, how do you know which rule will be fired ? REVISION DETAIL https://reviews.facebook.net/D1209 To: JIRA, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          hagleitn has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:138 HashSet?
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:251 I think the "number of reducers" story deserves more comments (similar to what you've explained on the jira)
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 I think if you just run this optimization after CommonJoinResolver everything should be fine. It will either already have converted joins to mapjoins and this optimization won't apply or you still have a regular join and you can merge it without worrying about missing out on a mapjoin conversion. You could still have the "sorted" flag to express intent, but there isn't any optimization that will pull the rug out under you at the moment. Am I missing something?

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - hagleitn has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:138 HashSet? ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:251 I think the "number of reducers" story deserves more comments (similar to what you've explained on the jira) ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 I think if you just run this optimization after CommonJoinResolver everything should be fine. It will either already have converted joins to mapjoins and this optimization won't apply or you still have a regular join and you can merge it without worrying about missing out on a mapjoin conversion. You could still have the "sorted" flag to express intent, but there isn't any optimization that will pull the rug out under you at the moment. Am I missing something? REVISION DETAIL https://reviews.facebook.net/D1209 To: JIRA, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          navis has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:138 ok.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 I wish I could but CommonJoinResolver is a physical optimizer, which means there is no RS-RS operator tree which could me merged on that stage.

          I'm thinking of disabling this optimization if user configured hive.auto.convert.join=true or hive.auto.convert.join.noconditionaltask=true.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:251 I'll add more explanations on hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:99 For rules with same cost, DefaultRuleDispatcher selects last one, something like this,

            if ((cost >= 0) && (cost <= minCost)) {
                minCost = cost;
                rule = r;
            }
            

          So R2 will be selected.
          conf/hive-default.xml.template:1034 It's commented on https://issues.apache.org/jira/browse/HIVE-2340?focusedCommentId=13568361&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13568361

          This optimization merges two RSs by moving key/parts/num-reducers of child-RS to parent-RS, which means if num-reducer of child-RS is fixed (order by or forced bucketing) and small, it can resulted to very slow, single MR. For preventing this, the configuration makes min threshold for applying this optimization. It's not good enough, but I cannot think of better idea.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:138 ok. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:787 I wish I could but CommonJoinResolver is a physical optimizer, which means there is no RS-RS operator tree which could me merged on that stage. I'm thinking of disabling this optimization if user configured hive.auto.convert.join=true or hive.auto.convert.join.noconditionaltask=true. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:251 I'll add more explanations on hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:99 For rules with same cost, DefaultRuleDispatcher selects last one, something like this, if ((cost >= 0) && (cost <= minCost)) { minCost = cost; rule = r; } So R2 will be selected. conf/hive-default.xml.template:1034 It's commented on https://issues.apache.org/jira/browse/HIVE-2340?focusedCommentId=13568361&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13568361 This optimization merges two RSs by moving key/parts/num-reducers of child-RS to parent-RS, which means if num-reducer of child-RS is fixed (order by or forced bucketing) and small, it can resulted to very slow, single MR. For preventing this, the configuration makes min threshold for applying this optimization. It's not good enough, but I cannot think of better idea. REVISION DETAIL https://reviews.facebook.net/D1209 To: JIRA, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Addressed comments
          Disabled JOIN-RS dedup if user specified any of "auto.convert.join"s

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1209?vs=27027&id=27315#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          ql/src/test/queries/clientpositive/join31.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/join31.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join2.q.xml

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Addressed comments Disabled JOIN-RS dedup if user specified any of "auto.convert.join"s Reviewers: JIRA REVISION DETAIL https://reviews.facebook.net/D1209 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1209?vs=27027&id=27315#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/groupby_distinct_samekey.q ql/src/test/queries/clientpositive/join31.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/join31.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join2.q.xml To: JIRA, navis Cc: hagleitn, njain
          Hide
          Gunther Hagleitner added a comment -

          Navis: Disabling the join merge when auto.convert is on should work.

          I ran unit tests for patch11 and for me these TestCliDriver tests are failing: auto_join13.q,auto_join26.q,infer_bucket_sort.q,join13.q,partition_wise_fileformat14.q

          Show
          Gunther Hagleitner added a comment - Navis : Disabling the join merge when auto.convert is on should work. I ran unit tests for patch11 and for me these TestCliDriver tests are failing: auto_join13.q,auto_join26.q,infer_bucket_sort.q,join13.q,partition_wise_fileformat14.q
          Hide
          Navis added a comment -

          Gunther Hagleitner: Thanks for your help.

          Tests are failed by modifications of ColumnPrunerProcFactory, which was based on very old version of it. After removing that all tests have passed except infer_bucket_sort.q, which is affected by this optimization. Disabling the optimization made it work, too.

          Show
          Navis added a comment - Gunther Hagleitner : Thanks for your help. Tests are failed by modifications of ColumnPrunerProcFactory, which was based on very old version of it. After removing that all tests have passed except infer_bucket_sort.q, which is affected by this optimization. Disabling the optimization made it work, too.
          Hide
          Gunther Hagleitner added a comment -

          Navis: Are you suggesting to just revert the changes in ColumnPrunerProcFactory? I've tried that, but when I do that reduce_deduplicate_extended.q is failing. I've started to debug into that problem - it seems backtracking fails at that point because it can't find the source of "KEY._col0" etc. Seems there should be entries for that in the colExprMap of the RS, but that's not there. Are you getting different results?

          Show
          Gunther Hagleitner added a comment - Navis : Are you suggesting to just revert the changes in ColumnPrunerProcFactory? I've tried that, but when I do that reduce_deduplicate_extended.q is failing. I've started to debug into that problem - it seems backtracking fails at that point because it can't find the source of "KEY._col0" etc. Seems there should be entries for that in the colExprMap of the RS, but that's not there. Are you getting different results?
          Hide
          Navis added a comment -

          Gunther Hagleitner: Yes, it was HIVE-2339 and I'm not sure that it's fixed. And.. recently added GBY related operations(multi-GBY, cube, etc) are not making mappings for key from start. So that should be fixed also.

          Show
          Navis added a comment - Gunther Hagleitner : Yes, it was HIVE-2339 and I'm not sure that it's fixed. And.. recently added GBY related operations(multi-GBY, cube, etc) are not making mappings for key from start. So that should be fixed also.
          Hide
          Gunther Hagleitner added a comment -

          Navis: Thanks that pointer helped. The column pruning did indeed not carry over the some of the columns in the colExprMap. HIVE-2339 is in, but it was missing handling of the "KEY.*" columns.

          I've also looked at infer_bucket_sort and see what you're saying. Seems ok to have additional sort columns/buckets as long as the ones that are explicitly asked for are there. I've updated the golden file for that test.

          Running full test suite on .12 now.

          Show
          Gunther Hagleitner added a comment - Navis : Thanks that pointer helped. The column pruning did indeed not carry over the some of the columns in the colExprMap. HIVE-2339 is in, but it was missing handling of the "KEY.*" columns. I've also looked at infer_bucket_sort and see what you're saying. Seems ok to have additional sort columns/buckets as long as the ones that are explicitly asked for are there. I've updated the golden file for that test. Running full test suite on .12 now.
          Hide
          Gunther Hagleitner added a comment -

          Clean bill of health on 12, except for incorrect golden files in TestParse_join2 and TestMinimrCliDriver_reduce_deduplicate. I've updated the golden files in patch .13.

          Show
          Gunther Hagleitner added a comment - Clean bill of health on 12, except for incorrect golden files in TestParse_join2 and TestMinimrCliDriver_reduce_deduplicate. I've updated the golden files in patch .13.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          1. Changed policy of creating new metadatas(colExprMap, etc) in ColumnPrunerProcFactory.pruneReduceSinkOperator()

          • Remove not retained values from RowResolver, colExprMap and schema (instead of creating new entities by adding retained values)
            2. Changed order of applying CP and PPD. Now PPD applies first and CP next (which was CP-PPD)
          • CP removes some expr mappings which was not yet propagated by PPD
          • Also removed pruning schema of FilterOperator, which seemed not right (It's not certain that TS will actually prune columns)
            3. Refactored to share same code base in ExprNodeDescUtils which was introduced by HIVE-2839

          Will run full test tonight

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1209?vs=27315&id=27669#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/infer_bucket_sort.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/input2.q.xml
          ql/src/test/results/compiler/plan/input3.q.xml
          ql/src/test/results/compiler/plan/join1.q.xml
          ql/src/test/results/compiler/plan/join2.q.xml
          ql/src/test/results/compiler/plan/join3.q.xml
          ql/src/test/results/compiler/plan/sample1.q.xml
          ql/src/test/results/compiler/plan/sample2.q.xml
          ql/src/test/results/compiler/plan/sample3.q.xml
          ql/src/test/results/compiler/plan/sample4.q.xml
          ql/src/test/results/compiler/plan/sample5.q.xml
          ql/src/test/results/compiler/plan/sample6.q.xml
          ql/src/test/results/compiler/plan/sample7.q.xml

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". 1. Changed policy of creating new metadatas(colExprMap, etc) in ColumnPrunerProcFactory.pruneReduceSinkOperator() Remove not retained values from RowResolver, colExprMap and schema (instead of creating new entities by adding retained values) 2. Changed order of applying CP and PPD. Now PPD applies first and CP next (which was CP-PPD) CP removes some expr mappings which was not yet propagated by PPD Also removed pruning schema of FilterOperator, which seemed not right (It's not certain that TS will actually prune columns) 3. Refactored to share same code base in ExprNodeDescUtils which was introduced by HIVE-2839 Will run full test tonight Reviewers: JIRA REVISION DETAIL https://reviews.facebook.net/D1209 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1209?vs=27315&id=27669#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/groupby_distinct_samekey.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/infer_bucket_sort.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/input2.q.xml ql/src/test/results/compiler/plan/input3.q.xml ql/src/test/results/compiler/plan/join1.q.xml ql/src/test/results/compiler/plan/join2.q.xml ql/src/test/results/compiler/plan/join3.q.xml ql/src/test/results/compiler/plan/sample1.q.xml ql/src/test/results/compiler/plan/sample2.q.xml ql/src/test/results/compiler/plan/sample3.q.xml ql/src/test/results/compiler/plan/sample4.q.xml ql/src/test/results/compiler/plan/sample5.q.xml ql/src/test/results/compiler/plan/sample6.q.xml ql/src/test/results/compiler/plan/sample7.q.xml To: JIRA, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Rolled back removing schema of FIL on CP optimization (made problem with some joins)

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1209?vs=27669&id=27729#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/auto_join19.q.out
          ql/src/test/results/clientpositive/auto_join9.q.out
          ql/src/test/results/clientpositive/bucketmapjoin1.q.out
          ql/src/test/results/clientpositive/bucketmapjoin_negative.q.out
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/filter_join_breaktask.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_auto_mult_tables.q.out
          ql/src/test/results/clientpositive/index_auto_mult_tables_compact.q.out
          ql/src/test/results/clientpositive/index_auto_self_join.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto_partitioned.q.out
          ql/src/test/results/clientpositive/index_bitmap_compression.q.out
          ql/src/test/results/clientpositive/infer_bucket_sort.q.out
          ql/src/test/results/clientpositive/input39_hadoop20.q.out
          ql/src/test/results/clientpositive/join38.q.out
          ql/src/test/results/clientpositive/join9.q.out
          ql/src/test/results/clientpositive/join_map_ppr.q.out
          ql/src/test/results/clientpositive/lateral_view_ppd.q.out
          ql/src/test/results/clientpositive/louter_join_ppr.q.out
          ql/src/test/results/clientpositive/noalias_subq1.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/ppd_repeated_alias.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/router_join_ppr.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/smb_mapjoin9.q.out
          ql/src/test/results/clientpositive/sort_merge_join_desc_1.q.out
          ql/src/test/results/clientpositive/sort_merge_join_desc_2.q.out
          ql/src/test/results/clientpositive/sort_merge_join_desc_3.q.out
          ql/src/test/results/clientpositive/sort_merge_join_desc_4.q.out
          ql/src/test/results/clientpositive/stats11.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/clientpositive/union26.q.out
          ql/src/test/results/clientpositive/union_view.q.out
          ql/src/test/results/compiler/plan/case_sensitivity.q.xml
          ql/src/test/results/compiler/plan/cast1.q.xml
          ql/src/test/results/compiler/plan/input1.q.xml
          ql/src/test/results/compiler/plan/input6.q.xml
          ql/src/test/results/compiler/plan/input9.q.xml
          ql/src/test/results/compiler/plan/input_part1.q.xml
          ql/src/test/results/compiler/plan/input_testxpath2.q.xml
          ql/src/test/results/compiler/plan/join1.q.xml
          ql/src/test/results/compiler/plan/join2.q.xml
          ql/src/test/results/compiler/plan/join3.q.xml
          ql/src/test/results/compiler/plan/join4.q.xml
          ql/src/test/results/compiler/plan/join5.q.xml
          ql/src/test/results/compiler/plan/join6.q.xml
          ql/src/test/results/compiler/plan/join7.q.xml
          ql/src/test/results/compiler/plan/join8.q.xml
          ql/src/test/results/compiler/plan/sample7.q.xml
          ql/src/test/results/compiler/plan/subq.q.xml
          ql/src/test/results/compiler/plan/udf1.q.xml
          ql/src/test/results/compiler/plan/union.q.xml

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Rolled back removing schema of FIL on CP optimization (made problem with some joins) Reviewers: JIRA REVISION DETAIL https://reviews.facebook.net/D1209 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1209?vs=27669&id=27729#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/groupby_distinct_samekey.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/auto_join19.q.out ql/src/test/results/clientpositive/auto_join9.q.out ql/src/test/results/clientpositive/bucketmapjoin1.q.out ql/src/test/results/clientpositive/bucketmapjoin_negative.q.out ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/filter_join_breaktask.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_auto_mult_tables.q.out ql/src/test/results/clientpositive/index_auto_mult_tables_compact.q.out ql/src/test/results/clientpositive/index_auto_self_join.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/index_bitmap_auto_partitioned.q.out ql/src/test/results/clientpositive/index_bitmap_compression.q.out ql/src/test/results/clientpositive/infer_bucket_sort.q.out ql/src/test/results/clientpositive/input39_hadoop20.q.out ql/src/test/results/clientpositive/join38.q.out ql/src/test/results/clientpositive/join9.q.out ql/src/test/results/clientpositive/join_map_ppr.q.out ql/src/test/results/clientpositive/lateral_view_ppd.q.out ql/src/test/results/clientpositive/louter_join_ppr.q.out ql/src/test/results/clientpositive/noalias_subq1.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/ppd_repeated_alias.q.out ql/src/test/results/clientpositive/reduce_deduplicate.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/router_join_ppr.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/smb_mapjoin9.q.out ql/src/test/results/clientpositive/sort_merge_join_desc_1.q.out ql/src/test/results/clientpositive/sort_merge_join_desc_2.q.out ql/src/test/results/clientpositive/sort_merge_join_desc_3.q.out ql/src/test/results/clientpositive/sort_merge_join_desc_4.q.out ql/src/test/results/clientpositive/stats11.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/clientpositive/union26.q.out ql/src/test/results/clientpositive/union_view.q.out ql/src/test/results/compiler/plan/case_sensitivity.q.xml ql/src/test/results/compiler/plan/cast1.q.xml ql/src/test/results/compiler/plan/input1.q.xml ql/src/test/results/compiler/plan/input6.q.xml ql/src/test/results/compiler/plan/input9.q.xml ql/src/test/results/compiler/plan/input_part1.q.xml ql/src/test/results/compiler/plan/input_testxpath2.q.xml ql/src/test/results/compiler/plan/join1.q.xml ql/src/test/results/compiler/plan/join2.q.xml ql/src/test/results/compiler/plan/join3.q.xml ql/src/test/results/compiler/plan/join4.q.xml ql/src/test/results/compiler/plan/join5.q.xml ql/src/test/results/compiler/plan/join6.q.xml ql/src/test/results/compiler/plan/join7.q.xml ql/src/test/results/compiler/plan/join8.q.xml ql/src/test/results/compiler/plan/sample7.q.xml ql/src/test/results/compiler/plan/subq.q.xml ql/src/test/results/compiler/plan/udf1.q.xml ql/src/test/results/compiler/plan/union.q.xml To: JIRA, navis Cc: hagleitn, njain
          Hide
          Navis added a comment -

          Passed all tests

          Show
          Navis added a comment - Passed all tests
          Hide
          Gunther Hagleitner added a comment -

          Navis: I've tried the latest patch (.13) and it doesn't apply cleanly anymore. I rebased it, but after doing so I got an NPE from auto_join19.q. Something must have changed in the meantime.

          I was however able to get the previous version of the patch to apply and work with just a minor change. Tests pass for me on that one. I'll upload, in case you want to take a look. Feel free to ignore.

          Show
          Gunther Hagleitner added a comment - Navis : I've tried the latest patch (.13) and it doesn't apply cleanly anymore. I rebased it, but after doing so I got an NPE from auto_join19.q. Something must have changed in the meantime. I was however able to get the previous version of the patch to apply and work with just a minor change. Tests pass for me on that one. I'll upload, in case you want to take a look. Feel free to ignore.
          Hide
          Gunther Hagleitner added a comment -

          As mentioned. Based on .12. This one applys and works for me.

          Show
          Gunther Hagleitner added a comment - As mentioned. Based on .12. This one applys and works for me.
          Hide
          Gunther Hagleitner added a comment -

          HIVE-2340.14.patch that is.

          Show
          Gunther Hagleitner added a comment - HIVE-2340 .14.patch that is.
          Hide
          Navis added a comment -

          Gunther Hagleitner: It seemed affected by HIVE-948, which is recently committed. Thank for your help. I'll check that.

          Show
          Navis added a comment - Gunther Hagleitner : It seemed affected by HIVE-948 , which is recently committed. Thank for your help. I'll check that.
          Hide
          Navis added a comment -

          Seemed heavily broken by recent changes.

          Show
          Navis added a comment - Seemed heavily broken by recent changes.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Rebased to trunk

          Reviewers: JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1209?vs=27729&id=29571#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/infer_bucket_sort.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join1.q.xml
          ql/src/test/results/compiler/plan/join2.q.xml
          ql/src/test/results/compiler/plan/join3.q.xml

          To: JIRA, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Rebased to trunk Reviewers: JIRA REVISION DETAIL https://reviews.facebook.net/D1209 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1209?vs=27729&id=29571#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/groupby_distinct_samekey.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/infer_bucket_sort.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join1.q.xml ql/src/test/results/compiler/plan/join2.q.xml ql/src/test/results/compiler/plan/join3.q.xml To: JIRA, navis Cc: hagleitn, njain
          Hide
          Gunther Hagleitner added a comment -

          Ran unit tests as well. All passed. Still going through the review.

          Show
          Gunther Hagleitner added a comment - Ran unit tests as well. All passed. Still going through the review.
          Hide
          Phabricator added a comment -

          hagleitn has accepted the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          BRANCH
          DPAL-592

          ARCANIST PROJECT
          hive

          To: JIRA, hagleitn, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - hagleitn has accepted the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". REVISION DETAIL https://reviews.facebook.net/D1209 BRANCH DPAL-592 ARCANIST PROJECT hive To: JIRA, hagleitn, navis Cc: hagleitn, njain
          Hide
          Gunther Hagleitner added a comment -

          Navis: The patch doesn't apply cleanly anymore: Culprit is HIVE-4186, but the fix is no longer actually needed with your changes in ReduceSinkDeDuplication.

          After rebasing it, I found it to fail "ptf.q" - a new testcase for windowing. The problem seems to be that something in the ptf code is hanging on to the schema/signature of the reducesink. Before your change we cloned it, now you're modifying it. When I added code to clone the signature it passed all windowing tests.

          Can you take a look?

          Show
          Gunther Hagleitner added a comment - Navis : The patch doesn't apply cleanly anymore: Culprit is HIVE-4186 , but the fix is no longer actually needed with your changes in ReduceSinkDeDuplication. After rebasing it, I found it to fail "ptf.q" - a new testcase for windowing. The problem seems to be that something in the ptf code is hanging on to the schema/signature of the reducesink. Before your change we cloned it, now you're modifying it. When I added code to clone the signature it passed all windowing tests. Can you take a look?
          Hide
          Ashutosh Chauhan added a comment -

          Would love to see this make it in 0.11. Navis / Gunther Hagleitner Let me know if its ready to go in.

          Show
          Ashutosh Chauhan added a comment - Would love to see this make it in 0.11. Navis / Gunther Hagleitner Let me know if its ready to go in.
          Hide
          Gunther Hagleitner added a comment -

          Navis: In case you want to take a look, HIVE-2340.14.rebased_and_schema_clone.patch applies cleanly and has the signature cloning code.

          Show
          Gunther Hagleitner added a comment - Navis : In case you want to take a look, HIVE-2340 .14.rebased_and_schema_clone.patch applies cleanly and has the signature cloning code.
          Hide
          Navis added a comment -

          Gunther Hagleitner Thanks. I'll check that.

          Show
          Navis added a comment - Gunther Hagleitner Thanks. I'll check that.
          Hide
          Navis added a comment -

          I think CP for PTF is broken. Especially for single sourced multi-insert query.

          Show
          Navis added a comment - I think CP for PTF is broken. Especially for single sourced multi-insert query.
          Hide
          Ashutosh Chauhan added a comment -

          Navis I am not sure which test case you have in mind, but I ran following test and it worked for me.

          hive> create table t1 (a1 int, b1 string); 
          hive> create table t2 (a1 int, b1 string);
          hive> from (select sum(i) over (), s from over10k) tt insert overwrite table t1 select * insert overwrite table t2 select * ;
          hive> select * from t1 limit 3;
          hive> select * from t2 limit 3; 
          

          I got expected results. If you run explain on that query you will see it has PTFOperator. You can check windowing_udaf.q to see the schema and data for over10k table. Let me know which query have in mind, for which CP for PTF may be broken.

          Show
          Ashutosh Chauhan added a comment - Navis I am not sure which test case you have in mind, but I ran following test and it worked for me. hive> create table t1 (a1 int , b1 string); hive> create table t2 (a1 int , b1 string); hive> from (select sum(i) over (), s from over10k) tt insert overwrite table t1 select * insert overwrite table t2 select * ; hive> select * from t1 limit 3; hive> select * from t2 limit 3; I got expected results. If you run explain on that query you will see it has PTFOperator. You can check windowing_udaf.q to see the schema and data for over10k table. Let me know which query have in mind, for which CP for PTF may be broken.
          Hide
          Navis added a comment -

          CP for RS is correctly modifies schema of RS, and it makes CP tests for PTF fail. But as suggested by Hagleitner, after making schema of RS intact even after CP, those tests are resulted to success. I believe there is some assumption on schema in CP for PTF.

          Show
          Navis added a comment - CP for RS is correctly modifies schema of RS, and it makes CP tests for PTF fail. But as suggested by Hagleitner, after making schema of RS intact even after CP, those tests are resulted to success. I believe there is some assumption on schema in CP for PTF.
          Hide
          Ashutosh Chauhan added a comment -

          In PTF handling we do keep a copy of schema in Desc object to be used later at runtime. This could be improved. Me and Harish are exploring that. But I think Gunther's fix gets us around that, so for this patch I would recommend to go ahead with Gunther's fix, since this is hanging on for more than 6 months and no point in delaying it further. So, Navis if you are on board I will like to get .14 patch in trunk.

          Show
          Ashutosh Chauhan added a comment - In PTF handling we do keep a copy of schema in Desc object to be used later at runtime. This could be improved. Me and Harish are exploring that. But I think Gunther's fix gets us around that, so for this patch I would recommend to go ahead with Gunther's fix, since this is hanging on for more than 6 months and no point in delaying it further. So, Navis if you are on board I will like to get .14 patch in trunk.
          Hide
          Navis added a comment -

          Got it. I've updated patch (added small comment on this). Thanks.

          Show
          Navis added a comment - Got it. I've updated patch (added small comment on this). Thanks.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          Keep schema of RS intact in CP

          Reviewers: hagleitn, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1209?vs=29571&id=31179#toc

          BRANCH
          DPAL-592

          ARCANIST PROJECT
          hive

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          ql/src/test/queries/clientpositive/auto_join26.q
          ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          ql/src/test/queries/clientpositive/reduce_deduplicate.q
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          ql/src/test/results/clientpositive/cluster.q.out
          ql/src/test/results/clientpositive/groupby2.q.out
          ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          ql/src/test/results/clientpositive/groupby_cube1.q.out
          ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          ql/src/test/results/clientpositive/groupby_rollup1.q.out
          ql/src/test/results/clientpositive/index_bitmap3.q.out
          ql/src/test/results/clientpositive/index_bitmap_auto.q.out
          ql/src/test/results/clientpositive/infer_bucket_sort.q.out
          ql/src/test/results/clientpositive/ppd2.q.out
          ql/src/test/results/clientpositive/ppd_gby_join.q.out
          ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          ql/src/test/results/clientpositive/semijoin.q.out
          ql/src/test/results/clientpositive/union24.q.out
          ql/src/test/results/compiler/plan/join1.q.xml
          ql/src/test/results/compiler/plan/join2.q.xml
          ql/src/test/results/compiler/plan/join3.q.xml

          To: JIRA, hagleitn, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". Keep schema of RS intact in CP Reviewers: hagleitn, JIRA REVISION DETAIL https://reviews.facebook.net/D1209 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1209?vs=29571&id=31179#toc BRANCH DPAL-592 ARCANIST PROJECT hive AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java ql/src/test/queries/clientpositive/auto_join26.q ql/src/test/queries/clientpositive/groupby_distinct_samekey.q ql/src/test/queries/clientpositive/reduce_deduplicate.q ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q ql/src/test/results/clientpositive/cluster.q.out ql/src/test/results/clientpositive/groupby2.q.out ql/src/test/results/clientpositive/groupby2_map_skew.q.out ql/src/test/results/clientpositive/groupby_cube1.q.out ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out ql/src/test/results/clientpositive/groupby_rollup1.q.out ql/src/test/results/clientpositive/index_bitmap3.q.out ql/src/test/results/clientpositive/index_bitmap_auto.q.out ql/src/test/results/clientpositive/infer_bucket_sort.q.out ql/src/test/results/clientpositive/ppd2.q.out ql/src/test/results/clientpositive/ppd_gby_join.q.out ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out ql/src/test/results/clientpositive/semijoin.q.out ql/src/test/results/clientpositive/union24.q.out ql/src/test/results/compiler/plan/join1.q.xml ql/src/test/results/compiler/plan/join2.q.xml ql/src/test/results/compiler/plan/join3.q.xml To: JIRA, hagleitn, navis Cc: hagleitn, njain
          Hide
          Ashutosh Chauhan added a comment -

          Cool. Running test on latest patch. Will commit if tests pass.

          Show
          Ashutosh Chauhan added a comment - Cool. Running test on latest patch. Will commit if tests pass.
          Hide
          Harish Butani added a comment -

          The problem with PTF is in plan generation. In SemanticAnalyzer::genReduceSinkPlanForWindowing line 10778, we are setting the RowSchema of the ReduceSinkOp that precedes a PTF to be the same one as the Op that precedes it. In the e.g. that was failing the preceding Op was another PTFOp. The fix is to set both the RowSchema and RowResolver by not pointing to the 'input' Op's structures(same issue in genPTFPlanForComponentQuery). Will fix in the separate Jira. Will add back 'schema.getSignature().remove(colInfo);' to ColumnPrunerProcFactory. Sorry about this.

          Show
          Harish Butani added a comment - The problem with PTF is in plan generation. In SemanticAnalyzer::genReduceSinkPlanForWindowing line 10778, we are setting the RowSchema of the ReduceSinkOp that precedes a PTF to be the same one as the Op that precedes it. In the e.g. that was failing the preceding Op was another PTFOp. The fix is to set both the RowSchema and RowResolver by not pointing to the 'input' Op's structures(same issue in genPTFPlanForComponentQuery). Will fix in the separate Jira. Will add back 'schema.getSignature().remove(colInfo);' to ColumnPrunerProcFactory. Sorry about this.
          Hide
          Ashutosh Chauhan added a comment -

          Following tests failed on latest patch:

          • TestCliDriver.auto_join26.q
          • TestCliDriver.index_bitmap3.q
          • TestCliDriver.index_bitmap_auto.q
          • TestCliDriver.infer_bucket_sort.q
          • TestCliDriver.ppd_gby_join.q
          • TestCliDriver.semijoin.q
          • TestCliDriver.union24.q
          • TestCliDriver.cluster.q
          • TestMinimrCliDriver.infer_bucket_sort_reducers_power_two.q
          • TestParse.join1
          • TestParse.join2
          • TestParse.join3
          Show
          Ashutosh Chauhan added a comment - Following tests failed on latest patch: TestCliDriver.auto_join26.q TestCliDriver.index_bitmap3.q TestCliDriver.index_bitmap_auto.q TestCliDriver.infer_bucket_sort.q TestCliDriver.ppd_gby_join.q TestCliDriver.semijoin.q TestCliDriver.union24.q TestCliDriver.cluster.q TestMinimrCliDriver.infer_bucket_sort_reducers_power_two.q TestParse.join1 TestParse.join2 TestParse.join3
          Hide
          Navis added a comment -

          When auto.convert.join is true(default for now), RSDedup cannot try to merge following RS, effectively disabling the code part. I think physical planning should be postponed to completion time of each task. I'll update test results.

          Show
          Navis added a comment - When auto.convert.join is true(default for now), RSDedup cannot try to merge following RS, effectively disabling the code part. I think physical planning should be postponed to completion time of each task. I'll update test results.
          Hide
          Ashutosh Chauhan added a comment -

          Also, HIVE-4302 is resolved now, so workaround put in by Gunther should now longer be required.

          Show
          Ashutosh Chauhan added a comment - Also, HIVE-4302 is resolved now, so workaround put in by Gunther should now longer be required.
          Hide
          Navis added a comment -

          Phabricator seemed gone.

          Show
          Navis added a comment - Phabricator seemed gone.
          Hide
          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Gunther and Navis for this cool optimization!

          Show
          Ashutosh Chauhan added a comment - Committed to trunk. Thanks, Gunther and Navis for this cool optimization!
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #147 (See https://builds.apache.org/job/Hive-trunk-hadoop2/147/)
          HIVE-2340 : optimize orderby followed by a groupby (Navis via Ashutosh Chauhan) (Revision 1465721)

          Result = FAILURE
          hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1465721
          Files :

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/auto_join26.q
          • /hive/trunk/ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          • /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate.q
          • /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          • /hive/trunk/ql/src/test/results/clientpositive/groupby2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby_cube1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby_rollup1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/infer_bucket_sort.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          • /hive/trunk/ql/src/test/results/compiler/plan/join1.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/join2.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/join3.q.xml
          Show
          Hudson added a comment - Integrated in Hive-trunk-hadoop2 #147 (See https://builds.apache.org/job/Hive-trunk-hadoop2/147/ ) HIVE-2340 : optimize orderby followed by a groupby (Navis via Ashutosh Chauhan) (Revision 1465721) Result = FAILURE hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1465721 Files : /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java /hive/trunk/conf/hive-default.xml.template /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java /hive/trunk/ql/src/test/queries/clientpositive/auto_join26.q /hive/trunk/ql/src/test/queries/clientpositive/groupby_distinct_samekey.q /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate.q /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q /hive/trunk/ql/src/test/results/clientpositive/groupby2.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby2_map_skew.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby_cube1.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby_rollup1.q.out /hive/trunk/ql/src/test/results/clientpositive/infer_bucket_sort.q.out /hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out /hive/trunk/ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out /hive/trunk/ql/src/test/results/compiler/plan/join1.q.xml /hive/trunk/ql/src/test/results/compiler/plan/join2.q.xml /hive/trunk/ql/src/test/results/compiler/plan/join3.q.xml
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #2053 (See https://builds.apache.org/job/Hive-trunk-h0.21/2053/)
          HIVE-2340 : optimize orderby followed by a groupby (Navis via Ashutosh Chauhan) (Revision 1465721)

          Result = FAILURE
          hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1465721
          Files :

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/auto_join26.q
          • /hive/trunk/ql/src/test/queries/clientpositive/groupby_distinct_samekey.q
          • /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate.q
          • /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
          • /hive/trunk/ql/src/test/results/clientpositive/groupby2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby2_map_skew.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby_cube1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/groupby_rollup1.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/infer_bucket_sort.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
          • /hive/trunk/ql/src/test/results/compiler/plan/join1.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/join2.q.xml
          • /hive/trunk/ql/src/test/results/compiler/plan/join3.q.xml
          Show
          Hudson added a comment - Integrated in Hive-trunk-h0.21 #2053 (See https://builds.apache.org/job/Hive-trunk-h0.21/2053/ ) HIVE-2340 : optimize orderby followed by a groupby (Navis via Ashutosh Chauhan) (Revision 1465721) Result = FAILURE hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1465721 Files : /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java /hive/trunk/conf/hive-default.xml.template /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinResolver.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SkewJoinProcFactory.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java /hive/trunk/ql/src/test/queries/clientpositive/auto_join26.q /hive/trunk/ql/src/test/queries/clientpositive/groupby_distinct_samekey.q /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate.q /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q /hive/trunk/ql/src/test/results/clientpositive/groupby2.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby2_map_skew.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby_cube1.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby_distinct_samekey.q.out /hive/trunk/ql/src/test/results/clientpositive/groupby_rollup1.q.out /hive/trunk/ql/src/test/results/clientpositive/infer_bucket_sort.q.out /hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out /hive/trunk/ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out /hive/trunk/ql/src/test/results/compiler/plan/join1.q.xml /hive/trunk/ql/src/test/results/compiler/plan/join2.q.xml /hive/trunk/ql/src/test/results/compiler/plan/join3.q.xml
          Hide
          Phabricator added a comment -

          njain has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:181 nit: spelling Abstract
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:103 The order in which the rules are specified matter, since in case of exact match for costs,
          the last rule is invoked.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:122 What are the semantics of trustScript ?
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:359 can you add more comments ?

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          BRANCH
          DPAL-592

          ARCANIST PROJECT
          hive

          To: JIRA, hagleitn, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - njain has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:181 nit: spelling Abstract ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:103 The order in which the rules are specified matter, since in case of exact match for costs, the last rule is invoked. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:122 What are the semantics of trustScript ? ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:359 can you add more comments ? REVISION DETAIL https://reviews.facebook.net/D1209 BRANCH DPAL-592 ARCANIST PROJECT hive To: JIRA, hagleitn, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          navis has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:122 When ScriptOperator exists between RSs, it might possible to dedup only if the script does not change schema, order of rows and values of the RS related columns. It seemed added for that case by He Yongqiang, initial developer of this optimizer.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:103 Added comments
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:359 ok.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:181 done.

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          BRANCH
          DPAL-592

          ARCANIST PROJECT
          hive

          To: JIRA, hagleitn, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - navis has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:122 When ScriptOperator exists between RSs, it might possible to dedup only if the script does not change schema, order of rows and values of the RS related columns. It seemed added for that case by He Yongqiang, initial developer of this optimizer. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:103 Added comments ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:359 ok. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:181 done. REVISION DETAIL https://reviews.facebook.net/D1209 BRANCH DPAL-592 ARCANIST PROJECT hive To: JIRA, hagleitn, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          njain has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q:5 can you add some comments in this test ???

          RS + GBY + RS is optimized to remove the last RS etc.
          before the case

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          BRANCH
          DPAL-592

          ARCANIST PROJECT
          hive

          To: JIRA, hagleitn, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - njain has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q:5 can you add some comments in this test ??? RS + GBY + RS is optimized to remove the last RS etc. before the case REVISION DETAIL https://reviews.facebook.net/D1209 BRANCH DPAL-592 ARCANIST PROJECT hive To: JIRA, hagleitn, navis Cc: hagleitn, njain
          Hide
          Phabricator added a comment -

          njain has commented on the revision "HIVE-2340 [jira] optimize orderby followed by a groupby".

          INLINE COMMENTS
          ql/src/test/queries/clientpositive/reduce_deduplicate.q:26 can you add more comments here ??

          who sets trustScript ?
          As a user, how can i do that
          What is supposed to happen.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:124 Can you add more comments here ?

          Is minReducer only set for order By ?

          REVISION DETAIL
          https://reviews.facebook.net/D1209

          BRANCH
          DPAL-592

          ARCANIST PROJECT
          hive

          To: JIRA, hagleitn, navis
          Cc: hagleitn, njain

          Show
          Phabricator added a comment - njain has commented on the revision " HIVE-2340 [jira] optimize orderby followed by a groupby". INLINE COMMENTS ql/src/test/queries/clientpositive/reduce_deduplicate.q:26 can you add more comments here ?? who sets trustScript ? As a user, how can i do that What is supposed to happen. ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java:124 Can you add more comments here ? Is minReducer only set for order By ? REVISION DETAIL https://reviews.facebook.net/D1209 BRANCH DPAL-592 ARCANIST PROJECT hive To: JIRA, hagleitn, navis Cc: hagleitn, njain

            People

            • Assignee:
              Navis
              Reporter:
              Navis
            • Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development