Hive / HIVE-3562

Some limit can be pushed down to map stage

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

      Description

      Queries with a LIMIT clause (with a reasonably small limit), for example

      select * from src order by key limit 10;
      

      produce the operator tree
      TS-SEL-RS-EXT-LIMIT-FS

      But the LIMIT can be partially computed in the RS (ReduceSink), reducing the size of the shuffle:
      TS-SEL-RS(TOP-N)-EXT-LIMIT-FS
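
      As a concrete illustration of "partially computed in RS", here is a minimal, hypothetical sketch (not the actual patch) of a bounded top-N gate that a ReduceSink-like operator could consult before emitting each row to the shuffle. All names are illustrative.

      import java.util.Comparator;
      import java.util.PriorityQueue;

      /** Illustrative only: retains the N best (smallest) keys seen so far. */
      public class TopNGate {
        private final int n;
        private final Comparator<byte[]> cmp;
        // Max-heap under cmp: the head is the worst key currently retained.
        private final PriorityQueue<byte[]> worstFirst;

        public TopNGate(int n, Comparator<byte[]> cmp) {
          this.n = n;
          this.cmp = cmp;
          this.worstFirst = new PriorityQueue<>(n, cmp.reversed());
        }

        /** Returns true if a row with this key can still reach the final top N. */
        public boolean accept(byte[] key) {
          if (worstFirst.size() < n) {
            worstFirst.offer(key.clone());
            return true;
          }
          if (cmp.compare(key, worstFirst.peek()) >= 0) {
            return false;               // provably outside the top N: drop before the shuffle
          }
          worstFirst.poll();            // evict the current worst key
          worstFirst.offer(key.clone());
          return true;
        }
      }

      Each row then costs O(log N) on the map side; the first comment below shows the effect on shuffle volume (500 records / 9312 bytes down to 53 records / 981 bytes).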

      Attachments

      1. HIVE-3562.D5967.1.patch (19 kB, Phabricator)
      2. HIVE-3562.D5967.2.patch (43 kB, Phabricator)
      3. HIVE-3562.D5967.3.patch (66 kB, Phabricator)
      4. HIVE-3562.D5967.4.patch (83 kB, Phabricator)
      5. HIVE-3562.D5967.5.patch (91 kB, Phabricator)
      6. HIVE-3562.D5967.6.patch (86 kB, Phabricator)
      7. HIVE-3562.D5967.7.patch (103 kB, Phabricator)
      8. HIVE-3562.D5967.8.patch (105 kB, Phabricator)
      9. HIVE-3562.D5967.9.patch (104 kB, Phabricator)


          Activity

          Navis added a comment -

          Without pushdown:
          Map output bytes: 9312
          Map output records: 500

          With pushdown:
          Map output bytes: 981
          Map output records: 53

          This can be optimized further.

          Phabricator added a comment -

          navis requested code review of "HIVE-3562 [jira] Some limit can be pushed down to map stage".
          Reviewers: JIRA

          DPAL-1910 Some limit can be pushed down to map stage

          Queries with limit clause (with reasonable number), for example

          select * from src order by key limit 10;

          makes operator tree,
          TS-SEL-RS-EXT-LIMIT-FS

          But LIMIT can be partially calculated in RS, reducing size of shuffling.
          TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out

          MANAGE HERALD DIFFERENTIAL RULES
          https://reviews.facebook.net/herald/view/differential/

          WHY DID I GET THIS EMAIL?
          https://reviews.facebook.net/herald/transcript/14199/

          To: JIRA, navis

          Sivaramakrishnan Narayanan added a comment -

          I'm interested in this particular optimization. Let's say the table src has N rows and we're interested in the top K. If the rows in src are in nearly descending order and we're interested in the ascending top K (very likely when ordering by timestamps), then the number of memcopies will be N * K. See this code fragment:

          +    public boolean isTopN(byte[] key) {
          +      int index = Arrays.binarySearch(keys, key, C);
          +      index = index < 0 ? -index -1 : index;
          +      if (index >= keys.length - 1) {
          +        return false;
          +      }
          +      System.arraycopy(keys, index, keys, index + 1, keys.length - index - 1);
          +      keys[index] = Arrays.copyOf(key, key.length);
          +      return true;
          +    }
          +  }
          

          You could use a linked list, but binary search is not an option in that case.

          An alternate approach to the problem is to use a combination of Hive and Hadoop changes.

          Hadoop change:

          • New parameter map.sort.limitrecords which determines how many records each mapper in a job will send to every reducer
          • When writing out local files after sorting, map-task stops after map.sort.limitrecords records for each reducer
          • Effectively, each mapper sends out its top-K records

          Hive change:

          • Determining when the Top-K optimization is applicable and setting K in ReduceSinkDesc
          • Passing the K value along to MapredWork
          • ExecDriver sets map.sort.limitrecords before executing the job corresponding to the MapredWork

          This change will reduce the amount of I/O that happens on the map side (writing only 10 rows per reducer as opposed to the entire table) and can have a big effect on performance. Furthermore, it is possible to make the map-side sort a top-K sort, which can further improve performance - but the big win is really the I/O savings. In my experiments, I see a 5x performance improvement for such queries.
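
          A sketch of the Hive-side wiring under this proposal (hypothetical: map.sort.limitrecords was never added to Hadoop, and the class and accessor names here are made up for illustration):

          import org.apache.hadoop.mapred.JobConf;

          /** Hypothetical glue, not real Hive or Hadoop code. */
          public final class TopKJobConfigurer {
            // ExecDriver would call this before submitting the job for a MapredWork.
            public static void configureTopK(JobConf job, int topN) {
              if (topN >= 0) {
                job.setInt("map.sort.limitrecords", topN);  // proposed knob, never added to Hadoop
              }
            }
          }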

          Sivaramakrishnan Narayanan added a comment -

          Apologies - you can use a heap to maintain a top-K, as opposed to an array or a linked list.

          You may also want to consider the case where the top-k do not fit in memory. One possibility would be to employ this optimization only if K is less than some threshold.

          This approach has the advantage that it is a Hive-only change and does not depend on a Hadoop change. That is a pretty big plus.
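
          A sketch of that threshold guard (hypothetical glue; the parameter name hive.limit.pushdown.heap.threshold and the 10 MB default come from later comments in this thread, where the threshold is a memory size rather than a row count):

          // Enable the pushdown only when the retained top-K is expected to fit
          // under the memory threshold; avgKeyBytes is an illustrative estimate.
          static boolean pushdownEnabled(org.apache.hadoop.conf.Configuration conf,
                                         int k, int avgKeyBytes) {
            long threshold = conf.getLong("hive.limit.pushdown.heap.threshold", 10L * 1024 * 1024);
            return k >= 0 && (long) k * avgKeyBytes <= threshold;
          }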

          Namit Jain added a comment -

          I think it would be simpler to keep it a Hive-only change.
          I agree that the Hadoop change is more general, but it may be painful/time-consuming to backport.
          In most cases, k would be small, and it makes perfect sense to put a limit on k, as you suggested above.

          What do you think?
          Should we go with the Hive-only change?

          Sivaramakrishnan Narayanan added a comment -

          Hive-only change is fine. A heap-based isTopN() implementation will be good.

          Namit Jain added a comment -

          Cool, can you also review?
          I will start reviewing too.

          Namit Jain added a comment -

          comments on phabricator

          Phabricator added a comment -

          njain has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          As per the JIRA comments, can you add a new parameter for the limit on k?

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:73 Instead of checking for instanceof, can you add a method in Operator() - something like
          canLimitBePushed() with a lot of comments - and then the above operators can set it to true.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:358 Can you add a heap-based implementation as suggested in the JIRA?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:86 Can you add a lot of comments here - when is this set?
          What queries will it benefit, etc.?
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:460 Add these to hive-default.xml.template.
          ql/src/test/queries/clientpositive/limit_pushdown.q:7 Can you add another positive/negative test?

          explain select value, sum(key) from src group by value limit 10;

          The limit should be used in the 2nd MR job if you are inserting into a table.
          ql/src/test/queries/clientpositive/limit_pushdown.q:10 For the 2 negative queries, can you insert into a table also -
          then the optimization should help.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, navis
          Cc: njain

          Phabricator added a comment -

          tarball has requested changes to the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Not sure if I'm following the right protocol here. Marking this as "Request Changes" per my comments.

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:460 Minor suggestion, feel free to ignore: prefer the parameter name "hive.limit.twophase.enable".
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:129 Why not limit 0 (unless Hive optimizes those away)?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:305 Is this variable not used at all?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:323 This code seems error-prone. It has a side effect that's difficult to understand - it is not obvious that the caller must wipe out the value before the next row is processed. Any particular reason we're caching it at all?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 If o1 and o2 are null, shouldn't this return 0?

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          DPAL-1910

          To: JIRA, tarball, navis
          Cc: njain

          Sivaramakrishnan Narayanan added a comment -

          Added my comments to Phabricator. I'm new to these tools and the process, so please forgive any mistakes on my part.

          Good stuff, Navis!

          Navis added a comment -

          I just wanted to start small, without affecting the flow across operators, but OK, I'll do it.

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".
          Reviewers: JIRA, tarball

          Addressed some comments (submitting for fail-fast review)

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          tarball has requested changes to the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          INLINE COMMENTS
          conf/hive-default.xml.template:1407 Why is the optimization disabled by default? This is good stuff and should be switched on!
          conf/hive-default.xml.template:1413 10 million seems like a really large threshold. Maybe in the 50k range?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:378 The current implementation doesn't look like a heap to me. Why not simply use java.util.PriorityQueue?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:383 Shouldn't nulls be equal to each other?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:379 Better name? TopNHeap?

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          DPAL-1910

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Is this the right direction?

          INLINE COMMENTS
          conf/hive-default.xml.template:1407 It's a convention: when this patch is believed to be fully stable, it will be enabled by default.
          conf/hive-default.xml.template:1413 It's a size in memory, not a number of rows. 10M seemed conservative enough.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:378 I just wanted to keep it as compact as possible because this can be used with map aggregation. I'm thinking of an interface which can be implemented by a user and configured to be used for this.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:383 In RS, o2 is the key and cannot be null; also, making this class support null keys needs some more work. I just ignored that case, but yes, some comments would be useful at the least.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          DPAL-1910

          To: JIRA, tarball, navis
          Cc: njain

          Namit Jain added a comment -

          comments on phabricator

          Phabricator added a comment -

          njain has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          The general direction looks OK

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:79 TODO
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:45 spelling: operator
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:79 Followup: add a new method in Operator.
          ql/src/test/queries/clientpositive/limit_pushdown.q:26 Looks like this optimization should also help if the limit is in a sub-query.
          Can you add a test?

          something like:

          select .. from
          (select key, count(1) from src group by key order by key limit 2) subq
          join
          (select key, count(1) from src group by key order by key limit 2) subq2 ..

          The optimization should be applied to both sub-queries.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          DPAL-1910

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".
          Reviewers: JIRA, tarball

          Addressed comments
          Prevent multi-GBY single-RS case

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out

          To: JIRA, tarball, navis
          Cc: njain

          Sivaramakrishnan Narayanan added a comment -

          Looks good to me. +1

          Phabricator added a comment -

          tarball has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Looks good to me. +1

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          njain has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          INLINE COMMENTS
          conf/hive-default.xml.template:1434 Can you add more details here - an example query would really help?
          ql/src/test/queries/clientpositive/limit_pushdown.q:16 What is so special about 40?

          Setting hive.limit.pushdown.heap.threshold explicitly at the beginning of the test makes the
          test easier to maintain in the long run.

          ql/src/test/queries/clientpositive/limit_pushdown.q:34 What is the difference between this and line 3?

          ql/src/test/queries/clientpositive/limit_pushdown.q:10 I think this plan is not correct.

          Let us say, the values are
          v1
          v2
          ..
          v10
          v11
          v12
          ..
          v20

          The first mapper does not have v8-10, so it emits v1-v7, v11-v13
          The second mapper contains data for all values, but it only emits v1-v10

          Since it does not involve an ORDER BY, it is possible that v11 will get picked up, even though the second mapper's data for it is missing. If you are pushing the limit to the map side, you should create an additional MR job which orders the rows - in the above example, making sure that only v1-v10 are picked up.

          Am I missing something here?

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain

          Namit Jain added a comment -

          comments

          Phabricator added a comment -

          njain has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:75 remove the TODO
          ql/src/test/queries/clientpositive/limit_pushdown.q:51 There is no test where the limit is > hive.limit.pushdown.heap.threshold.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:87 Do you want to compare the threshold with the actual limit here?

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          njain has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Sorry, my earlier comments assumed that the threshold was a number of rows.

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:483 Coming back to an earlier comment from Sivaramakrishnan Narayanan, would it be simpler if this were a number of rows?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:414 Define 40 as a constant somewhere

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          njain has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          I thought about it; even with group-bys, my question is still valid.
          I think there is a bug.

          Do you think it would be simpler to allocate a heap with (up to) topN entries instead - throw
          the memory threshold out? If limit < threshold, use this optimization; otherwise just ignore
          it.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:441 Isn't there a bug here?

          You are using keyValues' last entry to figure out whether it needs to be expanded or not.
          It may have an issue at the boundary - say at the 40th entry, when a legitimate entry is found.
          It might be simpler to pass a flag indicating whether the entry was found or not.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:468 This is not true if an entry is being inserted in between.
          I mean, if topN is 100 and we already have 100 entries:

          if we are inserting the 50th entry, we should not be increasing usage.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          I think I'm wrong about the GBY cases. It should definitely be fixed (next week, maybe).

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain

          Phabricator added a comment -

          navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          I'm also confused about the GBY cases. I mean a group-by query with no ordering clause, for example:

          select key,count(value) from src group by key limit 10;

          I think top-K is still applicable. If the keys range from K1 to K100, all values for K1~K10 from every mapper are transferred to the reducer, which makes the result valid. Doesn't it?

          And on changing the threshold from memory to row count: map aggregation, which uses a memory percentage, would be a good precedent. I also think the current implementation is too clumsy, but the general direction is right.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, navis
          Cc: njain
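
          Navis's argument can be illustrated with a bounded map over distinct group keys, in the spirit of the "Map for GROUP BY" variant that appears in the next revision (a hypothetical sketch; names are illustrative, not the actual TopNHash code). Once a key is retained, every later row for that key passes through, so the reducer receives complete groups for the winning keys:

          import java.util.Comparator;
          import java.util.TreeMap;

          /** Illustrative only: admit rows for at most N distinct group keys. */
          public class TopNKeySet {
            private final int n;
            private final TreeMap<byte[], Boolean> keys;

            public TopNKeySet(int n, Comparator<byte[]> cmp) {
              this.n = n;
              this.keys = new TreeMap<>(cmp);
            }

            public boolean accept(byte[] key) {
              if (keys.containsKey(key)) {
                return true;                  // known group: always forward the row
              }
              if (keys.size() < n) {
                keys.put(key.clone(), Boolean.TRUE);
                return true;
              }
              if (keys.comparator().compare(key, keys.lastKey()) >= 0) {
                return false;                 // worse than the N-th best group key
              }
              keys.pollLastEntry();           // evict the worst group
              keys.put(key.clone(), Boolean.TRUE);
              return true;
            }
          }

          Rows already shuffled for a key that is later evicted should simply be cut by the LIMIT on the reduce side, since the reducer sees keys in sorted order.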

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          1. Used Heap for ORDER BY and Map for GROUP BY
          2. Added tests for spill/break
          3. Changed to use percentage for memory threshold

          Reviewers: tarball, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D5967?vs=24861&id=30483#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/build.xml
          ql/ivy.xml
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHashForGBY.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out

          To: JIRA, tarball, navis
          Cc: njain

          Navis added a comment -

          Passed all tests

          Namit Jain added a comment -

          Can you refresh?

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Refactoring & bug fix

          Reviewers: tarball, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D5967?vs=30483&id=32943#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/build.xml
          ql/ivy.xml
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/join_vc.q
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/results/clientpositive/join_vc.q.out
          ql/src/test/results/clientpositive/limit_pushdown.q.out

          To: JIRA, tarball, navis
          Cc: njain

          Gopal V added a comment -

          I have implemented a similar fix as a Combiner in MR (LimitNKeys and LimitNValues) instead of using an in-memory hash/heap.

          This seems a lot more memory-friendly since it doesn't need any extra memory, but it doesn't help the speed of the sort operations, as the data-size reduction happens post-sort.

          And since it works for any Writable, it only needs a simple single-class implementation (no comparators within the combiner).

          The combiner is only run if there are multiple spills, but if it is forced by setting "min.num.spills.for.combine" = 0, then we can plug in a top-k (unique keys/values) selection sort via the "map.sort.class" config instead of the default QuickSort implementation, without any change to Hadoop core at all.
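
          A minimal sketch of such a limiting combiner, using the org.apache.hadoop.mapreduce API (the class name echoes the description above, but the code and the configuration key are illustrative, not the actual patch - which later lands in HIVE-5093, below):

          import java.io.IOException;
          import org.apache.hadoop.io.BytesWritable;
          import org.apache.hadoop.mapreduce.Reducer;

          /** Illustrative only: pass through at most N keys per combiner run. */
          public class LimitNKeysCombiner
              extends Reducer<BytesWritable, BytesWritable, BytesWritable, BytesWritable> {

            private int remaining;

            @Override
            protected void setup(Context context) {
              remaining = context.getConfiguration().getInt("illustrative.limit.n", 100);
            }

            @Override
            protected void reduce(BytesWritable key, Iterable<BytesWritable> values,
                Context context) throws IOException, InterruptedException {
              if (remaining-- <= 0) {
                return;                        // past the top N keys: drop everything else
              }
              // Combiner input arrives sorted by key, so the first N keys seen
              // are this run's best N.
              for (BytesWritable value : values) {
                context.write(key, value);
              }
            }
          }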

          Ashutosh Chauhan added a comment -

          Gopal V, your approach sounds interesting - a lot less intrusive and memory-friendly indeed. If you still have a patch around for this, would you like to post it?

          Hive QA added a comment -

          Overall: -1 no tests executed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12580643/HIVE-3562.D5967.5.patch

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/429/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/429/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Tests failed with: NonZeroExitCodeException: Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n '' ]]
          + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
          + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
          + cd /data/hive-ptest/working/
          + tee /data/hive-ptest/logs/PreCommit-HIVE-Build-429/source-prep.txt
          + mkdir -p maven ivy
          + [[ svn = \s\v\n ]]
          + [[ -n '' ]]
          + [[ -d apache-svn-trunk-source ]]
          + [[ ! -d apache-svn-trunk-source/.svn ]]
          + [[ ! -d apache-svn-trunk-source ]]
          + cd apache-svn-trunk-source
          + svn revert -R .
          Reverted 'serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java'
          Reverted 'ql/src/test/results/compiler/plan/join2.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input2.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join3.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input3.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join4.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input4.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join5.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input5.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join6.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input_testxpath2.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input6.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join7.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input7.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input_testsequencefile.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input8.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join8.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/union.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input9.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/udf1.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/udf4.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input_testxpath.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/udf6.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input_part1.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/groupby1.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/udf_case.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/groupby2.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/subq.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/groupby3.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/groupby4.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/groupby5.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/groupby6.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/case_sensitivity.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/udf_when.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input20.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample1.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample2.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample3.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample4.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample5.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample6.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/sample7.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/cast1.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/join1.q.xml'
          Reverted 'ql/src/test/results/compiler/plan/input1.q.xml'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestIntegerCompressionReader.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitFieldReader.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRunLengthIntegerReader.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRunLengthByteReader.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitPack.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInStream.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java'
          Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/sarg/TestSearchArgumentImpl.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/BitFieldReader.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthByteReader.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/InStream.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/ReaderImpl.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java'
          ++ awk '{print $2}'
          ++ egrep -v '^X|^Performing status on external'
          ++ svn status --no-ignore
          + rm -rf build hcatalog/build hcatalog/core/build hcatalog/storage-handlers/hbase/build hcatalog/server-extensions/build hcatalog/webhcat/svr/build hcatalog/webhcat/java-client/build hcatalog/hcatalog-pig-adapter/build common/src/gen ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRecordReaderImpl.java
          + svn update
          
          Fetching external item into 'hcatalog/src/test/e2e/harness'
          External at revision 1513744.
          
          At revision 1513744.
          + patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
          + patchFilePath=/data/hive-ptest/working/scratch/build.patch
          + [[ -f /data/hive-ptest/working/scratch/build.patch ]]
          + chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
          + /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
          The patch does not appear to apply with p0 to p2
          + exit 1
          '
          

          This message is automatically generated.

          Gopal V added a comment -

          I have created HIVE-5093 to hold my limit combiner patch (rebased to trunk & turned off by default).

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Rebased to trunk

          Reviewers: tarball, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D5967?vs=32943&id=38379#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/build.xml
          ql/ivy.xml
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out

          To: JIRA, tarball, navis
          Cc: njain

          Hive QA added a comment -

          Overall: +1 all checks pass

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12598877/HIVE-3562.D5967.6.patch

          SUCCESS: +1 2886 tests passed

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/482/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/482/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          

          This message is automatically generated.

          Ashutosh Chauhan added a comment -

          As always, high quality work, Navis!
          Left a few implementation comments on Phabricator. The most important ones are:
          1. Is it possible to use one data structure (TreeSet, probably) for the two implementations of ReduceSinkHash? If so, that will simplify this code quite a bit (a sketch of the idea follows this comment).
          2. I am not sure how the RS corresponding to distinct will be handled with this optimization, so I requested test cases for it.

          Cool stuff. Let's get this in!

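          To make the data-structure question above concrete, here is a minimal, hypothetical sketch of the TreeSet approach; it is not the patch's actual ReduceSinkHash, and the class and method names are invented for illustration. Note that a TreeSet silently drops duplicate keys, a limitation that comes up later in this thread.

          import java.util.TreeSet;

          // Hypothetical TreeSet-based top-N filter (not the patch's ReduceSinkHash):
          // keeps the `limit` smallest keys seen so far.
          public class TreeSetTopN {
              private final int limit;
              private final TreeSet<String> topN = new TreeSet<String>();

              public TreeSetTopN(int limit) {
                  this.limit = limit;
              }

              // Returns true if the key belongs among the current top-N and should be
              // forwarded to the shuffle; duplicate keys are dropped by the TreeSet.
              public boolean offer(String key) {
                  if (topN.contains(key)) {
                      return false;                    // duplicate key: dropped
                  }
                  if (topN.size() < limit) {
                      return topN.add(key);            // still below the limit
                  }
                  if (key.compareTo(topN.last()) >= 0) {
                      return false;                    // not smaller than the cut-off
                  }
                  topN.pollLast();                     // evict the current largest key
                  return topN.add(key);
              }
          }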
          Phabricator added a comment -

          ashutoshc has requested changes to the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I think we don't need a boolean flag. If the user has set hive.limit.pushdown.memory.usage=0, it is pretty clear he wants the optimization disabled (a usage sketch follows this review).
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 I didn't fully follow the example here. What are the keys/values/rows here?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 select key from src limit 0; implies the user doesn't want any data, so RS doesn't need to emit anything; this optimization is still valid and super useful here, isn't it? So this should be limit < 0 instead of limit <= 0.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Can you add a comment on why you are subtracting limit * 64 in this calculation of the threshold?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:171 I think we need not special-handle limit = 0 by null'ing the collector. As I said above, it is better to make RSHash null in that case. This avoids a null check of the collector in processOp().
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 As I noted above, let's not have this null check. Apart from you null'ing the collector above, could there ever be a case when it is null? I don't think so; in that case let's remove this if-branch from the inner loop.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Should this instead be collect(keyWritable, value)?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I didn't get your comment about retry here. Can you add more comments for it?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 collect() instead of out.collect()?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 transient?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 It would be good to add a LOG.debug() here saying we are disabling ReduceSinkHash.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I didn't get your TODO here. Can you explain?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 Better name: remove?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Add a comment here saying values[index] could be null if flush() was called after this value was added but before remove() was called.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 You are using PriorityQueue almost like a sorted set. I think java.util.TreeSet will suffice for what you need here.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I don't see you making use of any feature of the Map interface. Looks like TreeSet would be sufficient here as well.

          If we can use TreeSet in both places, then we don't need these two different implementations of ReduceSinkHash. That will simplify the code quite a bit.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264 This handling of the distinct-keys case can also be refactored to use ReduceSinkHash. It would be a good idea to take this up in a follow-up jira.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 Also need to mark it transient.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:91 If we are able to use one data structure in ReduceSinkHash (either TreeSet or some other) for both cases, we won't need this boolean.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 How are you enforcing the condition of applying this optimization only to the last RS? Can you add comments about this?
          ql/src/test/queries/clientpositive/limit_pushdown.q:14 Can you also add tests for the following queries with this optimization on:
          explain select distinct(key) from src limit 20;
          explain select count(distinct(key)) from src group by key limit 20;

          select distinct(key) from src limit 20;
          select count(distinct(key)) from src group by key limit 20;
          ql/src/test/queries/clientpositive/limit_pushdown.q:43 It would be good to separate the queries where the optimization will not fire into a separate .q file.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          HIVE-3562

          ARCANIST PROJECT
          hive

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

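          As a usage note for the configuration discussed in the first inline comment above, a session might tune or disable the pushdown roughly like this. This is a sketch following the review discussion (where setting the value to 0 disables the optimization); the exact value semantics and defaults are those of the committed patch, not fixed here.

          -- assumed here to be the fraction of memory the RS top-N hash may use
          set hive.limit.pushdown.memory.usage=0.3;
          select * from src order by key limit 10;

          -- per the review discussion, 0 disables the optimization
          set hive.limit.pushdown.memory.usage=0;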
          Edward Capriolo added a comment -

          1. Is it possible to use one data structure (TreeSet, probably) for the two implementations of ReduceSinkHash? If so, that will simplify this code quite a bit.

          I think we are better off using the MinMaxPriorityQueue

          http://highscalability.com/blog/2013/5/22/strategy-stop-using-linked-lists.html

          Node-based structures are really inefficient in Java: the pointers bloat the structure, and the non-local memory access causes a lot of cache misses. I benchmarked a similar scenario with large sorted sets, and (believe it or not) even though the asymptotic complexity is no better, it is faster to fill a plain list and sort it than to maintain a TreeSet. MinMaxPriorityQueue is a middle-ground implementation which in real-world use works much faster than TreeSet. (A small sketch of the bounded-queue idea follows the benchmark below.)

          
          
          import java.util.ArrayList;
          import java.util.Collection;
          import java.util.Collections;
          import java.util.LinkedList;
          import java.util.List;
          import java.util.Random;
          import java.util.TreeSet;

          public class TreeSetTest {

              static Random r = new Random();

              public static void main(String[] args) {
                  // Interleave TreeSet runs with fill-then-sort list runs so that
                  // JIT warm-up affects both approaches.
                  doIt(500000, new TreeSet<Double>());
                  doIt(500000, new TreeSet<Double>());
                  doIt(500000, new TreeSet<Double>());
                  doIt(500000, new TreeSet<Double>());
                  {
                      List<Double> arrayList = new ArrayList<Double>();
                      doIt(500000, arrayList);
                      sortIt(arrayList);
                  }
                  {
                      List<Double> arrayList = new ArrayList<Double>();
                      doIt(500000, arrayList);
                      sortIt(arrayList);
                  }
                  {
                      List<Double> linkedList = new LinkedList<Double>();
                      doIt(500000, linkedList);
                      sortIt(linkedList);
                  }
                  {
                      List<Double> arrayList = new ArrayList<Double>();
                      doIt(500000, arrayList);
                      sortIt(arrayList);
                  }
                  {
                      List<Double> arrayList = new ArrayList<Double>();
                      doIt(500000, arrayList);
                      sortIt(arrayList);
                  }
                  doIt(500000, new TreeSet<Double>());
                  doIt(500000, new TreeSet<Double>());
                  {
                      List<Double> linkedList = new LinkedList<Double>();
                      doIt(500000, linkedList);
                      sortIt(linkedList);
                  }
              }

              // Time a full Collections.sort() over the filled list.
              public static void sortIt(List<Double> l) {
                  long start = System.currentTimeMillis();
                  Collections.sort(l);
                  long end = System.currentTimeMillis();
                  System.out.println(l.getClass().getName() + " sort it " + (end - start));
              }

              // Time n random insertions into the given collection.
              public static void doIt(int n, Collection<Double> l) {
                  long start = System.currentTimeMillis();
                  for (int i = 0; i < n; i++) {
                      randomDouble(l);
                  }
                  long end = System.currentTimeMillis();
                  System.out.println(l.getClass().getName() + " add it " + (end - start));
              }

              private static void randomDouble(Collection<Double> l) {
                  int rangeMin = 0;
                  int rangeMax = 20000000;
                  double randomValue = rangeMin + (rangeMax - rangeMin) * r.nextDouble();
                  l.add(randomValue);
              }
          }
          
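          For what it's worth, here is a minimal sketch of the MinMaxPriorityQueue idea, assuming Guava on the classpath; the class name and the choice of keys are invented for illustration, and this is neither the benchmark above nor the patch's code. Unlike a TreeSet, the bounded queue retains duplicate keys and exposes the current cut-off cheaply via peekLast().

          import com.google.common.collect.MinMaxPriorityQueue;

          // Illustrative only: a size-bounded queue that keeps the 3 smallest keys,
          // duplicates included, evicting the current maximum on overflow.
          public class MinMaxTopN {
              public static void main(String[] args) {
                  MinMaxPriorityQueue<Integer> topN = MinMaxPriorityQueue
                          .maximumSize(3)              // corresponds to LIMIT 3
                          .create();
                  for (int key : new int[] {40, 10, 30, 10, 70, 80}) {
                      topN.offer(key);                 // keys beyond the 3 smallest are dropped
                  }
                  // The largest retained key is the cut-off usable for early pruning.
                  System.out.println(topN.peekLast()); // prints 30 (retained: 10, 10, 30)
              }
          }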
          Ashutosh Chauhan added a comment - - edited

          Navis, can you also add a test case for the following query, with both this optimization and the reducesink dedup optimization on? This will trigger the case of a single merged RS serving both the group by and the order by.
          select value,avg(key) from src group by value order by value limit 20;

          Ashutosh Chauhan added a comment -

          Edward Capriolo, I don't care which DS you pick; I am looking for simplicity of implementation. The current patch has two implementations of RSHash which differ only in the choice of DS, and it appears to me that in both cases the underlying DS is used in a very similar fashion. If one DS can serve both (whichever it is), this will make the code simpler to read and understand, and less bug-prone.

          Edward Capriolo added a comment -

          Makes sense.

          Ashutosh Chauhan added a comment -

          Navis, it occurred to me that this optimization will become very powerful in combination with HIVE-4002. Imagine a case where there is a limit which can be pushed up in front of the last RS: the mappers will then output very little data, and with HIVE-4002 we can eliminate the reducer altogether. These two optimizations cannot occur simultaneously in their current form, though, since RSHash is implemented inside RS; we would need to reimplement RSHash in FS. An alternative approach could be to implement RSHash as an operator which can then be put in front of either RS or FS. What do you think?

          Yin Huai added a comment -

          Implementing RSHash as an operator sounds good. We could also use it in front of the RS of the right table of a left semi join; right now there is a GBY there, which is not straightforward in the query plan.

          Phabricator added a comment -

          navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Thanks for the review.

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I thought it was a Hive convention to provide separate configurations for enabling/disabling and for the threshold, which has been requested of me several times on other issues. I'll address that.
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Legend : A(a) --> key A, value a, row A(a)

          If each mapper takes rows like this
          MAP1(RS) : 40(a)-10(b)-30(c)-10(d)-70(e)-80(f)
          MAP2(RS) : 90(g)-80(h)-60(i)-40(j)-30(k)-20(l)
          MAP3(RS) : 40(m)-50-30(o)-30(p)-60(q)-70(r)

          REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50-60(i,q)-70(e,r)-80(f,h)-90(g)
          REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)

          with this patch

          MAP1 : 40(a)-10(b)-30(c)-10(d)
          MAP2 : 40(j)-30(k)-20(l)
          MAP3 : 40(m)-50-30(o)-30(p)

          REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50
          REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 I thought limit 0 is not realistic and would only be used for some kind of testing. I'll address that case, but creating the hash for it seemed a little much.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 ReduceSinkHash allocates arrays for keys/values/hashes. The subtraction compensates for those (not exact, but I like the number).
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 I think there was a test for an RS operator which has no collector. I'll check that.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Right. I missed that.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 It's possible to re-add a key/value to a hash which was flushed right before, but I just forwarded it.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 Right.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 It's not a serializable class, but I'll add the transient modifier.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 I'll leave a log message in the processOp() method.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 The return value can be false only for GBY; false means the key already exists in the RS hash.
          For the case A(a)->A(b)->B(c) with limit 5, the first A(a) will be stored into the hash. For the second A(b), value b should also be stored, which means values would have to be byte[][][], using an additional array per index and consuming more memory. But I suspected that case could be common with map aggregation, so I just decided to forward A(b) to the output collector.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 It does not remove anything; it is called after the index is removed from the RS hash.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Good.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 The main difference of HashForOrder from HashForGroupBy is that it must store duplicated keys. As described in the comments, A->B->B->C with limit 3 should store A,B,B, not A,B,C (a sketch follows this comment). This behavior could possibly be implemented with a simple TreeSet, but I thought this approach is simpler.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I prefer Map to Set, but I'll change that.
          ql/src/test/queries/clientpositive/limit_pushdown.q:14 OK.
          ql/src/test/queries/clientpositive/limit_pushdown.q:43 OK.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 That is not an exact description; it is the RS for the limit (not the last RS). I'll change this, too.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264 Looks promising. Needs more investigation into the implementation of distinct.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          HIVE-3562

          ARCANIST PROJECT
          hive

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

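          To illustrate the duplicate-key point from the reply above with a sketch (the patch's HashForOrder is implemented differently, and the class name here is invented): a max-heap bounded at the limit keeps A,B,B from the stream A->B->B->C with limit 3, whereas a set-based structure would keep A,B,C.

          import java.util.Collections;
          import java.util.PriorityQueue;

          // Illustration of duplicate-preserving top-N for ORDER BY (not the patch's
          // HashForOrder): a max-heap over the keys, bounded at `limit` entries.
          public class OrderByTopN {
              public static void main(String[] args) {
                  int limit = 3;
                  PriorityQueue<String> maxHeap =
                          new PriorityQueue<String>(limit + 1, Collections.<String>reverseOrder());
                  for (String key : new String[] {"A", "B", "B", "C"}) {
                      maxHeap.offer(key);
                      if (maxHeap.size() > limit) {
                          maxHeap.poll();              // evict the current largest; duplicates survive
                      }
                  }
                  // Retains A, B, B (heap iteration order is unspecified).
                  System.out.println(maxHeap);
              }
          }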
          Navis added a comment -

          I think RSHash should be an entity that can be used in various operators, not an independent operator, because the behavior might depend on each operator. In the current implementation I've modified RS only, but the better way to implement this would be to calculate the limit for each operator (bottom-up, like CP or PPD) and build the hash for each of them. (The limit for GBY should be handled in GBY; for a simple limit, in RS; etc.)

          There seems to remain much work to do.

          Ashutosh Chauhan added a comment -

          I agree that will be quite a bit of work, and given that this patch has been out there for a long time, we should try to get it in without major rework. Though I think a hash for FS would make this optimization, together with HIVE-4002, really cool.

          Navis added a comment -

          Then FS will need the context for limiting rows (limit, key expressions, etc.). I'll consider that, too.

          Ashutosh Chauhan added a comment -

          Correct. Thanks a lot for considering it. I am fine doing that in a separate jira; FSHashSink would expand the scope of this jira unnecessarily.

          Phabricator added a comment -

          ashutoshc has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Good stuff!

          INLINE COMMENTS
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 Yeah, I know historically we have done it that way, but in my experience having more configs always confuses users instead of helping them.
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Looks good. Can you update the patch with this?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Sounds good. Please add it as a comment.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 Sounds good.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I see. I think just forwarding in this case is the simpler and better thing to do, so let's leave it that way. Can you add a comment saying it's possible to retry adding this to the hash, but we don't do that yet? Also, collect().
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 Ah, right. Then better not to add transient; otherwise it will cause confusion.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I see. Can you add a comment here that further optimization like this is possible? We can do this later.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 OK.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 I see; I missed this difference regarding duplicates. My main motivation in trying to unify these two is so that the optimization rule need not worry about detecting whether there is a GBY in the mapper or not.
          If using TreeSet will not over-complicate this, I would suggest that; but if it complicates things too much, I am fine with the current implementation as well. I will let you decide.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 Here, using Set is definitely cleaner (and more memory-efficient) as far as I can see.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 Thanks.
          ql/src/test/queries/clientpositive/limit_pushdown.q:14 Thanks. Also, as I said in my comment, add
          select value,avg(key) from src group by value order by value limit 20;
          with the reducesink dedup optimization on.
          ql/src/test/queries/clientpositive/limit_pushdown.q:43 Thanks.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          HIVE-3562

          ARCANIST PROJECT
          hive

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Addressed some comments

          Reviewers: ashutoshc, JIRA, tarball

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D5967?vs=38379&id=39009#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/build.xml
          ql/ivy.xml
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out
          ql/src/test/results/clientpositive/limit_pushdown_negative.q.out

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Missed ASF header
          Removed unnecessary array creation in RS

          Reviewers: ashutoshc, JIRA, tarball

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D5967?vs=39009&id=39015#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/build.xml
          ql/ivy.xml
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out
          ql/src/test/results/clientpositive/limit_pushdown_negative.q.out

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

          Phabricator added a comment -

          ashutoshc has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Looks pretty good. Just requesting a few more comments.

          INLINE COMMENTS
          conf/hive-default.xml.template:1586-1590 We can remove this now.
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1186 Just to add more clarity, say something like: we can push the limit above the GBY (running in the reducer), since that will generate a single row for each group. This doesn't necessarily hold for a GBY running in the mappers, so we don't push the limit above it.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:182 It would be good to add a comment about what this field is holding. Add a comment saying: this two-dimensional array holds key data and a corresponding Union object which contains the tag identifying the aggregate expression for distinct columns.

          Ideally, instead of this 2-D array, we should probably have enhanced the HiveKey class for this logic. We should do that in a follow-up jira.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:267 I didn't follow this logic completely.
          It seems this is an optimization to avoid evaluating the union object repeatedly and doing a system copy for it. Can you add a comment explaining this?
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:271 It seems it will be null only for i = 0. If so, better to do an if (i == 0) check? Also add a comment on when this will be null and when it will be non-null.
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:260 Did you make changes to this section because you found a bug, or are you purely refactoring? If you hit upon a bug, can you explain what it was?
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:50-51 It would be good to add a comment about what these 2-D arrays are holding.
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:52 Also, add a comment saying this array holds the hash codes of the keys.

          Also, add a note that the indices of all these arrays must line up.
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:82 Nice comments!
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:34 It would be good to add javadoc for this class.
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:36 Also javadoc for this interface.

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

          Phabricator added a comment -

          navis updated the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          Addressed comments

          Reviewers: ashutoshc, JIRA, tarball

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D5967?vs=39015&id=39051#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/build.xml
          ql/ivy.xml
          ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          ql/src/test/queries/clientpositive/limit_pushdown.q
          ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          ql/src/test/results/clientpositive/limit_pushdown.q.out
          ql/src/test/results/clientpositive/limit_pushdown_negative.q.out

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

          Phabricator added a comment -

          ashutoshc has accepted the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

          +1

          REVISION DETAIL
          https://reviews.facebook.net/D5967

          BRANCH
          HIVE-3562

          ARCANIST PROJECT
          hive

          To: JIRA, tarball, ashutoshc, navis
          Cc: njain

          Navis added a comment -

I'm on vacation till next Tuesday. Feel free to modify this patch, thanks.

          Ashutosh Chauhan added a comment -

Committed to trunk. Thanks, Navis, for your persistence on this one!

          Gopal V added a comment -

Good work, Navis.

          Let me mark HIVE-5093 as obsoleted by this - no need for that hack anymore.

          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop2-ptest #74 (See https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/74/)
          HIVE-3562 : Some limit can be pushed down to map stage (Navis via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1518234)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/build.xml
          • /hive/trunk/ql/ivy.xml
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown.q
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown_negative.q.out
          Sivaramakrishnan Narayanan added a comment -

          Good stuff, Navis!

          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop1-ptest #142 (See https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/142/)
          HIVE-3562 : Some limit can be pushed down to map stage (Navis via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1518234)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/build.xml
          • /hive/trunk/ql/ivy.xml
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown.q
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown_negative.q.out
          Hudson added a comment -

          SUCCESS: Integrated in Hive-trunk-h0.21 #2295 (See https://builds.apache.org/job/Hive-trunk-h0.21/2295/)
          HIVE-3562 : Some limit can be pushed down to map stage (Navis via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1518234)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/build.xml
          • /hive/trunk/ql/ivy.xml
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown.q
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown_negative.q.out
          Hudson added a comment -

          ABORTED: Integrated in Hive-trunk-hadoop2 #387 (See https://builds.apache.org/job/Hive-trunk-hadoop2/387/)
          HIVE-3562 : Some limit can be pushed down to map stage (Navis via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1518234)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/conf/hive-default.xml.template
          • /hive/trunk/ql/build.xml
          • /hive/trunk/ql/ivy.xml
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ForwardOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown.q
          • /hive/trunk/ql/src/test/queries/clientpositive/limit_pushdown_negative.q
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/limit_pushdown_negative.q.out
          Ashutosh Chauhan added a comment -

This issue has been fixed and released as part of the 0.12 release. If you find further issues, please create a new jira and link it to this one.
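
          For reference, a minimal way to exercise the optimization in 0.12, assuming the hive.limit.pushdown.memory.usage property this patch adds to HiveConf (the value and query below are illustrative, not recommendations):

              -- fraction of map-task memory the RS top-N hash may use (illustrative value)
              set hive.limit.pushdown.memory.usage=0.1;
              -- the LIMIT is now partially applied inside the ReduceSinkOperator
              select key, value from src order by key limit 20;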

          Sergey Shelukhin added a comment -

Quick question - is it intended that excluded is only counted when a key is thrown out immediately? If so, the hash is likely to self-disable whenever keys are not cut right away.
If the keys all go into the heap and are evicted later, it can still be useful to output fewer rows, especially if the data size is big and N is comparatively small.

          Navis added a comment -

Sergey Shelukhin Sorry for the late reply. I thought the cost of the top-N hash (array copies, memory usage, etc.) would outweigh a possible but uncertain future gain. A better policy could be implemented in a follow-up issue by someone with a good idea.
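
          To make the trade-off concrete, below is a minimal, hypothetical Java sketch of a self-disabling map-side top-N filter. It is not the actual TopNHash code; the class name, the CHECK_EVERY interval, and the 10% threshold are assumptions made for illustration. Only the counting policy, where just the immediately rejected keys increment the excluded counter, mirrors the discussion above.

              import java.util.ArrayList;
              import java.util.Comparator;
              import java.util.List;
              import java.util.PriorityQueue;

              // Hypothetical sketch, not Hive's TopNHash: buffers the smallest topN keys
              // per mapper and self-disables when too few keys are rejected immediately.
              public class TopNSketch<K extends Comparable<K>> {
                public enum Result { FORWARD, BUFFERED, EXCLUDED }

                private static final long CHECK_EVERY = 1024; // illustrative check interval
                private final int topN;                       // assumes topN >= 1
                private final PriorityQueue<K> heap;          // max-heap: head is largest buffered key
                private long total;                           // keys offered so far
                private long excluded;                        // keys cut without entering the heap
                private boolean disabled;

                public TopNSketch(int topN) {
                  this.topN = topN;
                  this.heap = new PriorityQueue<>(topN, Comparator.reverseOrder());
                }

                public Result offer(K key) {
                  if (disabled) {
                    return Result.FORWARD;                    // self-disabled: pass keys straight through
                  }
                  total++;
                  Result result;
                  if (heap.size() == topN && key.compareTo(heap.peek()) >= 0) {
                    excluded++;                               // the only case the policy counts
                    result = Result.EXCLUDED;
                  } else {
                    if (heap.size() == topN) {
                      heap.poll();                            // a buffered key is evicted: dropped from
                    }                                         // the output but NOT counted as excluded
                    heap.add(key);
                    result = Result.BUFFERED;
                  }
                  maybeSelfDisable();
                  return result;
                }

                // Called at operator close: only the surviving topN keys reach the shuffle.
                public List<K> flush() {
                  List<K> out = new ArrayList<>(heap);        // order unspecified
                  heap.clear();
                  return out;
                }

                private void maybeSelfDisable() {
                  // Illustrative policy: if under 10% of the keys seen so far were cut
                  // immediately, judge the hash not worth its cost and switch it off.
                  if (total % CHECK_EVERY == 0 && excluded * 10 < total) {
                    disabled = true;
                    // NOTE: buffered keys would need to be forwarded here; omitted for brevity.
                  }
                }
              }

          Under this counting rule, a stream of ever-decreasing keys fills and churns the heap without ever incrementing excluded, so the sketch disables itself even while eviction was still shrinking the output; that is exactly the scenario the question above is about.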

          Sergey Shelukhin added a comment -

          Let me file a jira...


  People

  • Assignee: Navis
  • Reporter: Navis
  • Votes: 0
  • Watchers: 16
