Hive
  1. Hive
  2. HIVE-279

Implement predicate push down for hive queries

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Push predicates that are expressed in outer queries into inner queries where possible so that rows will get filtered out sooner.
      eg.

      select a., b. from a join b on (a.uid = b.uid) where a.age = 20 and a.gender = 'm'

      current compiler generates the filter predicate in the reducer after the join so all the rows have to be passed from mapper to reducer. by pushing the filter predicate to the mapper, query performance should improve.

      1. hive-279.6.patch
        1.16 MB
        Prasad Chakka
      2. hive-279.5.patch
        208 kB
        Prasad Chakka
      3. hive-279.4.patch
        119 kB
        Prasad Chakka
      4. hive-279.3.patch
        133 kB
        Prasad Chakka
      5. hive-279.2.patch
        79 kB
        Prasad Chakka
      6. hive-279.patch
        77 kB
        Prasad Chakka

        Issue Links

          Activity

          Prasad Chakka created issue -
          Hide
          Prasad Chakka added a comment -

          this is a drop for initial review since i suspect there will be lot of comments . it should work for all cases except for multi-insert queries.

          i have not enabled this by default but added a new config param called hive.optimize.ppd to enable this feature.

          i have not modified existing testcases but added couple of new testcases. will add more while uploading final patch.

          Show
          Prasad Chakka added a comment - this is a drop for initial review since i suspect there will be lot of comments . it should work for all cases except for multi-insert queries. i have not enabled this by default but added a new config param called hive.optimize.ppd to enable this feature. i have not modified existing testcases but added couple of new testcases. will add more while uploading final patch.
          Prasad Chakka made changes -
          Field Original Value New Value
          Attachment hive-279.patch [ 12400471 ]
          Hide
          Prasad Chakka added a comment -

          Here is an updated patch that simplifies some of the column reference conversions. Now every operator converts the column references in the predicate expressions so that they will be understood by the parent operator(s)

          Show
          Prasad Chakka added a comment - Here is an updated patch that simplifies some of the column reference conversions. Now every operator converts the column references in the predicate expressions so that they will be understood by the parent operator(s)
          Prasad Chakka made changes -
          Attachment hive-279.2.patch [ 12400893 ]
          Hide
          Namit Jain added a comment -

          Some high level comments:

          1. Add more comments everywhere, specifically in joinPPD (OpProcFactory)
          2. Remove operator specific code in ExprWalkerProcFactory: ColumnExprProcessor: process
          3. Use specific data-structures where-ever possible instead of using more generic data-structures.

          ExprWalkerInfo:

          private Map<String, List<Node>> pushdownPreds;
          private Map<Node, ExprInfo> exprInfoMap;

          In both of them, Node means exprNodeDesc, why dont we use that instead ?

          Simlarly, in OpWalkerInfo:

          private Map<Node, ExprWalkerInfo> opToPrunedPredsMap;
          private Map<Operator<? extends Serializable>, OpParseContext> opToParseCtxMap;

          use Operator instead of Node in opToPrunedPredsMap

          4. Can you move OpWalker and ExprWalker in different directories ?
          5. Why are filters only pushed on top of TableScan - cant it be done anywhere. - If you want to do so in a follow-up, can you file a JIRA for that ?
          6. No apache header in many files (ppd directory)

          SemanticAnalyzer.java:

          A comment explaining the reason for existence of colInfoMap will help. Give an example: group by
          where the table column order is different from the grouped column order.

          Same for posAliasMap, nameToInputColumnInfoMap for JOIN

          genJoinOperatorChildren:

          if(aliases == null)

          { aliases = new HashSet<String>(); posToAliasMap.put(pos, aliases); }

          isn't the IF redundant ?

          Show
          Namit Jain added a comment - Some high level comments: 1. Add more comments everywhere, specifically in joinPPD (OpProcFactory) 2. Remove operator specific code in ExprWalkerProcFactory: ColumnExprProcessor: process 3. Use specific data-structures where-ever possible instead of using more generic data-structures. ExprWalkerInfo: private Map<String, List<Node>> pushdownPreds; private Map<Node, ExprInfo> exprInfoMap; In both of them, Node means exprNodeDesc, why dont we use that instead ? Simlarly, in OpWalkerInfo: private Map<Node, ExprWalkerInfo> opToPrunedPredsMap; private Map<Operator<? extends Serializable>, OpParseContext> opToParseCtxMap; use Operator instead of Node in opToPrunedPredsMap 4. Can you move OpWalker and ExprWalker in different directories ? 5. Why are filters only pushed on top of TableScan - cant it be done anywhere. - If you want to do so in a follow-up, can you file a JIRA for that ? 6. No apache header in many files (ppd directory) SemanticAnalyzer.java: A comment explaining the reason for existence of colInfoMap will help. Give an example: group by where the table column order is different from the grouped column order. Same for posAliasMap, nameToInputColumnInfoMap for JOIN genJoinOperatorChildren: if(aliases == null) { aliases = new HashSet<String>(); posToAliasMap.put(pos, aliases); } isn't the IF redundant ?
          Hide
          Prasad Chakka added a comment -

          1) will do
          2) can't remove all op specific code here. gby still needs to check whether this column is part of keys or values and value cols can't be pushed.
          3) will do when possible without doing too much class casting
          4) once i use specific classes instead of Node, things should be more clear and I don't see any reason in moving to a separate directory. They are already different classes so not sure the benefit of moving to separate directories
          5) will open a new JIRA once I am done with this

          colInfoMap will go away in the next patch.

          Show
          Prasad Chakka added a comment - 1) will do 2) can't remove all op specific code here. gby still needs to check whether this column is part of keys or values and value cols can't be pushed. 3) will do when possible without doing too much class casting 4) once i use specific classes instead of Node, things should be more clear and I don't see any reason in moving to a separate directory. They are already different classes so not sure the benefit of moving to separate directories 5) will open a new JIRA once I am done with this colInfoMap will go away in the next patch.
          Hide
          Prasad Chakka added a comment -

          incorporated comments from Namit and added a bunch of new testcases.

          due to changes to exprNodeDesc, the outputs of all parse unit tests have changed. I will upload them in couple of days. otherwise there are no pending code changes for this JIRA.

          will open jiras for the following further optimizatons
          1) mult-insert
          2) intermediate filter operators for partial pushdown
          3) remove pushed preds from original operator to eliminate repeat evaluation.
          4) merge successive filter ops into single op

          Show
          Prasad Chakka added a comment - incorporated comments from Namit and added a bunch of new testcases. due to changes to exprNodeDesc, the outputs of all parse unit tests have changed. I will upload them in couple of days. otherwise there are no pending code changes for this JIRA. will open jiras for the following further optimizatons 1) mult-insert 2) intermediate filter operators for partial pushdown 3) remove pushed preds from original operator to eliminate repeat evaluation. 4) merge successive filter ops into single op
          Prasad Chakka made changes -
          Attachment hive-279.3.patch [ 12401190 ]
          Raghotham Murthy made changes -
          Component/s Query Processor [ 12312586 ]
          Hide
          Prasad Chakka added a comment -

          updated the patch to the latest trunk

          Show
          Prasad Chakka added a comment - updated the patch to the latest trunk
          Prasad Chakka made changes -
          Attachment hive-279.4.patch [ 12403832 ]
          Hide
          Namit Jain added a comment -

          Add comments in tests:

          For eg: ppd_gby.q : the src1.c1 > 'val_200' is pushed up, but the other is not etc.

          More tests needed:

          groupby followed by groupby
          groupby followed by join
          3-way join not being merged
          3-way join being merged
          outer join various scenarios
          etc.

          add a test for multi-table insert also, where ppd is not happening.

          rand() being undeterministic - I think that change has already been merged by Raghu

          PredicatePushdown.java: line 45: pusehed -> pushed

          Optimizer.java:

          columnPruner should be done after ppd, since it regenerates the operator tree.
          Can you add a test for that ? I think ppd will not happen - need to confirm via a test

          It might be a debugging nightmare - can you add a LOG trace/info in OpProcFactory
          (minimally in TableScanPPD - ideally everywhere.

          In SemanticAnalyzer: the colPosMap is not maintained in genReduceSinkPlan :
          although the RR does not change, it might be a good idea to add a test for the same.
          ppd after cluster by

          ExprWalkerProcFactory:72
          if(exp == null)

          { ctx.setIsCandidate(colref, false); return false; }

          I am assuming exp can be null only because colExprMap is not maintained in some cases
          (for eg: group by exprs.)

          Is that true ?
          If yes, Can you add a comment for the same ?
          If no, can you explain that ?

          83: ctx.setIsCandidate(colref, true);
          redundant

          112: cant u break out of the loop if isCandidate is false

          OpProcFactory:

          128: the order of parents of children of tablescan can be lost, change parent at that position

          247: if(aliases.size() == 1 && aliases.contains(""))

          { // Reduce sink of group by operator aliases = null; }

          Instead of this, do you want to add a parameter to mergeWithChildrenPred() – allAliasesOk

          null and empty aliases are differentiated in mergeWi..() in a bizarre way, it might be easier
          to understand with a seperate parameter

          Some cleanup:

          JoinOperator: posToAliasMap --> cant it me moved to ParseContext instead ?
          same for colExprMap -> or it can be moved to OpParseContext ?
          They are all parse time structures.

          Show
          Namit Jain added a comment - Add comments in tests: For eg: ppd_gby.q : the src1.c1 > 'val_200' is pushed up, but the other is not etc. More tests needed: groupby followed by groupby groupby followed by join 3-way join not being merged 3-way join being merged outer join various scenarios etc. add a test for multi-table insert also, where ppd is not happening. rand() being undeterministic - I think that change has already been merged by Raghu PredicatePushdown.java: line 45: pusehed -> pushed Optimizer.java: columnPruner should be done after ppd, since it regenerates the operator tree. Can you add a test for that ? I think ppd will not happen - need to confirm via a test It might be a debugging nightmare - can you add a LOG trace/info in OpProcFactory (minimally in TableScanPPD - ideally everywhere. In SemanticAnalyzer: the colPosMap is not maintained in genReduceSinkPlan : although the RR does not change, it might be a good idea to add a test for the same. ppd after cluster by ExprWalkerProcFactory:72 if(exp == null) { ctx.setIsCandidate(colref, false); return false; } I am assuming exp can be null only because colExprMap is not maintained in some cases (for eg: group by exprs.) Is that true ? If yes, Can you add a comment for the same ? If no, can you explain that ? 83: ctx.setIsCandidate(colref, true); redundant 112: cant u break out of the loop if isCandidate is false OpProcFactory: 128: the order of parents of children of tablescan can be lost, change parent at that position 247: if(aliases.size() == 1 && aliases.contains("")) { // Reduce sink of group by operator aliases = null; } Instead of this, do you want to add a parameter to mergeWithChildrenPred() – allAliasesOk null and empty aliases are differentiated in mergeWi..() in a bizarre way, it might be easier to understand with a seperate parameter Some cleanup: JoinOperator: posToAliasMap --> cant it me moved to ParseContext instead ? same for colExprMap -> or it can be moved to OpParseContext ? They are all parse time structures.
          Hide
          Prasad Chakka added a comment -

          will add the tests & take care of minor comments

          colExprMap & posToAliasMap are just like any other transient fields in the operator and not needed beyond predicate pushdown. so i am going to leave them there.

          TableScan operator can't have more than one child or rather any children of TableScan can't have more than one parent. So changing the order doesn't matter and this is assumed in lot of places. But I will change it so that the order is unchanged.

          Show
          Prasad Chakka added a comment - will add the tests & take care of minor comments colExprMap & posToAliasMap are just like any other transient fields in the operator and not needed beyond predicate pushdown. so i am going to leave them there. TableScan operator can't have more than one child or rather any children of TableScan can't have more than one parent. So changing the order doesn't matter and this is assumed in lot of places. But I will change it so that the order is unchanged.
          Hide
          Prasad Chakka added a comment -

          updated patch incorporating Namit's comments

          Show
          Prasad Chakka added a comment - updated patch incorporating Namit's comments
          Prasad Chakka made changes -
          Attachment hive-279.5.patch [ 12404101 ]
          Hide
          Ashish Thusoo added a comment -

          +1

          Looks good to me.

          A few things to watch out for before you upload the complete patch..

          1. ppd_gby_join.q is missing.
          2. Would be great to have a join test case where the left and right ourter joins are part of the same join operator.
          3. Also would be great to have a join test case where there is a three way join and the join operators do not merge.
          4. Also a test case with UDFs (not the rand one) that gets pushed down would also be awesome.

          Otherwise looks quite cool to me.

          Show
          Ashish Thusoo added a comment - +1 Looks good to me. A few things to watch out for before you upload the complete patch.. 1. ppd_gby_join.q is missing. 2. Would be great to have a join test case where the left and right ourter joins are part of the same join operator. 3. Also would be great to have a join test case where there is a three way join and the join operators do not merge. 4. Also a test case with UDFs (not the rand one) that gets pushed down would also be awesome. Otherwise looks quite cool to me.
          Hide
          Prasad Chakka added a comment -

          1) I missed that file in the patch
          2) Will add it
          3) This test case is already there (ppd_join2.q or ppd_join3.q)
          4) Will add it, but aren't all operators UDFs?

          Show
          Prasad Chakka added a comment - 1) I missed that file in the patch 2) Will add it 3) This test case is already there (ppd_join2.q or ppd_join3.q) 4) Will add it, but aren't all operators UDFs?
          Hide
          Prasad Chakka added a comment -

          added following tests
          1) multi insert (which doesn't push predicates)
          2) query containing left and right outer joins

          This patch is intended to be committed.

          Show
          Prasad Chakka added a comment - added following tests 1) multi insert (which doesn't push predicates) 2) query containing left and right outer joins This patch is intended to be committed.
          Prasad Chakka made changes -
          Attachment hive-279.6.patch [ 12404582 ]
          Prasad Chakka made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop Flags [Reviewed]
          Fix Version/s 0.4.0 [ 12313714 ]
          Hide
          Ashish Thusoo added a comment -

          +1

          Running tests now. Will commit once the tests pass.

          Show
          Ashish Thusoo added a comment - +1 Running tests now. Will commit once the tests pass.
          Hide
          Ashish Thusoo added a comment -

          committed. Thanks Prasad!!

          Show
          Ashish Thusoo added a comment - committed. Thanks Prasad!!
          Ashish Thusoo made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Zheng Shao made changes -
          Affects Version/s 0.6.0 [ 12314524 ]
          Affects Version/s 0.2.0 [ 12313565 ]
          Carl Steinbach made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Lefty Leverenz made changes -
          Link This issue is related to HIVE-2337 [ HIVE-2337 ]

            People

            • Assignee:
              Prasad Chakka
              Reporter:
              Prasad Chakka
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development