Hive / HIVE-578

Refactor partition pruning code as an optimizer transformation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Several bugs with partition pruning have been reported, and the correct fix for many of them is to rewrite the partition pruning code as an optimizer transformation that kicks in after the predicate pushdown code. The refactored code also uses the graph walker framework, so the partition pruning code is consolidated with the existing frameworks and works on the operator tree rather than on the query block.
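
      For orientation, the rough shape of such a transformation in the graph walker framework is sketched below. This is an illustrative sketch, not the committed patch: the class and processor names (OperatorTreePruner, FilterPruneProc) are hypothetical, while Transform, RuleRegExp, DefaultRuleDispatcher, and DefaultGraphWalker are the framework classes of this era.

      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      import java.util.Map;
      import java.util.Stack;

      import org.apache.hadoop.hive.ql.lib.*;
      import org.apache.hadoop.hive.ql.optimizer.Transform;
      import org.apache.hadoop.hive.ql.parse.ParseContext;
      import org.apache.hadoop.hive.ql.parse.SemanticException;

      // Hypothetical sketch of the pruner written as an optimizer transformation.
      public class OperatorTreePruner implements Transform {
        public ParseContext transform(ParseContext pctx) throws SemanticException {
          Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();
          // fire on a filter sitting directly on top of a table scan
          rules.put(new RuleRegExp("R1", "TS%FIL%"), new FilterPruneProc());

          // no default processor; nodes that match no rule are skipped
          Dispatcher disp = new DefaultRuleDispatcher(null, rules, null);
          GraphWalker walker = new DefaultGraphWalker(disp);

          // walk the operator tree starting from the top (table scan) operators
          ArrayList<Node> topNodes = new ArrayList<Node>(pctx.getTopOps().values());
          walker.startWalking(topNodes, null);
          return pctx;
        }

        // Invoked when the rule matches; a real processor would collect the
        // filter's pruning predicate for the table scan. Body elided here.
        private static class FilterPruneProc implements NodeProcessor {
          public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
              Object... nodeOutputs) throws SemanticException {
            return null;
          }
        }
      }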

      Attachments

      1. patch-578_1.txt
        1.03 MB
        Ashish Thusoo
      2. patch-578_2.txt
        1.22 MB
        Ashish Thusoo
      3. patch-578_3.txt
        1.25 MB
        Ashish Thusoo
      4. patch-578.txt
        68 kB
        Ashish Thusoo

        Issue Links

          This issue is duplicated by HIVE-218

          Activity

          Ashish Thusoo created issue -
          Ashish Thusoo added a comment -

           This is a preliminary code patch. I have not yet plugged this into the SemanticAnalyzer phase, but comments are welcome in order to catch any early omissions. The patch introduces test diffs, which I will upload with the final patch.

          Ashish Thusoo made changes -
          Attachment patch-578.txt [ 12411658 ]
          Namit Jain added a comment -

          Some high-level comments:

           1. Can you share some of the expression walker code from Predicate Pushdown? There seems to be a lot of common code.
           2. Can you always assume that the predicate to be filtered will be immediately after the table scan?

           Currently, all optimizations are independent of each other and can be turned off. This way, you are dependent on a new feature, which may have stability issues.

           I don't remember exactly how predicate pushdown works - if a filter is already present immediately after the table scan, will the pushed-up filter be merged? Is it pushed across user-defined mappers/reducers? Can't these things change in the future?

           It might be a good idea to look for TS*FIL and then keep some mapping - or probably use the same mapping that Predicate Pushdown has - to figure out whether the filter is for the table under consideration.

          Ashish Thusoo added a comment -

           Actually, the walker is already shared. However, the actions encapsulated in ExprProcFactory etc. are by definition different for predicate pushdown and partition pruning, so there is no sharing there.

           For 2, I checked with Prasad, and this assumption is correct. Also, if there are multiple filter predicates after the TS, they are all merged together and a new filter predicate is generated under the TS node, so looking for TS%FIL% should be sufficient.

           Regarding the optimization, I will keep the old pruning code around and toggle the new functionality on only when predicate pushdown is enabled. I am also going to add a number of tests before this goes in.

          Raghotham Murthy made changes -
           Summary: "Refator partition pruning code as an optimizer transformation in order to fix bugs that appear in the current code" → "Refactor partition pruning code as an optimizer transformation"
          Raghotham Murthy added a comment -

          I am trying to make similar changes to input pruning for sampling.

          A few questions:

           How come we have both tableDesc and tableScanDesc? I see that tableDesc is used in FetchWork and tableScanDesc in TableScanOperator. Does it make sense to make FetchWork use TableScanOperator (FetchOperator is not really an operator) and then merge tableDesc and tableScanDesc? This would also allow us to have the same code path for select * queries. Right now it looks like FetchOperator duplicates some code. Also, sample pruning could use the merged tableScanDesc object to store the sampling information.

          Ashish Thusoo added a comment -

           Yes, that makes sense. In fact, that needs to be done to make pruning work with "select * from T where T.part = xyz" kinds of queries; I ran into that a few days back. I have punted on that for the first stage and still rely on the old pruning code for such queries and for when predicate pushdown is switched off. I have it mostly passing the tests, except for the sampling tests, where the predicate pushdown code does not seem to be merging the two filter operators, an assumption I have made in my code. I will update today with a version sans the new tests and with some cleanup. That way we can see how the sampling stuff can work with this as well.

          Ashish Thusoo added a comment -

           Actually, the semantics with table sampling are a bit tricky in case the sampling function is non-deterministic, e.g. rand(). In that case there are two possibilities:

           1. The sampling is done on the whole table and then the predicate is applied, which implies no partition pruning.
           or
           2. The sampling is done after the partition predicate is applied, i.e. after pruning.

           2 is what we do today. I am OK with keeping that, but I just wanted to call it out as a special case in the partition pruning code (see the sketch below).
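
           As a minimal sketch of how such a special case could be detected: the guard below is hypothetical and not from the patch, though the @UDFType(deterministic = false) annotation is the actual mechanism of this era for marking UDFs such as rand() as non-deterministic.

           import org.apache.hadoop.hive.ql.exec.UDF;
           import org.apache.hadoop.hive.ql.udf.UDFType;

           public final class SamplingPruneGuard {
             // True when the sampling UDF is deterministic, in which case
             // sampling before pruning (option 1) and sampling after pruning
             // (option 2) give the same result.
             public static boolean isDeterministic(Class<? extends UDF> udfClass) {
               UDFType type = udfClass.getAnnotation(UDFType.class);
               return type == null || type.deterministic();
             }
           }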

          Ashish Thusoo added a comment -

          New patch. This is now passing all the tests. I am going to add some more tests and refactor some of the state from ParseContext to tableScanDesc.

          Ashish Thusoo made changes -
          Attachment patch-578_1.txt [ 12414294 ]
          Prasad Chakka added a comment -

           Some comments.

           for(String partName: Hive.get().getPartitionNames(MetaStoreUtils.DEFAULT_DATABASE_NAME, tab.getName(), (short) -1))

           1. You should use tab.getDbName() instead of MetaStoreUtils.DEFAULT_DATABASE_NAME.

           2. PartitionPruner::prune() should return partSpecs or partNames instead of Partition objects and avoid calling getPartition() on all partNames. This would reduce unnecessary load on the metastore (see the sketch at the end of this comment).

                 if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVEOPTPPD)) {
                   transformations.add(new PredicatePushDown());
           +       transformations.add(new PartitionPruner());
                 }

           3. You should add a new config param instead of piggybacking on PPD.

           4. Isn't ColumnInfo.internalName unique among all tables? Is tableName additional information that is needed? Can't you get this info from the RowResolver? Same question for exprNodeColDesc.

           5. I don't understand the change in DefaultGraphWalker.java.

           6. Can't most of the functionality from the ql/parse/PartitionPruner.java class be moved to the new class? At least change the name so that it is not confusing which pruner is doing what.

           7. The import org.apache.hadoop.hive.ql.ppd.PredicatePushDown is not needed in SemanticAnalyzer.java.

          8.
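
           A sketch of the signature change suggested in point 2 might look like the following. The class name and the elided predicate evaluation are hypothetical; Hive.get().getPartitionNames() is the metastore call already shown above.

           import java.util.ArrayList;
           import java.util.List;

           import org.apache.hadoop.hive.ql.metadata.Hive;
           import org.apache.hadoop.hive.ql.metadata.HiveException;
           import org.apache.hadoop.hive.ql.metadata.Table;

           public class PartitionNamePruner {
             // Return pruned partition *names* only; callers fetch full Partition
             // objects lazily, avoiding a getPartition() metastore round trip for
             // every candidate partition.
             public static List<String> prune(Table tab) throws HiveException {
               List<String> pruned = new ArrayList<String>();
               for (String partName : Hive.get().getPartitionNames(
                   tab.getDbName(), tab.getName(), (short) -1)) {
                 // evaluate the pruning predicate against the partition spec
                 // parsed from partName; the evaluation is elided in this sketch
                 pruned.add(partName);
               }
               return pruned;
             }
           }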

          Namit Jain added a comment -

          Some more comments:

          1. Add lots of new tests
          2. Wrong comments in OpProcFactory:

           // If fop2 exists (i.e this is not the top level filter and fop2 is not
           // a sampling filter t1ehen we ignore the current filter
           if (fop2 != null && !fop2.getConf().getIsSamplingPred())
             return null;

           // ignore the predicate in case it is not a sampling predicate
           if (fop.getConf().getIsSamplingPred()) { return null; }

          should be:


          // If fop2 exists (i.e this is not the top level filter and fop2 is not
          // a sampling filter then we ignore the current filter
          if (fop2 != null && !fop2.getConf().getIsSamplingPred())
          return null;

          // ignore the predicate in case it is a sampling predicate
          if (fop.getConf().getIsSamplingPred()) { return null; }

           3. ExprProcFactory.java:

           Wrong parameter docs:

           /**
            * Extracts pushdown predicates from the given list of predicate expressions
            * @param opContext operator context used for resolving column references
            * @param op operator of the predicates being processed
            * @param preds
            * @return hasNonPartCols returns true/false depending upon whether this pred has a non-partition column
            * @throws SemanticException
            */

           4. ExprProcFactory: 104/146/182
           Shouldn't you also check for exprNodeNullDesc?

           5. Wrong comment at line 63:

           /**
            * Converts the reference from the child row resolver to the current row resolver
            */
          Ashish Thusoo added a comment -

           @Namit
           1. I have added a bunch of tests with the new patch.
           2/3/5. Fixed the typos, wrong comments, and javadocs.
           4. I don't think I have to check for exprNodeNullDesc. The unknown value of an operand is denoted by a null value in exprNodeConstantDesc. exprNodeNullDesc is used to denote the literal NULL, which should be treated as-is in the partition pruning expression; e.g. if(ISNULL(NULL), 1, 0) should evaluate to 1 instead of unknown. So I think it is correct not to check for exprNodeNullDesc in the function processors (see the sketch at the end of this comment). Thoughts?

           @Prasad
           3. I have used the ppd parameter because I cannot do partition pruning without ppd. So even if I had a separate parameter, it would still have to be checked along with ppd. At that point, a separate parameter does not seem to add much value. Thoughts?
           7. Removed.
           6. I have changed the name to ASTPartitionPruner.
           5. This fixes a bug in DefaultGraphWalker. Due to this bug, the operator stack would always get popped when you exited the call. As a result, the stack would contain only the last node rather than all the nodes on the path leading to it.
           4. Which particular code file are you talking about here?
           2. I am not sure how that is different from what we were doing before.
           1. Made the change.
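
           For illustration, the distinction described in point 4 can be shown with the plan classes of this era (the lowercase class names match the 0.4-era org.apache.hadoop.hive.ql.plan package; treat the snippet as a sketch rather than patch code):

           import org.apache.hadoop.hive.ql.plan.exprNodeConstantDesc;
           import org.apache.hadoop.hive.ql.plan.exprNodeDesc;
           import org.apache.hadoop.hive.ql.plan.exprNodeNullDesc;
           import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;

           public class NullVsUnknown {
             public static void main(String[] args) {
               // "unknown" for pruning: a constant desc whose value is null;
               // the pruner cannot decide, so the partition must be retained
               exprNodeDesc unknown =
                   new exprNodeConstantDesc(TypeInfoFactory.booleanTypeInfo, null);

               // the SQL literal NULL: a distinct desc class, evaluated as-is,
               // so if(ISNULL(NULL), 1, 0) folds to 1 rather than to unknown
               exprNodeDesc literalNull = new exprNodeNullDesc();

               System.out.println(unknown.getClass() != literalNull.getClass());
             }
           }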

          Ashish Thusoo added a comment -

          New patch with tests and changes.

          Ashish Thusoo made changes -
          Attachment patch-578_2.txt [ 12415714 ]
          Zheng Shao added a comment -

          We should add a separate parameter to enable the new partition pruner.

          Since the new partition pruner works only when ppd is enabled, we should have 3 possible modes instead of 4.
          hive.opt.ppd=false, hive.partition.pruner=ast
          hive.opt.ppd=true, hive.partition.pruner=ast
          hive.opt.ppd=true, hive.partition.pruner=opt

          Ashish Thusoo added a comment -

           New patch in which ppr is controlled by a boolean parameter. It kicks in only if ppd and ppr are both true; if either is false, the AST-based approach is used.
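
           The resulting gating logic is then roughly the following. HIVEOPTPPD is the existing ppd flag; the name of the new pruner parameter shown here is an assumption, not necessarily the committed one.

           import org.apache.hadoop.hive.conf.HiveConf;

           public class PrunerGate {
             // The new operator-tree pruner runs only when ppd and ppr are both
             // on; otherwise the old AST-based pruner stays in effect.
             public static boolean useOptPruner(HiveConf conf) {
               boolean ppd = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVEOPTPPD);
               boolean ppr = conf.getBoolean("hive.optimize.pruner", true); // assumed name
               return ppd && ppr;
             }
           }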

          Ashish Thusoo made changes -
          Attachment patch-578_3.txt [ 12415789 ]
          Ashish Thusoo made changes -
           Link: This issue is duplicated by HIVE-218 [ HIVE-218 ]
          Namit Jain added a comment -

           patching file ql/src/java/org/apache/hadoop/hive/ql/parse/PartitionPruner.java
          (Stripping trailing CRs from patch.)
          can't find file to patch at input line 27070
          Perhaps you used the wrong -p or --strip option?
          The text leading up to this was:
          --------------------------

          Index: ql/src/java/org/apache/hadoop/hive/ql/parse/ASTPartitionPruner.java
          ===================================================================
           --- ql/src/java/org/apache/hadoop/hive/ql/parse/ASTPartitionPruner.java (revision 801363)
          +++ ql/src/java/org/apache/hadoop/hive/ql/parse/ASTPartitionPruner.java (working copy)
          --------------------------
          File to patch:
          Skip this patch? [y]
          Skipping patch.
          5 out of 5 hunks ignored
          (Stripping trailing CRs from patch.)
          patching file ql/src/java/org/apache/hadoop/hive/ql/parse/TypeCheckProcFactory.java
          (Stripping trailing CRs from patch.)
          patching file ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
          (Stripping trailing CRs from patch.)
          patching file ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
          (Stripping trailing CRs from patch.)
          patching file ql/src/java/org/apache/hadoop/hive/ql/parse/PrunedPartitionList.java

           Problems in applying the patch - can you generate the patch again?

          Ashish Thusoo added a comment -

           I think the patch is correct. Before applying it, just do an svn mv of the old PartitionPruner.java to ASTPartitionPruner.java (or change the class name in Eclipse), and then apply the patch. That should work.

          Namit Jain added a comment -

          Done - running tests right now

          Namit Jain added a comment -

          Committed. Thanks Ashish

          Namit Jain made changes -
           Status: Open [ 1 ] → Resolved [ 5 ]
           Hadoop Flags: [Reviewed]
           Resolution: Fixed [ 1 ]
          Carl Steinbach made changes -
           Fix Version/s: 0.4.0 [ 12313714 ]
          Carl Steinbach made changes -
           Status: Resolved [ 5 ] → Closed [ 6 ]

             People

             • Assignee: Ashish Thusoo
             • Reporter: Ashish Thusoo
             • Votes: 0
             • Watchers: 2
