Pig
  1. Pig
  2. PIG-1494

PIG Logical Optimization: Use CNF in PushUpFilter

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.7.0
    • Fix Version/s: None
    • Component/s: impl
    • Labels:
      None

      Description

      The PushUpFilter rule is not able to handle complicated boolean expressions.

      For example, SplitFilter rule is splitting one LOFilter into two by "AND". However it will not be able to split LOFilter if the top level operator is "OR". For example:

      ex script:
      A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
      B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
      C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
      J1 = JOIN B by b1, C by c1;
      J2 = JOIN J1 by $0, A by a1;
      D = Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);
      explain D;
      In the above example, the PushUpFilter is not able to push any filter condition across any join as it contains columns from all branches (inputs). But if we convert this expression into "Conjunctive Normal Form" (CNF) then we would be able to push filter condition c1< 10 and c2 == 5 below both join conditions. Here is the CNF expression for highlighted line:

      ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

      Suggestion: It would be a good idea to convert LOFilter's boolean expression into CNF, it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also not require rule SplitFilter anymore if we were to add this utility to rule PushUpFilter itself.

        Issue Links

          Activity

          Hide
          Olga Natkovich added a comment -

          Unlinking from 0.8 since we are about to branch for release

          Show
          Olga Natkovich added a comment - Unlinking from 0.8 since we are about to branch for release
          Hide
          Olga Natkovich added a comment -

          Can this be moved from 0.8 to 0.9 release since we are about to branch for 0.9?

          Show
          Olga Natkovich added a comment - Can this be moved from 0.8 to 0.9 release since we are about to branch for 0.9?
          Hide
          Olga Natkovich added a comment -

          Swati, I am assigning it to you since I am assuming you plan to work on it for 0.8. Otherwise, it is unlikely to happen in 0.8 timeframe. Feel free to unassign and unlink from this release if this is not the case

          Show
          Olga Natkovich added a comment - Swati, I am assigning it to you since I am assuming you plan to work on it for 0.8. Otherwise, it is unlikely to happen in 0.8 timeframe. Feel free to unassign and unlink from this release if this is not the case
          Hide
          Swati Jain added a comment -

          Reply from Yan Zhou:

          The filter logic split problem can be divided into 2 parts:
          1) the filtering logic that can be applied to individual input sources;
          and 2) the filtering logic that has to be applied when merged(or joined)
          inputs are processed.

          The benefits for 1) are any benefits if the underlying storage supports
          predicate pushdown; plus the memory/CPU savings by PIG for not
          processing the unqualified rows.

          For 2), the purpose is not paying higher evaluation costs than
          necessary.

          For 1), no normal form is necessary. The original logical expression
          tree
          can be trimmed off any sub-expressions that are not constants nor just
          from a particular input source. The complexity is linear with the tree
          size; while the use of normal form could potentially lead to exponential
          complexity. The difficulty with this approach is how to generate the
          filtering logic for 2); while CNF can be used to easily figure out the
          logic for 2). However, the exact logic in 2) might not be cheaper to
          evaluate than the original logical expression. An example is "Filter J2
          by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
          filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
          ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
          will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
          to 4, compared with 3 logical evaluations in the original form.

          In summary, if only 1) is desired, the tree trimming is enough. If 2) is
          desired too, then CNF could be used but its complexity should be
          controlled and the cost of the filtering logic evaluation in 2) should
          be computed and compared with the original expression evaluation cost.
          Further optimization is possible in this direction.

          Another potential optimization to consider is to support logical
          expression tree of multiple children vs. the binary tree after taking
          into consideration of the commutative property of OR and AND operations.
          The advantages are less tree traversal costs and easier to change the
          evaluation ordering within the same sub-tree in order to maximize the
          possibilities to short-cut the evaluations. Although this is general for
          all logical expressions, this tends to be more suitable for normal form
          handlings as normal forms group the sub-expressions by the operators
          that act on the sub-expressions.

          Show
          Swati Jain added a comment - Reply from Yan Zhou: The filter logic split problem can be divided into 2 parts: 1) the filtering logic that can be applied to individual input sources; and 2) the filtering logic that has to be applied when merged(or joined) inputs are processed. The benefits for 1) are any benefits if the underlying storage supports predicate pushdown; plus the memory/CPU savings by PIG for not processing the unqualified rows. For 2), the purpose is not paying higher evaluation costs than necessary. For 1), no normal form is necessary. The original logical expression tree can be trimmed off any sub-expressions that are not constants nor just from a particular input source. The complexity is linear with the tree size; while the use of normal form could potentially lead to exponential complexity. The difficulty with this approach is how to generate the filtering logic for 2); while CNF can be used to easily figure out the logic for 2). However, the exact logic in 2) might not be cheaper to evaluate than the original logical expression. An example is "Filter J2 by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced to 4, compared with 3 logical evaluations in the original form. In summary, if only 1) is desired, the tree trimming is enough. If 2) is desired too, then CNF could be used but its complexity should be controlled and the cost of the filtering logic evaluation in 2) should be computed and compared with the original expression evaluation cost. Further optimization is possible in this direction. Another potential optimization to consider is to support logical expression tree of multiple children vs. the binary tree after taking into consideration of the commutative property of OR and AND operations. The advantages are less tree traversal costs and easier to change the evaluation ordering within the same sub-tree in order to maximize the possibilities to short-cut the evaluations. Although this is general for all logical expressions, this tends to be more suitable for normal form handlings as normal forms group the sub-expressions by the operators that act on the sub-expressions.

            People

            • Assignee:
              Swati Jain
              Reporter:
              Swati Jain
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development