Pig
  1. Pig
  2. PIG-2316

Incorrect results for FILTER *** BY ( *** OR ***) with FilterLogicExpressionSimplifier optimizer turned on

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.0, 0.8.1, 0.9.0, 0.9.1
    • Fix Version/s: 0.9.2, 0.10.0, 0.8.2
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Hide
      FilterLogicExpressionSimplifier optimization rule is disabled by default. To enable the optimization rule, set the property pig.exec.filterLogicExpressionSimplifier to true.
      Show
      FilterLogicExpressionSimplifier optimization rule is disabled by default. To enable the optimization rule, set the property pig.exec.filterLogicExpressionSimplifier to true.

      Description

      An example for this bug:

      cat weird.txt
      1,a
      2,b
      3,c

      When running pig with the following statements:

      A = LOAD 'weird.txt' using PigStorage(',') AS (col1:int,col2);
      B = FILTER A BY ((col1==1) OR (col1 != 1));
      DUMP B;

      I expect to get the result of all three rows back, but I receive only two rows.

      (2,b)
      (3,c)

      When we start pig with optimizer turning off.

      pig -optimizer_off All

      With optimizer turning off, we get the expected results and I get three rows for the same statements.

      (1,a)
      (2,b)
      (3,c)

      --------------------------------------------------------

      This bug was test on:

      pig-0.9.1,
      pig-0.9.0,
      pig-0.8.1,
      pig-0.8.0

      All produced same incorrect results.

      --------------------------------------------------------

      When looked at the logical plan for this example, we found FilterlogicExpressionSimplifier optimizer produced incorrect logical plan. So we guess the bug is caused by FilterlogicExpressionSimplifier optimizer.

      1. pig-2316-08-v3.txt
        9 kB
        Thejas M Nair
      2. pig-2316-09-v3.txt
        8 kB
        Thejas M Nair
      3. pig-2316-trunk-v3.txt
        8 kB
        Thejas M Nair
      4. pig-2316-trunk-v1.txt
        4 kB
        Koji Noguchi

        Issue Links

          Activity

          Hide
          Thejas M Nair added a comment -

          tests passed. patch committed to 0.8 branch and trunk as well.

          Show
          Thejas M Nair added a comment - tests passed. patch committed to 0.8 branch and trunk as well.
          Hide
          Thejas M Nair added a comment -

          unit tests passed for 0.9 branch. patch committed to 0.9 branch. Running tests for 0.8 branch and trunk.

          Show
          Thejas M Nair added a comment - unit tests passed for 0.9 branch. patch committed to 0.9 branch. Running tests for 0.8 branch and trunk.
          Hide
          Olga Natkovich added a comment -

          +1 on the pateches.

          The name that we are using to specify exclusion rules (optimizerRules) is confusing but we do not need to fix it in this patch

          Show
          Olga Natkovich added a comment - +1 on the pateches. The name that we are using to specify exclusion rules (optimizerRules) is confusing but we do not need to fix it in this patch
          Hide
          Thejas M Nair added a comment -

          pig-2316-08-v3.txt - patch for 0.8 branch

          Show
          Thejas M Nair added a comment - pig-2316-08-v3.txt - patch for 0.8 branch
          Hide
          Olga Natkovich added a comment -

          Thejas, we also need a patch for 0.8, thanks

          Show
          Olga Natkovich added a comment - Thejas, we also need a patch for 0.8, thanks
          Hide
          Thejas M Nair added a comment -

          pig-2316-09-v3.txt - v3 version for 0.9 branch

          Show
          Thejas M Nair added a comment - pig-2316-09-v3.txt - v3 version for 0.9 branch
          Hide
          Thejas M Nair added a comment -

          pig-2316-trunk-v3.txt - patch for trunk to have FilterLogicExpressionSimplifier disabled by default. Incorporates Koji's changes in v1.txt .

          Show
          Thejas M Nair added a comment - pig-2316-trunk-v3.txt - patch for trunk to have FilterLogicExpressionSimplifier disabled by default. Incorporates Koji's changes in v1.txt .
          Hide
          Thejas M Nair added a comment -

          what is the speedup in script runtime due to this rule?

          I think most queries would not benefit from this optimization rule. Among the one that do, I doubt if there is going to be a noticeable improvement in runtime. I am also not aware of any performance benchmarks that have been done for this rule.

          Show
          Thejas M Nair added a comment - what is the speedup in script runtime due to this rule? I think most queries would not benefit from this optimization rule. Among the one that do, I doubt if there is going to be a noticeable improvement in runtime. I am also not aware of any performance benchmarks that have been done for this rule.
          Hide
          Viraj Bhat added a comment -

          Thejas, I agree with your comments. I also agree that we should disable this optimization, for the next releases of Pig, since we are getting wrong results without any warnings? At the same time, what is the speedup in script runtime due to this rule?

          Show
          Viraj Bhat added a comment - Thejas, I agree with your comments. I also agree that we should disable this optimization, for the next releases of Pig, since we are getting wrong results without any warnings? At the same time, what is the speedup in script runtime due to this rule?
          Hide
          Thejas M Nair added a comment -

          The LogicalExpressionSimplifier rules are complex and it has a large number of them. This is the fourth bug originating from this optimization rule that causes correctness issues, and that is related to the complexity of this rule. It is hard to understand and maintain this code. I don't expect these rules to show enough performance gains to justify these costs (complexity, maintainability, chances of bugs).

          I think this rule should be disabled by default in 0.9.next and 0.10. In next versions of pig, we can extract simpler rules from this one and have more exhaustive test coverage before turning it on by default.

          Show
          Thejas M Nair added a comment - The LogicalExpressionSimplifier rules are complex and it has a large number of them. This is the fourth bug originating from this optimization rule that causes correctness issues, and that is related to the complexity of this rule. It is hard to understand and maintain this code. I don't expect these rules to show enough performance gains to justify these costs (complexity, maintainability, chances of bugs). I think this rule should be disabled by default in 0.9.next and 0.10. In next versions of pig, we can extract simpler rules from this one and have more exhaustive test coverage before turning it on by default.
          Hide
          Koji Noguchi added a comment -

          Resubmitting the original patch since I forgot to grant license on the original one.

          Show
          Koji Noguchi added a comment - Resubmitting the original patch since I forgot to grant license on the original one.
          Hide
          Thejas M Nair added a comment -

          Koji pointed out that the logical transformation in above case is actually correct. Deleting the v2 patch I submitted.

          Show
          Thejas M Nair added a comment - Koji pointed out that the logical transformation in above case is actually correct. Deleting the v2 patch I submitted.
          Hide
          Thejas M Nair added a comment -
          Applying pig-2316-trunk-v1.txt triggers another bug. For the following filter clause, note that filter plan in MR plan is incomplete.
           
          
          B = FILTER A BY ((col1==1) OR (col1 != 2));
          
          Filter in MR plan - 
          
          B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-11
          |
          |---B: Filter[bag] - scope-7
              |   |
              |   Not Equal To[boolean] - scope-10
              |   |
              |   |---Project[int][0] - scope-8
              |   |
              |   |---Constant(2) - scope-9
              |
              |---A: New For Each(false,false)[bag] - scope-6
          
          

          pig-2316-trunk-v2.txt has the fix for this issue.

          Show
          Thejas M Nair added a comment - Applying pig-2316-trunk-v1.txt triggers another bug. For the following filter clause, note that filter plan in MR plan is incomplete. B = FILTER A BY ((col1==1) OR (col1 != 2)); Filter in MR plan - B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-11 | |---B: Filter[bag] - scope-7 | | | Not Equal To[ boolean ] - scope-10 | | | |---Project[ int ][0] - scope-8 | | | |---Constant(2) - scope-9 | |---A: New For Each( false , false )[bag] - scope-6 pig-2316-trunk-v2.txt has the fix for this issue.
          Hide
          Koji Noguchi added a comment -

          It seems like 'equals' part is failing inside handleComparison method in LogicalExpressionSimplifier.

          else if (isEqual1 && isNotEqual2) {
              if (val1.equals(val2)) return Exclusive;
          

          Since val1,val2 are ConstantExpression, should it be calling
          val1.isEqual(val2)
          or
          val1.getValue().equals(val2.getValue()) ?

          This patch does the latter.
          (My understanding of pig is very limited so this could be way off.)

          Show
          Koji Noguchi added a comment - It seems like 'equals' part is failing inside handleComparison method in LogicalExpressionSimplifier. else if (isEqual1 && isNotEqual2) { if (val1.equals(val2)) return Exclusive; Since val1,val2 are ConstantExpression, should it be calling val1.isEqual(val2) or val1.getValue().equals(val2.getValue()) ? This patch does the latter. (My understanding of pig is very limited so this could be way off.)
          Hide
          Koji Noguchi added a comment -

          Our user gave us a more realistic testcase.
          Looks serious.

           
          -bash-3.2$ cat weird.txt 
          1,a
          1,c
          2,b
          3,c
          -bash-3.2$ cat test2.pig 
          A = LOAD 'weird.txt' using PigStorage(',') AS (col1:int,col2);
          B = FILTER A BY ((col1==1 AND col2=='c') OR (col1 != 1));
          DUMP B;
          
          -bash-3.2$ pig -x local  test2.pig  
          (2,b)
          (3,c)
          -bash-3.2$
          -bash-3.2$ pig -x local  -optimizer_off ALL test2.pig 
          (1,c)
          (2,b)
          (3,c)
          -bash-3.2$ 
          
          Show
          Koji Noguchi added a comment - Our user gave us a more realistic testcase. Looks serious. -bash-3.2$ cat weird.txt 1,a 1,c 2,b 3,c -bash-3.2$ cat test2.pig A = LOAD 'weird.txt' using PigStorage(',') AS (col1:int,col2); B = FILTER A BY ((col1==1 AND col2=='c') OR (col1 != 1)); DUMP B; -bash-3.2$ pig -x local test2.pig (2,b) (3,c) -bash-3.2$ -bash-3.2$ pig -x local -optimizer_off ALL test2.pig (1,c) (2,b) (3,c) -bash-3.2$
          Hide
          Koji Noguchi added a comment -

          Huanyu later discovered that result is only incorrect when both the left variables and the right values are the same.

          ((col1==1) OR (col1 != 1));

          Since this is not a common query, lowering priority.

          Show
          Koji Noguchi added a comment - Huanyu later discovered that result is only incorrect when both the left variables and the right values are the same. ((col1==1) OR (col1 != 1)); Since this is not a common query, lowering priority.

            People

            • Assignee:
              Koji Noguchi
              Reporter:
              Huanyu Zhao
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development