Pig
  1. Pig
  2. PIG-1289

PIG Join fails while doing a filter on joined data

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      PIG Join fails while doing a filter on joined data

      Here are the steps to reproduce it:
      -bash-3.1$ pig -latest -x local
      grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, f2:chararray);
      grunt> DUMP a;
      (1,A)
      (2,B)
      (3,C)
      (4,D)

      grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
      grunt> DUMP b;
      (A)
      (D)
      (E)

      grunt> c = join a by f2 LEFT OUTER, b by f3;
      grunt> DUMP c;
      (1,A,A)
      (2,B,)
      (3,C,)
      (4,D,D)
      grunt> describe c;
      c:

      {a::f1: int,a::f2: chararray,b::f3: chararray}

      grunt> d = filter c by (f3 is null or f3 =='');

      grunt> dump d;
      2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for b
      2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for b
      2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for a
      2010-03-03 15:00:37,130 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for a
      2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias d

      This one is failing too:

      grunt> d = filter c by (b::f3 is null or b::f3 =='');

      or this one not returning results as expected:

      grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
      grunt> e = filter d by (f3 is null or f3 =='');
      grunt> DUMP e;
      (1,A,)
      (2,B,)
      (3,C,)
      (4,D,)

      while the expected result is
      (2,B,)
      (3,C,)

      1. PIG-1289-2.patch
        3 kB
        Daniel Dai
      2. PIG-1289-1.patch
        4 kB
        Daniel Dai

        Issue Links

          Activity

          Hide
          Daniel Dai added a comment -

          Unit test failure due to port conflict. Manual test successful. Patch committed.

          Show
          Daniel Dai added a comment - Unit test failure due to port conflict. Manual test successful. Patch committed.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438992/PIG-1289-2.patch
          against trunk revision 924034.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/253/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/253/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/253/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12438992/PIG-1289-2.patch against trunk revision 924034. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/253/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/253/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/253/console This message is automatically generated.
          Hide
          Daniel Dai added a comment -

          Oh, you are right. I need to further change the join type to regular join to make my idea work. Since changing the join type may broke some other part of the code, so let's be safe here and only push filter in front of a branch not generating null.

          Show
          Daniel Dai added a comment - Oh, you are right. I need to further change the join type to regular join to make my idea work. Since changing the join type may broke some other part of the code, so let's be safe here and only push filter in front of a branch not generating null.
          Hide
          Alan Gates added a comment -

          In the case of

          D = filter C by t > 0

          the filter will evaluate to null when t is null. By definition filters return only records that evaluate true. So t > 0 will have the affect of filtering out all outer records of A because t will be null for every one of them. That is, it turns the join into an inner join. However, if the filter is pushed above the join, it will remain an outer join, since it will only filter the records from B where t > 0 and not the outer records from A. Thus this transformation is not output neutral.

          Show
          Alan Gates added a comment - In the case of D = filter C by t > 0 the filter will evaluate to null when t is null. By definition filters return only records that evaluate true. So t > 0 will have the affect of filtering out all outer records of A because t will be null for every one of them. That is, it turns the join into an inner join. However, if the filter is pushed above the join, it will remain an outer join, since it will only filter the records from B where t > 0 and not the outer records from A. Thus this transformation is not output neutral.
          Hide
          Daniel Dai added a comment -

          Yes, it is safe not push filter up a branch that will be producing nulls. I might be wrong but what I did is try to be a little bit more aggressive. Since the only extra value outer join will produce is null, so if filter is not testing null, we can still push it up even if it is on the inner branch.

          Eg:
          A = load 'foo' as (q, r, s);
          B = load 'bar ' as (t, u, v);
          C = join A on q outer, B on t;
          D = filter C by t > 0;

          The production C consists of two parts:
          A + B
          A + "null"

          If we do a filter after join, it is a union on this two parts:
          filter(A + B) union filter(A + "null")

          If we are not testing nullability (eg, t > 0), then filter(A + "null") will not have any production, so
          filter(A + B) union filter(A + "null") = filter(A + B)

          In this case, outer join is equivalent as a regular join (since all generated null B records are filtered away), so we can still push the filter up.

          Show
          Daniel Dai added a comment - Yes, it is safe not push filter up a branch that will be producing nulls. I might be wrong but what I did is try to be a little bit more aggressive. Since the only extra value outer join will produce is null, so if filter is not testing null, we can still push it up even if it is on the inner branch. Eg: A = load 'foo' as (q, r, s); B = load 'bar ' as (t, u, v); C = join A on q outer, B on t; D = filter C by t > 0; The production C consists of two parts: A + B A + "null" If we do a filter after join, it is a union on this two parts: filter(A + B) union filter(A + "null") If we are not testing nullability (eg, t > 0), then filter(A + "null") will not have any production, so filter(A + B) union filter(A + "null") = filter(A + B) In this case, outer join is equivalent as a regular join (since all generated null B records are filtered away), so we can still push the filter up.
          Hide
          Alan Gates added a comment -

          I think I'm missing something here. It seems the criteria here should be that the filter is not pushed up a branch that will be producing nulls. So, given a script like:

          A = load 'foo' as (q, r, s);
          B = load 'bar ' as (t, u, v);
          C = join A on q outer, B on t;

          D = filter C by q > 0;

          is pushable, but

          D = filter C by t > 0;

          is not. I think the test that at most one element of the join is outer and it is the one we are considering meets this criteria. But I don't understand the need to check for null tests.

          Show
          Alan Gates added a comment - I think I'm missing something here. It seems the criteria here should be that the filter is not pushed up a branch that will be producing nulls. So, given a script like: A = load 'foo' as (q, r, s); B = load 'bar ' as (t, u, v); C = join A on q outer, B on t; D = filter C by q > 0; is pushable, but D = filter C by t > 0; is not. I think the test that at most one element of the join is outer and it is the one we are considering meets this criteria. But I don't understand the need to check for null tests.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438569/PIG-1289-1.patch
          against trunk revision 922169.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/246/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/246/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/246/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12438569/PIG-1289-1.patch against trunk revision 922169. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/246/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/246/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/246/console This message is automatically generated.
          Hide
          Daniel Dai added a comment -

          The problem is we push filter in front of the out join. We should not push filter when:
          1. filter is pushed to inner branch
          2. filter condition is to test nullability

          Show
          Daniel Dai added a comment - The problem is we push filter in front of the out join. We should not push filter when: 1. filter is pushed to inner branch 2. filter condition is to test nullability

            People

            • Assignee:
              Daniel Dai
              Reporter:
              Karim Saadah
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development