Pig
  1. Pig
  2. PIG-858

Order By followed by "replicated" join fails while compiling MR-plan from physical plan

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.6.0
    • Component/s: impl
    • Labels:
      None

      Description

      Consider the query:

      A = load 'a';
      B = order A by $0;
      C = join A by $0, B by $0;
      explain C;
      

      works. But if replicated join is used instead

      A = load 'a';
      B = order A by $0;
      C = join A by $0, B by $0 using "replicated";
      explain C;
      

      this fails with ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2034: Error compiling operator POFRJoin
      relevant stacktrace:

      Caused by: java.lang.RuntimeException: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException: ERROR 2034: Error compiling operator POFRJoin
              at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:306)
              at org.apache.pig.PigServer.explain(PigServer.java:574)
              ... 8 more
      Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException: ERROR 2034: Error compiling operator POFRJoin
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:942)
              at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:173)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:342)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:327)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:233)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:301)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.explain(MapReduceLauncher.java:278)
              at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:303)
              ... 9 more
      Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:901)
              ... 16 more
      
      1. pig-858.patch
        14 kB
        Ashutosh Chauhan

        Activity

        Hide
        Ashutosh Chauhan added a comment -

        While POFRJoin is getting compiled in MRCompiler, it needs to identify for each of its
        predecessor in physical plan of which compiled MROperator they are part of. Currently, it is
        assumed to be one of the compiledInputs(an array of MRoper which are immediate predecessor of current MROper in MROper DAG).
        Mostly this is true, but in cases where one physical operator results in two or more MR operator, this may not be true, as is the
        case here. When there is an order-by before FRJoin; one of the inputs of POFRJoin will be
        POSort, but POSort operator will be in the first MROper of the two generated MROperator
        and thus will not be found in compiledInputs (which contains second MROper). Thus,
        current way of identifying corresponding MRoper of a physical operator is unreliable.
        This bug also affects the implementation of merge-sort join
        https://issues.apache.org/jira/browse/PIG-845 . Since POMergeJoin needs to know which MROper
        corresponds to its left input and which one corresponds to its right. It can do so by looking
        into compiledInputs as long as there is no order-by (or similiar PO which results in
        multiple MROper) as its predecessors. Doing order-by before using merge
        join is however a natural use-case there.

        Proposal is to introduce a new private member variable in MRCompiler phyToMROperMap
        (similiar to logToPhyMap) using which leaf MROper for a given
        physical operator can be identified. Thoughts?

        Show
        Ashutosh Chauhan added a comment - While POFRJoin is getting compiled in MRCompiler, it needs to identify for each of its predecessor in physical plan of which compiled MROperator they are part of. Currently, it is assumed to be one of the compiledInputs(an array of MRoper which are immediate predecessor of current MROper in MROper DAG). Mostly this is true, but in cases where one physical operator results in two or more MR operator, this may not be true, as is the case here. When there is an order-by before FRJoin; one of the inputs of POFRJoin will be POSort, but POSort operator will be in the first MROper of the two generated MROperator and thus will not be found in compiledInputs (which contains second MROper). Thus, current way of identifying corresponding MRoper of a physical operator is unreliable. This bug also affects the implementation of merge-sort join https://issues.apache.org/jira/browse/PIG-845 . Since POMergeJoin needs to know which MROper corresponds to its left input and which one corresponds to its right. It can do so by looking into compiledInputs as long as there is no order-by (or similiar PO which results in multiple MROper) as its predecessors. Doing order-by before using merge join is however a natural use-case there. Proposal is to introduce a new private member variable in MRCompiler phyToMROperMap (similiar to logToPhyMap) using which leaf MROper for a given physical operator can be identified. Thoughts?
        Hide
        Ashutosh Chauhan added a comment -

        Patch as discussed in previous comment. Also included are test cases, where blocking operator (order-by, distinct) occurs before FRjoin.

        Show
        Ashutosh Chauhan added a comment - Patch as discussed in previous comment. Also included are test cases, where blocking operator (order-by, distinct) occurs before FRjoin.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12419592/pig-858.patch
        against trunk revision 819057.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/47/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/47/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/47/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12419592/pig-858.patch against trunk revision 819057. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/47/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/47/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/47/console This message is automatically generated.
        Hide
        Alan Gates added a comment -

        I'm reviewing this patch.

        Show
        Alan Gates added a comment - I'm reviewing this patch.
        Hide
        Alan Gates added a comment -

        Mostly looks straight forward and passes all the tests. You made a number of changes in MRCompiler.visitUnion. I don't understand what exactly you were changing there. Could you give a brief overview of those changes?

        Show
        Alan Gates added a comment - Mostly looks straight forward and passes all the tests. You made a number of changes in MRCompiler.visitUnion. I don't understand what exactly you were changing there. Could you give a brief overview of those changes?
        Hide
        Ashutosh Chauhan added a comment -

        visitUnion has same changes as others visit functions, that is it adds MR Operator corresponding to POUnion in phyToMROpMap map. Real changes are in visitFRJoin. Earlier in visitFRJoin, it used to look in compiledInputs array of MROper one by one trying to match MROPer leaf PO with POFRJoin using operator key. Now, it doesn't need to do that it can simply lookup in the phyToMROpMap.

        Show
        Ashutosh Chauhan added a comment - visitUnion has same changes as others visit functions, that is it adds MR Operator corresponding to POUnion in phyToMROpMap map. Real changes are in visitFRJoin. Earlier in visitFRJoin, it used to look in compiledInputs array of MROper one by one trying to match MROPer leaf PO with POFRJoin using operator key. Now, it doesn't need to do that it can simply lookup in the phyToMROpMap.
        Hide
        Ashutosh Chauhan added a comment -

        Its been a while since I did that patch. So, bit more clarification: We are interested in finding PO which corresponds to "fragment" PO input of POFRJoin. This PO is already compiled and is in one the MROper. Earlier we will iterate through compiledInputs array trying to match this PO with PO contained in each MROperator. This fails as discussed in previous comments. With this change, since we keep track of MR operator with each physical operator it need not to do that but can simply look up for MROper corresponding to "fragment" PO in the phyToMROpMap.

        Show
        Ashutosh Chauhan added a comment - Its been a while since I did that patch. So, bit more clarification: We are interested in finding PO which corresponds to "fragment" PO input of POFRJoin. This PO is already compiled and is in one the MROper. Earlier we will iterate through compiledInputs array trying to match this PO with PO contained in each MROperator. This fails as discussed in previous comments. With this change, since we keep track of MR operator with each physical operator it need not to do that but can simply look up for MROper corresponding to "fragment" PO in the phyToMROpMap.
        Hide
        Alan Gates added a comment -

        Fix checked in. Thanks Ashutosh for the patch and explanation.

        Show
        Alan Gates added a comment - Fix checked in. Thanks Ashutosh for the patch and explanation.

          People

          • Assignee:
            Ashutosh Chauhan
            Reporter:
            Ashutosh Chauhan
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development