Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-2563

IndexOutOfBoundsException: while projecting fields from a bag

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.9.1, 0.10.0
    • 0.10.0, 0.11
    • None
    • None
    • Reviewed

    Description

      The below script fails with Pig 0.9 / Pig 0.10 but works fine for Pig 0.8.

      A = load 'i1' as (a,b,c:chararray);
      B = load 'i2' as (d,e,f:chararray);
      C = cogroup A by a, B by d;
      D = foreach C { 
        tmp = B.d;
        tmp_dis = distinct tmp;
        generate A,B,tmp_dis ; } ;
      E = foreach D generate B.(d,e) as v;
      dump E;
      

      The script fails with the below exception. Looks like DereferenceExpression is using wrong schema to build inner schema.

      java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
      at java.util.ArrayList.RangeCheck(ArrayList.java:547)
      at java.util.ArrayList.get(ArrayList.java:322)
      at org.apache.pig.newplan.logical.relational.LogicalSchema.getField(LogicalSchema.java:653)
      at org.apache.pig.newplan.logical.expression.DereferenceExpression.getFieldSchema(DereferenceExpression.java:167)
      at org.apache.pig.newplan.logical.relational.LOGenerate.getSchema(LOGenerate.java:88)
      at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:160)
      at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:242)

      Attachments

        1. PIG-2563-1.patch
          7 kB
          Daniel Dai
        2. PIG-2563-2.patch
          8 kB
          Daniel Dai

        Activity

          Good catch! I've seen similar errors before, but was never able to isolate it. I actually isolated it with an even smaller snippet below:

          A = load 'i1' as (a,b,c:chararray);
          B = GROUP A BY a;
          C = foreach B { 
            tmp = A.a;
            generate A, tmp; } ;
          D = foreach C generate A.(a,b) as v;
          

          Perhaps related is the following, which has a weird parse error:

          A = load 'i1' as (a,b,c:chararray);
          B = GROUP A BY a;
          C = foreach B { 
            tmp = A.a;
            generate A; } ;
          D = foreach C generate A.(a,b) as v;
          
          jcoveney Jonathan Coveney added a comment - Good catch! I've seen similar errors before, but was never able to isolate it. I actually isolated it with an even smaller snippet below: A = load 'i1' as (a,b,c:chararray); B = GROUP A BY a; C = foreach B { tmp = A.a; generate A, tmp; } ; D = foreach C generate A.(a,b) as v; Perhaps related is the following, which has a weird parse error: A = load 'i1' as (a,b,c:chararray); B = GROUP A BY a; C = foreach B { tmp = A.a; generate A; } ; D = foreach C generate A.(a,b) as v;
          daijy Daniel Dai added a comment -

          Logical plan for the foreach nested plan is wrong. Using Jonathan's first test case as example:

          ---C: (Name: LOForEach Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)}

          ,tmp#21:bag

          {#22:tuple(a#18:bytearray)}

          )

           
          (Name: LOGenerate[false,false] Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)}

          ,tmp#21:bag

          {#22:tuple(a#18:bytearray)}

          )

             
            A:(Name: Project Type: bag Uid: 21 Input: 0 Column: )
             
            tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: )
           
            ---A: (Name: LOInnerLoad[1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray)

          Here we only generate one LOInnerLoad. We should generate two LOInnerLoad instead.

          daijy Daniel Dai added a comment - Logical plan for the foreach nested plan is wrong. Using Jonathan's first test case as example: ---C: (Name: LOForEach Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)} ,tmp#21:bag {#22:tuple(a#18:bytearray)} )   (Name: LOGenerate [false,false] Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)} ,tmp#21:bag {#22:tuple(a#18:bytearray)} )       A:(Name: Project Type: bag Uid: 21 Input: 0 Column: )       tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: )     ---A: (Name: LOInnerLoad [1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray) Here we only generate one LOInnerLoad. We should generate two LOInnerLoad instead.
          daijy Daniel Dai added a comment -

          The plan is misformatted:

              |---C: (Name: LOForEach Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)})
                  |   |
                  |   (Name: LOGenerate[false,false] Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)})
                  |   |   |
                  |   |   A:(Name: Project Type: bag Uid: 21 Input: 0 Column: (*))
                  |   |   |
                  |   |   tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: (*))
                  |   |
                  |   |---A: (Name: LOInnerLoad[1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray)
          
          daijy Daniel Dai added a comment - The plan is misformatted: |---C: (Name: LOForEach Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)}) | | | (Name: LOGenerate[ false , false ] Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)}) | | | | | A:(Name: Project Type: bag Uid: 21 Input: 0 Column: (*)) | | | | | tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: (*)) | | | |---A: (Name: LOInnerLoad[1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray)
          daijy Daniel Dai added a comment -

          The real cause is the uid conflict in nested project.

          daijy Daniel Dai added a comment - The real cause is the uid conflict in nested project.

          Cheap code style comments:

          The if/else blocks should either be bracketed by {}s, or converted to
          fieldSchema.uid = needNewUid? LogicalExpression.getNextUid() : innerLoads.get(0).getProjection().getFieldSchema().uid;

          (Otherwise we get oddness when people try to edit the code and get confused about what is and isn't inside the if/else blocks).

          More expensive code content comments:

          Feels wrong to rely on checking for the parent, determining if it's a foreach, and returning a boolean to enforce id uniqueness. It's entirely possible that someone touches one of these operators without going through findReachableInnerLoad.... and still needs a unique id.

          Can we ensure the id is unique to begin with, when the tree is constructed?

          dvryaboy Dmitriy V. Ryaboy added a comment - Cheap code style comments: The if/else blocks should either be bracketed by {}s, or converted to fieldSchema.uid = needNewUid? LogicalExpression.getNextUid() : innerLoads.get(0).getProjection().getFieldSchema().uid; (Otherwise we get oddness when people try to edit the code and get confused about what is and isn't inside the if/else blocks). More expensive code content comments: Feels wrong to rely on checking for the parent, determining if it's a foreach, and returning a boolean to enforce id uniqueness. It's entirely possible that someone touches one of these operators without going through findReachableInnerLoad.... and still needs a unique id. Can we ensure the id is unique to begin with, when the tree is constructed?
          daijy Daniel Dai added a comment -

          Cheap code style comments

          sure will change

          More expensive code content comments

          Not sure if I completely understand your point, let me explain the design of foreach nested plan and why I make the change. Let me know if you need further explanation. Uid and schema inference process is very core to logical plan. If one changes anywhere in the process, he needs to make sure the existing functionality is not broken. In the patch, I change the way project infer its uid, because earlier, it does not generate new uid for the new bag after nested foreach. Here is how uid for foreach inner plan works:

          1. every foreach statement starts with LOInnerLoad, ends with LOGenerate
          2. simple foreach should keep uid, eg: foreach a generate $1, $2, we shall keep the uid for $1, $2, even if it is a bag column, there are couple of places make this assumption
          3. if input column is a bag, LOInnerLoad take the schema of its inner schema, eg, if $1 is bag#2 {t#3(x#4, y#5)}, LOInnerLoad will have the schema (x#4, y#5), it can be followed with nested operator
            # LOGenerate regenerates the bag after the inner operator pipeline, in this case, bag#2{t#3(x#4, y#5)}

            , we need to keep uid

          4. currently all nested operator does not change uid, except ForEach, that is the approach I took in the patch: unless see a ForEach, reuse uid

          Here are complete examples:

          b = foreach a generate a1, a2; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
          
          LOInnerLoad(a1:chararray)     LOInnerLoad(x#4, y#5)
                              \            /
                              LOGenerate(a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
          
          b = foreach a { c = filter a2 by x==1;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
          
          LOInnerLoad(a1:chararray)     LOInnerLoad(x#4, y#5)
                              \            /
                               \        LOFilter(x#4, y#5)
                                \        /
                              LOGenerate(a1:chararray#1, c:bag#2{t#3(x#4, y#5)})
          
          b = foreach a { c = a2.x;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
          
          LOInnerLoad(a1:chararray)     LOInnerLoad(x#4, y#5)
                              \            /
                               \        LOForEach(x#4)
                                \        /
                              LOGenerate(a1:chararray#1, c:bag#7{t#6(x#4)})
          
          daijy Daniel Dai added a comment - Cheap code style comments sure will change More expensive code content comments Not sure if I completely understand your point, let me explain the design of foreach nested plan and why I make the change. Let me know if you need further explanation. Uid and schema inference process is very core to logical plan. If one changes anywhere in the process, he needs to make sure the existing functionality is not broken. In the patch, I change the way project infer its uid, because earlier, it does not generate new uid for the new bag after nested foreach. Here is how uid for foreach inner plan works: every foreach statement starts with LOInnerLoad, ends with LOGenerate simple foreach should keep uid, eg: foreach a generate $1, $2, we shall keep the uid for $1, $2, even if it is a bag column, there are couple of places make this assumption if input column is a bag, LOInnerLoad take the schema of its inner schema, eg, if $1 is bag#2 {t#3(x#4, y#5)}, LOInnerLoad will have the schema (x#4, y#5), it can be followed with nested operator # LOGenerate regenerates the bag after the inner operator pipeline, in this case, bag#2{t#3(x#4, y#5)} , we need to keep uid currently all nested operator does not change uid, except ForEach, that is the approach I took in the patch: unless see a ForEach, reuse uid Here are complete examples: b = foreach a generate a1, a2; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) LOInnerLoad(a1:chararray) LOInnerLoad(x#4, y#5) \ / LOGenerate(a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) b = foreach a { c = filter a2 by x==1;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) LOInnerLoad(a1:chararray) LOInnerLoad(x#4, y#5) \ / \ LOFilter(x#4, y#5) \ / LOGenerate(a1:chararray#1, c:bag#2{t#3(x#4, y#5)}) b = foreach a { c = a2.x;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) LOInnerLoad(a1:chararray) LOInnerLoad(x#4, y#5) \ / \ LOForEach(x#4) \ / LOGenerate(a1:chararray#1, c:bag#7{t#6(x#4)})

          Daniel, I think I understand. But the method looks a bit off to me, codestyle-wise, as it moves logic that properly belongs elsewhere into itself and conflates what it's supposed to do (find a reachable inner load) with what you want to do with the thing it found (change the uid).

          dvryaboy Dmitriy V. Ryaboy added a comment - Daniel, I think I understand. But the method looks a bit off to me, codestyle-wise, as it moves logic that properly belongs elsewhere into itself and conflates what it's supposed to do (find a reachable inner load) with what you want to do with the thing it found (change the uid).
          daijy Daniel Dai added a comment -

          Hi, Dmitriy, yes I also check LOForEach along the way finding innerload. Otherwise I need to traverse the plan two times. Maybe I can change the name of the function to reveal this?

          daijy Daniel Dai added a comment - Hi, Dmitriy, yes I also check LOForEach along the way finding innerload. Otherwise I need to traverse the plan two times. Maybe I can change the name of the function to reveal this?
          daijy Daniel Dai added a comment -

          Address Dmitriy's comment, put a comment to findReacheableInnerLoadFromBoundaryProject to describe the nature of the function.

          daijy Daniel Dai added a comment - Address Dmitriy's comment, put a comment to findReacheableInnerLoadFromBoundaryProject to describe the nature of the function.
          thejas Thejas Nair added a comment -

          +1

          thejas Thejas Nair added a comment - +1
          daijy Daniel Dai added a comment -

          test-patch:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          javadoc warning is unrelated.

          Patch committed to 0.10/trunk.

          daijy Daniel Dai added a comment - test-patch: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. javadoc warning is unrelated. Patch committed to 0.10/trunk.

          People

            daijy Daniel Dai
            vivekp Vivek Padmanabhan
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: