Pig
  1. Pig
  2. PIG-2563

IndexOutOfBoundsException: while projecting fields from a bag

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1, 0.10.0
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The below script fails with Pig 0.9 / Pig 0.10 but works fine for Pig 0.8.

      A = load 'i1' as (a,b,c:chararray);
      B = load 'i2' as (d,e,f:chararray);
      C = cogroup A by a, B by d;
      D = foreach C { 
        tmp = B.d;
        tmp_dis = distinct tmp;
        generate A,B,tmp_dis ; } ;
      E = foreach D generate B.(d,e) as v;
      dump E;
      

      The script fails with the below exception. Looks like DereferenceExpression is using wrong schema to build inner schema.

      java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
      at java.util.ArrayList.RangeCheck(ArrayList.java:547)
      at java.util.ArrayList.get(ArrayList.java:322)
      at org.apache.pig.newplan.logical.relational.LogicalSchema.getField(LogicalSchema.java:653)
      at org.apache.pig.newplan.logical.expression.DereferenceExpression.getFieldSchema(DereferenceExpression.java:167)
      at org.apache.pig.newplan.logical.relational.LOGenerate.getSchema(LOGenerate.java:88)
      at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:160)
      at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:242)

      1. PIG-2563-2.patch
        8 kB
        Daniel Dai
      2. PIG-2563-1.patch
        7 kB
        Daniel Dai

        Activity

        Hide
        Jonathan Coveney added a comment -

        Good catch! I've seen similar errors before, but was never able to isolate it. I actually isolated it with an even smaller snippet below:

        A = load 'i1' as (a,b,c:chararray);
        B = GROUP A BY a;
        C = foreach B { 
          tmp = A.a;
          generate A, tmp; } ;
        D = foreach C generate A.(a,b) as v;
        

        Perhaps related is the following, which has a weird parse error:

        A = load 'i1' as (a,b,c:chararray);
        B = GROUP A BY a;
        C = foreach B { 
          tmp = A.a;
          generate A; } ;
        D = foreach C generate A.(a,b) as v;
        
        Show
        Jonathan Coveney added a comment - Good catch! I've seen similar errors before, but was never able to isolate it. I actually isolated it with an even smaller snippet below: A = load 'i1' as (a,b,c:chararray); B = GROUP A BY a; C = foreach B { tmp = A.a; generate A, tmp; } ; D = foreach C generate A.(a,b) as v; Perhaps related is the following, which has a weird parse error: A = load 'i1' as (a,b,c:chararray); B = GROUP A BY a; C = foreach B { tmp = A.a; generate A; } ; D = foreach C generate A.(a,b) as v;
        Hide
        Daniel Dai added a comment -

        Logical plan for the foreach nested plan is wrong. Using Jonathan's first test case as example:

        ---C: (Name: LOForEach Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)}

        ,tmp#21:bag

        {#22:tuple(a#18:bytearray)}

        )

         
        (Name: LOGenerate[false,false] Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)}

        ,tmp#21:bag

        {#22:tuple(a#18:bytearray)}

        )

           
          A:(Name: Project Type: bag Uid: 21 Input: 0 Column: )
           
          tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: )
         
          ---A: (Name: LOInnerLoad[1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray)

        Here we only generate one LOInnerLoad. We should generate two LOInnerLoad instead.

        Show
        Daniel Dai added a comment - Logical plan for the foreach nested plan is wrong. Using Jonathan's first test case as example: ---C: (Name: LOForEach Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)} ,tmp#21:bag {#22:tuple(a#18:bytearray)} )   (Name: LOGenerate [false,false] Schema: A#21:bag {#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)} ,tmp#21:bag {#22:tuple(a#18:bytearray)} )       A:(Name: Project Type: bag Uid: 21 Input: 0 Column: )       tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: )     ---A: (Name: LOInnerLoad [1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray) Here we only generate one LOInnerLoad. We should generate two LOInnerLoad instead.
        Hide
        Daniel Dai added a comment -

        The plan is misformatted:

            |---C: (Name: LOForEach Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)})
                |   |
                |   (Name: LOGenerate[false,false] Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)})
                |   |   |
                |   |   A:(Name: Project Type: bag Uid: 21 Input: 0 Column: (*))
                |   |   |
                |   |   tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: (*))
                |   |
                |   |---A: (Name: LOInnerLoad[1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray)
        
        Show
        Daniel Dai added a comment - The plan is misformatted: |---C: (Name: LOForEach Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)}) | | | (Name: LOGenerate[ false , false ] Schema: A#21:bag{#22:tuple(a#18:bytearray,b#19:bytearray,c#20:chararray)},tmp#21:bag{#22:tuple(a#18:bytearray)}) | | | | | A:(Name: Project Type: bag Uid: 21 Input: 0 Column: (*)) | | | | | tmp:(Name: Project Type: bag Uid: 21 Input: 1 Column: (*)) | | | |---A: (Name: LOInnerLoad[1] Schema: a#18:bytearray,b#19:bytearray,c#20:chararray)
        Hide
        Daniel Dai added a comment -

        The real cause is the uid conflict in nested project.

        Show
        Daniel Dai added a comment - The real cause is the uid conflict in nested project.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Cheap code style comments:

        The if/else blocks should either be bracketed by {}s, or converted to
        fieldSchema.uid = needNewUid? LogicalExpression.getNextUid() : innerLoads.get(0).getProjection().getFieldSchema().uid;

        (Otherwise we get oddness when people try to edit the code and get confused about what is and isn't inside the if/else blocks).

        More expensive code content comments:

        Feels wrong to rely on checking for the parent, determining if it's a foreach, and returning a boolean to enforce id uniqueness. It's entirely possible that someone touches one of these operators without going through findReachableInnerLoad.... and still needs a unique id.

        Can we ensure the id is unique to begin with, when the tree is constructed?

        Show
        Dmitriy V. Ryaboy added a comment - Cheap code style comments: The if/else blocks should either be bracketed by {}s, or converted to fieldSchema.uid = needNewUid? LogicalExpression.getNextUid() : innerLoads.get(0).getProjection().getFieldSchema().uid; (Otherwise we get oddness when people try to edit the code and get confused about what is and isn't inside the if/else blocks). More expensive code content comments: Feels wrong to rely on checking for the parent, determining if it's a foreach, and returning a boolean to enforce id uniqueness. It's entirely possible that someone touches one of these operators without going through findReachableInnerLoad.... and still needs a unique id. Can we ensure the id is unique to begin with, when the tree is constructed?
        Hide
        Daniel Dai added a comment -

        Cheap code style comments

        sure will change

        More expensive code content comments

        Not sure if I completely understand your point, let me explain the design of foreach nested plan and why I make the change. Let me know if you need further explanation. Uid and schema inference process is very core to logical plan. If one changes anywhere in the process, he needs to make sure the existing functionality is not broken. In the patch, I change the way project infer its uid, because earlier, it does not generate new uid for the new bag after nested foreach. Here is how uid for foreach inner plan works:

        1. every foreach statement starts with LOInnerLoad, ends with LOGenerate
        2. simple foreach should keep uid, eg: foreach a generate $1, $2, we shall keep the uid for $1, $2, even if it is a bag column, there are couple of places make this assumption
        3. if input column is a bag, LOInnerLoad take the schema of its inner schema, eg, if $1 is bag#2 {t#3(x#4, y#5)}, LOInnerLoad will have the schema (x#4, y#5), it can be followed with nested operator
          # LOGenerate regenerates the bag after the inner operator pipeline, in this case, bag#2{t#3(x#4, y#5)}

          , we need to keep uid

        4. currently all nested operator does not change uid, except ForEach, that is the approach I took in the patch: unless see a ForEach, reuse uid

        Here are complete examples:

        b = foreach a generate a1, a2; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
        
        LOInnerLoad(a1:chararray)     LOInnerLoad(x#4, y#5)
                            \            /
                            LOGenerate(a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
        
        b = foreach a { c = filter a2 by x==1;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
        
        LOInnerLoad(a1:chararray)     LOInnerLoad(x#4, y#5)
                            \            /
                             \        LOFilter(x#4, y#5)
                              \        /
                            LOGenerate(a1:chararray#1, c:bag#2{t#3(x#4, y#5)})
        
        b = foreach a { c = a2.x;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
        
        LOInnerLoad(a1:chararray)     LOInnerLoad(x#4, y#5)
                            \            /
                             \        LOForEach(x#4)
                              \        /
                            LOGenerate(a1:chararray#1, c:bag#7{t#6(x#4)})
        
        Show
        Daniel Dai added a comment - Cheap code style comments sure will change More expensive code content comments Not sure if I completely understand your point, let me explain the design of foreach nested plan and why I make the change. Let me know if you need further explanation. Uid and schema inference process is very core to logical plan. If one changes anywhere in the process, he needs to make sure the existing functionality is not broken. In the patch, I change the way project infer its uid, because earlier, it does not generate new uid for the new bag after nested foreach. Here is how uid for foreach inner plan works: every foreach statement starts with LOInnerLoad, ends with LOGenerate simple foreach should keep uid, eg: foreach a generate $1, $2, we shall keep the uid for $1, $2, even if it is a bag column, there are couple of places make this assumption if input column is a bag, LOInnerLoad take the schema of its inner schema, eg, if $1 is bag#2 {t#3(x#4, y#5)}, LOInnerLoad will have the schema (x#4, y#5), it can be followed with nested operator # LOGenerate regenerates the bag after the inner operator pipeline, in this case, bag#2{t#3(x#4, y#5)} , we need to keep uid currently all nested operator does not change uid, except ForEach, that is the approach I took in the patch: unless see a ForEach, reuse uid Here are complete examples: b = foreach a generate a1, a2; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) LOInnerLoad(a1:chararray) LOInnerLoad(x#4, y#5) \ / LOGenerate(a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) b = foreach a { c = filter a2 by x==1;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) LOInnerLoad(a1:chararray) LOInnerLoad(x#4, y#5) \ / \ LOFilter(x#4, y#5) \ / LOGenerate(a1:chararray#1, c:bag#2{t#3(x#4, y#5)}) b = foreach a { c = a2.x;generate a1, c; }; (a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)}) LOInnerLoad(a1:chararray) LOInnerLoad(x#4, y#5) \ / \ LOForEach(x#4) \ / LOGenerate(a1:chararray#1, c:bag#7{t#6(x#4)})
        Hide
        Dmitriy V. Ryaboy added a comment -

        Daniel, I think I understand. But the method looks a bit off to me, codestyle-wise, as it moves logic that properly belongs elsewhere into itself and conflates what it's supposed to do (find a reachable inner load) with what you want to do with the thing it found (change the uid).

        Show
        Dmitriy V. Ryaboy added a comment - Daniel, I think I understand. But the method looks a bit off to me, codestyle-wise, as it moves logic that properly belongs elsewhere into itself and conflates what it's supposed to do (find a reachable inner load) with what you want to do with the thing it found (change the uid).
        Hide
        Daniel Dai added a comment -

        Hi, Dmitriy, yes I also check LOForEach along the way finding innerload. Otherwise I need to traverse the plan two times. Maybe I can change the name of the function to reveal this?

        Show
        Daniel Dai added a comment - Hi, Dmitriy, yes I also check LOForEach along the way finding innerload. Otherwise I need to traverse the plan two times. Maybe I can change the name of the function to reveal this?
        Hide
        Daniel Dai added a comment -

        Address Dmitriy's comment, put a comment to findReacheableInnerLoadFromBoundaryProject to describe the nature of the function.

        Show
        Daniel Dai added a comment - Address Dmitriy's comment, put a comment to findReacheableInnerLoadFromBoundaryProject to describe the nature of the function.
        Hide
        Thejas M Nair added a comment -

        +1

        Show
        Thejas M Nair added a comment - +1
        Hide
        Daniel Dai added a comment -

        test-patch:
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        javadoc warning is unrelated.

        Patch committed to 0.10/trunk.

        Show
        Daniel Dai added a comment - test-patch: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. javadoc warning is unrelated. Patch committed to 0.10/trunk.

          People

          • Assignee:
            Daniel Dai
            Reporter:
            Vivek Padmanabhan
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development