Pig
  1. Pig
  2. PIG-2312

NPE when relation and column share the same name and used in Nested Foreach

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Unlinking from 0.11 for the same reason

      Description

      With Pig0.9, if a relation and a column has the same name and if the column is used in a nested foreach, the script execution fails
      while compiling.
      The below is the trace;

      java.lang.NullPointerException
      	at org.apache.pig.newplan.logical.visitor.ScalarVisitor$1.visit(ScalarVisitor.java:63)
      	at org.apache.pig.newplan.logical.expression.ScalarExpression.accept(ScalarExpression.java:109)
      	at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
      	at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
      	at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:142)
      	at org.apache.pig.newplan.logical.relational.LOSort.accept(LOSort.java:119)
      	at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
      	at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:104)
      	at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:74)
      	at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
      	at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
      	at org.apache.pig.PigServer$Graph.compile(PigServer.java:1674)
      	at org.apache.pig.PigServer$Graph.compile(PigServer.java:1666)
      	at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1391)
      	at org.apache.pig.PigServer.execute(PigServer.java:1293)
      	at org.apache.pig.PigServer.executeBatch(PigServer.java:359)
      	at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:131)
      	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:192)
      	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
      	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
      	at org.apache.pig.Main.run(Main.java:553)
      	at org.apache.pig.Main.main(Main.java:108)
      
      

      This could be reproduced with the below script

      f3 = load 'input.txt' as (a1:chararray);
      A = load '3char_1long_tab' as (f1:chararray, f2:chararray, f3:chararray,ct:long);
      
      B = GROUP A  BY f1;
      C =    FOREACH B {
              zip_ordered = ORDER A BY f3 ASC; 
              GENERATE
                      FLATTEN(group) AS f1,	
                      A.(f3, ct),
      		COUNT(zip_ordered),
                      SUM(A.ct) AS total;
        };
      
      STORE C INTO 'deletemeanytimeplease';
      
      

      Checked with a unit test in trunk, the behavior is still same.

      1. PIG-2312_1.patch
        2 kB
        Vivek Padmanabhan
      2. PIG-2312_2.patch
        10 kB
        Vivek Padmanabhan
      3. PIG-2312_3.patch
        18 kB
        Vivek Padmanabhan

        Activity

        Hide
        Vivek Padmanabhan added a comment -

        I think the logical plan generator is creating a ScalarExpression instead of ProjectExpression which causes this NPE.

        LogicalPlanGenerator.g
        Operator op = builder.lookupOperator( alias );
        if( op != null && ( schema == null || schema.getFieldPosition( alias ) == -1 ) )

        { $expr = new ScalarExpression( plan, op, inForeachPlan ? $foreach_clause::foreachOp : $GScope::currentOp ); $expr.setLocation( loc ); }


        The lookupOperator returns the f3 LOAD operator and schema.getFieldPosition( alias ) return -1 since it searches only the outer fields (group and A{} in this case )

        Show
        Vivek Padmanabhan added a comment - I think the logical plan generator is creating a ScalarExpression instead of ProjectExpression which causes this NPE. LogicalPlanGenerator.g Operator op = builder.lookupOperator( alias ); if( op != null && ( schema == null || schema.getFieldPosition( alias ) == -1 ) ) { $expr = new ScalarExpression( plan, op, inForeachPlan ? $foreach_clause::foreachOp : $GScope::currentOp ); $expr.setLocation( loc ); } The lookupOperator returns the f3 LOAD operator and schema.getFieldPosition( alias ) return -1 since it searches only the outer fields (group and A{} in this case )
        Hide
        Olga Natkovich added a comment -

        Did this used to work in 0.8?

        Show
        Olga Natkovich added a comment - Did this used to work in 0.8?
        Hide
        Vivek Padmanabhan added a comment -

        Oops, extremely sorry, I thought I had mentioned it in my previous comment.Yes, this works fine in 0.8

        Show
        Vivek Padmanabhan added a comment - Oops, extremely sorry, I thought I had mentioned it in my previous comment.Yes, this works fine in 0.8
        Hide
        Vivek Padmanabhan added a comment -

        Attempting for a patch for this issue.

        Show
        Vivek Padmanabhan added a comment - Attempting for a patch for this issue.
        Hide
        Thejas M Nair added a comment -

        The patch PIG-2312_1.patch disables use of scalar within operators inside a nested-foreach.
        For example, for the following query -

        sc_load = load 'x' as (x, y , z);
        
        l = load 'x' as (x,y , A : bag {t : ( i : int)} );
        f = foreach l { 
          fil = filter A by sc_load.$0 > $0 ;
          generate fil;
        }
        

        it throws the following exception -

        <line 5, column 20> Invalid field projection. Projected field [sc_load] does not exist in schema: i:int.
        
        Show
        Thejas M Nair added a comment - The patch PIG-2312 _1.patch disables use of scalar within operators inside a nested-foreach. For example, for the following query - sc_load = load 'x' as (x, y , z); l = load 'x' as (x,y , A : bag {t : ( i : int )} ); f = foreach l { fil = filter A by sc_load.$0 > $0 ; generate fil; } it throws the following exception - <line 5, column 20> Invalid field projection. Projected field [sc_load] does not exist in schema: i: int .
        Hide
        Thejas M Nair added a comment -

        The reason for the original bug is that the logic that determines if an alias used is to be mapped to a relation is not checking the input schema for the nested operation. It is looking for that alias in the schema of input relation of the foreach, instead of looking for the schema in the input relation for the order-by .

        Show
        Thejas M Nair added a comment - The reason for the original bug is that the logic that determines if an alias used is to be mapped to a relation is not checking the input schema for the nested operation. It is looking for that alias in the schema of input relation of the foreach, instead of looking for the schema in the input relation for the order-by .
        Hide
        Vivek Padmanabhan added a comment -

        Yes, I missed this basic case. Thanks for pointing out. I will cancel this patch.

        Show
        Vivek Padmanabhan added a comment - Yes, I missed this basic case. Thanks for pointing out. I will cancel this patch.
        Hide
        Olga Natkovich added a comment -

        Hi Vivek,

        Are you planning to re-submit the patch?

        Show
        Olga Natkovich added a comment - Hi Vivek, Are you planning to re-submit the patch?
        Hide
        Vivek Padmanabhan added a comment -

        Hi Olga,Sorry for the delayed response. I am attempting another patch for this issue.

        I have added some more test cases to cover the scenarios that I have faced while debugging.
        The patch includes changes in LogicalPlanGenerator.g and LogicalPlanBuilder (added a new map of operators).

        Apart from the general case mentioned in the above comment, I have tried to make the patch work with queries like;

        a)
        f3 = load ....
        A = load 'input' using PigStorage(',') as (f1:chararray, f2:chararray, f3:chararray,ct:long);
        B = GROUP A BY f1;
        C = FOREACH B { dis = DISTINCT A ;
        ordered = ORDER dis BY f3 ASC; .....

        b)
        f3 = load ....
        A = load 'input' using PigStorage(',') as (f1:chararray, f2:chararray, f3:chararray,ct:long);
        B = GROUP A BY f1;
        C = FOREACH B { f3 = DISTINCT A ;
        ordered = ORDER f3 BY f3 ASC; .....

        Show
        Vivek Padmanabhan added a comment - Hi Olga,Sorry for the delayed response. I am attempting another patch for this issue. I have added some more test cases to cover the scenarios that I have faced while debugging. The patch includes changes in LogicalPlanGenerator.g and LogicalPlanBuilder (added a new map of operators). Apart from the general case mentioned in the above comment, I have tried to make the patch work with queries like; a) f3 = load .... A = load 'input' using PigStorage(',') as (f1:chararray, f2:chararray, f3:chararray,ct:long); B = GROUP A BY f1; C = FOREACH B { dis = DISTINCT A ; ordered = ORDER dis BY f3 ASC; ..... b) f3 = load .... A = load 'input' using PigStorage(',') as (f1:chararray, f2:chararray, f3:chararray,ct:long); B = GROUP A BY f1; C = FOREACH B { f3 = DISTINCT A ; ordered = ORDER f3 BY f3 ASC; .....
        Hide
        Daniel Dai added a comment -

        This should be the right fix. It would be better to add a test case for cogroup as well.

        Show
        Daniel Dai added a comment - This should be the right fix. It would be better to add a test case for cogroup as well.
        Hide
        Daniel Dai added a comment -

        TestLimitVariable.testNestedLimitVariable1 fail with the patch. Can you take a look?

        Show
        Daniel Dai added a comment - TestLimitVariable.testNestedLimitVariable1 fail with the patch. Can you take a look?
        Hide
        Vivek Padmanabhan added a comment -

        The patch is colliding with the new scalar support for limit in nested foreach. Moreover, I found another case which is failing ;

        sc_load = load 'input1' as (x, y , z);
        l = load 'input2' as (x,y , A : bag {t : ( i : int)} );
        f = foreach l {
        lim = limit A  sc_load.$0;
        fil = filter lim by sc_load.$0 > $0 ; generate fil; }
        
        Show
        Vivek Padmanabhan added a comment - The patch is colliding with the new scalar support for limit in nested foreach. Moreover, I found another case which is failing ; sc_load = load 'input1' as (x, y , z); l = load 'input2' as (x,y , A : bag {t : ( i : int )} ); f = foreach l { lim = limit A sc_load.$0; fil = filter lim by sc_load.$0 > $0 ; generate fil; }
        Hide
        Gianmarco De Francisci Morales added a comment -

        Strange, the syntax of this last snippet should not be allowed as sc_load.$0 is a projection and not a scalar. Do you get an NPE also with this?

        Regarding the approach in the patch, I am not sure we want to disable the scalars in nested foreach. One of the use cases is trimming an inner bag to a fixed size for all the tuples in the outer bag.

        Show
        Gianmarco De Francisci Morales added a comment - Strange, the syntax of this last snippet should not be allowed as sc_load.$0 is a projection and not a scalar. Do you get an NPE also with this? Regarding the approach in the patch, I am not sure we want to disable the scalars in nested foreach. One of the use cases is trimming an inner bag to a fixed size for all the tuples in the outer bag.
        Hide
        Vivek Padmanabhan added a comment -

        The second case is failing with the patch attached. And it should not be disabling scalar in nested foreach. I am doing a rework on the patch.

        Show
        Vivek Padmanabhan added a comment - The second case is failing with the patch attached. And it should not be disabling scalar in nested foreach. I am doing a rework on the patch.
        Hide
        Vivek Padmanabhan added a comment -

        Attaching another patch to fix the failures. I am not sure whether I have complicated the code,but had to write a logic to find the actual input operator.
        (Ran test-core with this patch except for
        TestEvalPipeline2 and TestMergeJoin which are timing out in my env)

        Show
        Vivek Padmanabhan added a comment - Attaching another patch to fix the failures. I am not sure whether I have complicated the code,but had to write a logic to find the actual input operator. (Ran test-core with this patch except for TestEvalPipeline2 and TestMergeJoin which are timing out in my env)
        Hide
        Daniel Dai added a comment -

        The fix is quite involved and we need more time to figure the best way. Unlink from 0.10.

        Show
        Daniel Dai added a comment - The fix is quite involved and we need more time to figure the best way. Unlink from 0.10.
        Hide
        Alan Gates added a comment -

        Latest patch no longer applies to trunk.

        Show
        Alan Gates added a comment - Latest patch no longer applies to trunk.

          People

          • Assignee:
            Vivek Padmanabhan
            Reporter:
            Vivek Padmanabhan
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development