Pig
  1. Pig
  2. PIG-1482

Pig gets confused when more than one loader is involved

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In case of two relations being loaded using different loader, joined, grouped and projected, pig gets confused in trying to find appropriate loader for the requested cast. Consider the following script :-

      A = LOAD 'data1' USING PigStorage() AS (s, m, l);
      B = FOREACH A GENERATE s#'k1' as v1, m#'k2' as v2, l#'k3' as v3;
      C = FOREACH B GENERATE v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 :0) as v3:int;

      D = LOAD 'data2' USING TextLoader() AS (a);
      E = JOIN C BY v1, D BY a USING 'replicated';

      F = GROUP E BY (v1, a);
      G = FOREACH F GENERATE (chararray)group.v1, group.a;

      dump G;

      This throws the error, stack trace of which is in the next comment

      1. jira-1482-final.patch
        41 kB
        Xuefu Zhang
      2. jira-1482-final.patch
        32 kB
        Xuefu Zhang
      3. jira-1482-final.patch
        32 kB
        Xuefu Zhang
      4. jira-1482-final-1.patch
        42 kB
        Xuefu Zhang
      5. jira-1482-final-2.patch
        41 kB
        Xuefu Zhang

        Activity

        Hide
        Ankur added a comment -

        ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader]

        org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias K
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:391)
        Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias K
        at org.apache.pig.PigServer.store(PigServer.java:577)
        at org.apache.pig.PigServer.openIterator(PigServer.java:504)
        ... 6 more
        Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
        at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
        at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:89)
        at org.apache.pig.PigServer.validate(PigServer.java:930)
        at org.apache.pig.PigServer.compileLp(PigServer.java:884)
        at org.apache.pig.PigServer.store(PigServer.java:568)
        ... 7 more
        Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1053: Cannot resolve load function to use for casting from bytearray to chararray.
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1775)
        at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:67)
        at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:32)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2819)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2723)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
        ... 13 more
        Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader]
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3161)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176)
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)

        Show
        Ankur added a comment - ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader] org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias K at org.apache.pig.PigServer.openIterator(PigServer.java:521) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias K at org.apache.pig.PigServer.store(PigServer.java:577) at org.apache.pig.PigServer.openIterator(PigServer.java:504) ... 6 more Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:89) at org.apache.pig.PigServer.validate(PigServer.java:930) at org.apache.pig.PigServer.compileLp(PigServer.java:884) at org.apache.pig.PigServer.store(PigServer.java:568) ... 7 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1053: Cannot resolve load function to use for casting from bytearray to chararray. at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1775) at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:67) at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:32) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2819) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2723) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) ... 13 more Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader] at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3161) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)
        Hide
        Ankur added a comment -

        Casting early alleviates the problem. So this makes the above script work

        C = FOREACH B GENERATE (chararray) v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 :0) as v3:int;

        Show
        Ankur added a comment - Casting early alleviates the problem. So this makes the above script work C = FOREACH B GENERATE (chararray) v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 :0) as v3:int;
        Hide
        Ankur added a comment -

        forgot to add

        Include this change as well for the above script to work

        G = FOREACH F GENERATE group.v1, group.a;

        Show
        Ankur added a comment - forgot to add Include this change as well for the above script to work G = FOREACH F GENERATE group.v1, group.a;
        Hide
        Xuefu Zhang added a comment -

        Manual Hudson run result:
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] -1 release audit. The applied patch generated 424 release audit warnings (more than the trunk's current 423 warnings).

        The release audit diff was due to new public methods introduced in Schema.java.

        Show
        Xuefu Zhang added a comment - Manual Hudson run result: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 424 release audit warnings (more than the trunk's current 423 warnings). The release audit diff was due to new public methods introduced in Schema.java.
        Hide
        Xuefu Zhang added a comment -

        minor changes added based on review feedback.

        Show
        Xuefu Zhang added a comment - minor changes added based on review feedback.
        Hide
        Xuefu Zhang added a comment -

        The patch passed all unit tests and end-to-end tests.

        Show
        Xuefu Zhang added a comment - The patch passed all unit tests and end-to-end tests.
        Hide
        Thejas M Nair added a comment -

        Patch review comments -

        • Schema.java
                  public FieldSchema(String a, Schema s, byte t)  throws FrontendException {
                      alias = a;
                      schema = s;
                      log.debug("t: " + t + " Bag: " + DataType.BAG + " tuple: " + DataType.TUPLE);
                      
                      /*
                       * The following check is removed because it may not be always true. As a matter of 
                       * fact, the condition can be produced using other constructors anyway.
                       *
                      if ((null != s) && !(DataType.isSchemaType(t))) {
                          int errCode = 1020;
                          throw new FrontendException("Only a BAG or TUPLE can have schemas. Got "
                                  + DataType.findTypeName(t), errCode, PigException.INPUT);
                      }
                      */
          

          I think some other code paths might be relying on this constructor for error checking. It would be safer to create another constructor with a check boolean argument

                  public FieldSchema(String a, Schema s, byte t, boolean innerTypeCheck)  
          

          and call that from above constructor and from FieldSchema.copyAndLink(..)

        • In LOStream.java.getSchema() mIsSchemaComputed is used to keep track of whether the fieldschema parents have been set.
          I think it will be better to use a different variable for the purpose - it will be more readable, and also not likely to break any assumptions people are likely to make about this variable that is from the LogicalOperator class.
        • TypeCheckingVisitor.java insertCastForUDF is called on input of udf , it seems like same logic should be used for other expressions as well (instead of insertCast(.) ). Also, insertCastForUDF(..) and insertCast(..) have only two lines different, we can share rest of the code.
          private void insertCastForUDF(LOUserFunc udf,
              		FieldSchema fromFS, FieldSchema toFs, ExpressionOperator predecessor){
          
          toFs.setParent( fromFS.canonicalName, predecessor );
          insertCast(udf, toFs.type, toFs, predecessor);
          }
          
        • TypeCheckingVisitor.java In visit(LOCast), it seems like we can just pick any of the matching predecessor load functions, shouldn't we check if all the FuncSpec returned are the same ?
                      	for( Map.Entry<String, LogicalOperator> entry : canonicalMap.entrySet() ) {
                              FuncSpec loadFuncSpec = getLoadFuncSpec( entry.getValue(), entry.getKey() );
                              cast.setLoadFuncSpec( loadFuncSpec );
                      	}
          
        • LOProject.java
          the commented line can be removed -
          +//                         mFieldSchema.setParent(fs.canonicalName, expressionOperator);
          
        Show
        Thejas M Nair added a comment - Patch review comments - Schema.java public FieldSchema( String a, Schema s, byte t) throws FrontendException { alias = a; schema = s; log.debug( "t: " + t + " Bag: " + DataType.BAG + " tuple: " + DataType.TUPLE); /* * The following check is removed because it may not be always true . As a matter of * fact, the condition can be produced using other constructors anyway. * if (( null != s) && !(DataType.isSchemaType(t))) { int errCode = 1020; throw new FrontendException( "Only a BAG or TUPLE can have schemas. Got " + DataType.findTypeName(t), errCode, PigException.INPUT); } */ I think some other code paths might be relying on this constructor for error checking. It would be safer to create another constructor with a check boolean argument public FieldSchema( String a, Schema s, byte t, boolean innerTypeCheck) and call that from above constructor and from FieldSchema.copyAndLink(..) In LOStream.java.getSchema() mIsSchemaComputed is used to keep track of whether the fieldschema parents have been set. I think it will be better to use a different variable for the purpose - it will be more readable, and also not likely to break any assumptions people are likely to make about this variable that is from the LogicalOperator class. TypeCheckingVisitor.java insertCastForUDF is called on input of udf , it seems like same logic should be used for other expressions as well (instead of insertCast(.) ). Also, insertCastForUDF(..) and insertCast(..) have only two lines different, we can share rest of the code. private void insertCastForUDF(LOUserFunc udf, FieldSchema fromFS, FieldSchema toFs, ExpressionOperator predecessor){ toFs.setParent( fromFS.canonicalName, predecessor ); insertCast(udf, toFs.type, toFs, predecessor); } TypeCheckingVisitor.java In visit(LOCast), it seems like we can just pick any of the matching predecessor load functions, shouldn't we check if all the FuncSpec returned are the same ? for ( Map.Entry< String , LogicalOperator> entry : canonicalMap.entrySet() ) { FuncSpec loadFuncSpec = getLoadFuncSpec( entry.getValue(), entry.getKey() ); cast .setLoadFuncSpec( loadFuncSpec ); } LOProject.java the commented line can be removed - + // mFieldSchema.setParent(fs.canonicalName, expressionOperator);
        Hide
        Xuefu Zhang added a comment -

        Updated the patch based on the review comments.

        For comments above, the one next to the last, the map should only contain one entry. Before the result is obtained, exception is thrown anytime two different loadfunspec's are found. It was done that way before.

        Show
        Xuefu Zhang added a comment - Updated the patch based on the review comments. For comments above, the one next to the last, the map should only contain one entry. Before the result is obtained, exception is thrown anytime two different loadfunspec's are found. It was done that way before.
        Hide
        Thejas M Nair added a comment -

        Regarding jira-1482-final-1.patch
        +1 , just one comment from review of previous patch needs to be addressed -

        • As mentioned in 3rd comment, the code in insertCast() can be reused from insertCastForUdf()
          private void insertCastForUDF(LOUserFunc udf,
              		FieldSchema fromFS, FieldSchema toFs, ExpressionOperator predecessor){
          
               toFs.setParent( fromFS.canonicalName, predecessor );
               insertCast(udf, toFs.type, toFs, predecessor);
          }
          
        Show
        Thejas M Nair added a comment - Regarding jira-1482-final-1.patch +1 , just one comment from review of previous patch needs to be addressed - As mentioned in 3rd comment, the code in insertCast() can be reused from insertCastForUdf() private void insertCastForUDF(LOUserFunc udf, FieldSchema fromFS, FieldSchema toFs, ExpressionOperator predecessor){ toFs.setParent( fromFS.canonicalName, predecessor ); insertCast(udf, toFs.type, toFs, predecessor); }
        Hide
        Xuefu Zhang added a comment -

        Updated based on review comments above.

        Show
        Xuefu Zhang added a comment - Updated based on review comments above.
        Hide
        Thejas M Nair added a comment -

        Patch committed to trunk.
        Xuefu, thanks for the fix.

        Show
        Thejas M Nair added a comment - Patch committed to trunk. Xuefu, thanks for the fix.

          People

          • Assignee:
            Xuefu Zhang
            Reporter:
            Ankur
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development