Pig / PIG-1065

Indeterminate behaviour of UNION when there are two non-matching schemas

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None

      Description

      I have a script which first does a UNION of two relations with non-matching schemas and then an ORDER BY of the result.

      f1 = LOAD '1.txt' as (key:chararray, v:chararray);
      f2 = LOAD '2.txt' as (key:chararray);
      u0 = UNION f1, f2;
      describe u0;
      dump u0;
      
      u1 = ORDER u0 BY $0;
      dump u1;
      

      When I run in MapReduce mode I get the following result:
      $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
      ====================
      Schema for u0 unknown.
      ====================
      (1,2)
      (2,3)
      (1)
      (2)
      ====================
      org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias u1
      at org.apache.pig.PigServer.openIterator(PigServer.java:475)
      at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
      at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
      at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
      at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
      at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
      at org.apache.pig.Main.main(Main.java:397)
      ====================
      Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
      ====================

      When I run the same script in local mode I get a different result, since local mode does not use any Hadoop classes.
      $java -cp pig.jar org.apache.pig.Main -x local broken.pig
      ====================
      Schema for u0 unknown
      ====================
      (1,2)
      (1)
      (2,3)
      (2)
      ====================
      (1,2)
      (1)
      (2,3)
      (2)
      ====================

      Here are some questions:
      1) Why do we allow UNION if the schemas do not match?
      2) Should we not print an error message or warning, so that the user knows this is not allowed or may produce unexpected results?

      Viraj


          Activity

          Pradeep Kamath added a comment -

          This is an instance of the problem of representing an unknown schema with a null schema. If the schema of a relational operator is null, Pig assumes the fields are of type bytearray, which is incorrect: an unknown schema really means we don't know the types of the fields. In the above case, once Pig determines that the two schemas have different sizes, it sets the schema of LOUnion to null (to represent an unknown schema). Hence the ORDER BY expects the fields coming out of the union to be bytearrays, but in reality the first field (which is the sort key above) is a chararray - this results in a runtime exception.

          I propose that when either of the two inputs to a union has a schema, we should error out if the two are incompatible and not continue. If neither input has a schema, then we can proceed with a null schema - thoughts?
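          The proposed rule can be sketched as follows (an illustrative Python model, not Pig internals; `merge_union_schemas` and the (name, type)-pair schema representation are hypothetical, and exact type equality stands in for the real compatibility check):

```python
def merge_union_schemas(left, right):
    """Illustrative model of the proposed rule. A schema is a list of
    (name, type) pairs, or None when unknown.

    - Neither input has a schema: proceed with an unknown (None) schema.
    - Only one input has a schema, or the two schemas are incompatible:
      error out at plan time instead of failing later in the backend.
    """
    if left is None and right is None:
        return None  # no schema information anywhere: unknown is honest
    if left is None or right is None:
        raise ValueError("UNION mixes an input with a schema and one without")
    if len(left) != len(right) or any(
            lt != rt for (_, lt), (_, rt) in zip(left, right)):
        raise ValueError("UNION of incompatible schemas: %s vs %s" % (left, right))
    return left
```

          Under this rule, the script in the description would fail at compile time with a schema error rather than with the backend `Type mismatch in key from map` exception.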

          Thejas M Nair added a comment -

          Can this be allowed (in case of incompatible schemas as in description) - u0 = UNION f1, f2 as (key:chararray, v:chararray); ?

          Pradeep Kamath added a comment -

          Currently the Pig parser does not allow specifying a schema for the union. If we do want to allow it, there are a few details which arise:
          1) From what I know, Pig currently allows specifying row schemas only in LOAD statements. Is this restriction by design? FOREACH only allows giving schemas at the individual-field level, and even there the type has to match, if I recollect right.
          2) If the input schemas are unequal in size, should the specified union schema size strictly be the MAX of the input schema sizes, with nulls being projected for missing columns in the input with the smaller schema? So a specified schema of size < MAX(size of input schemas) will not be allowed?
          3) Can the specified union schema have different (castable) types than the result of merging the two input schemas - for example, if the merged schema has int and the specified schema has long? What about demotions, like the merged schema being long and the specified one being int - would those be disallowed? I suppose we just allow whatever is allowed in casts.

          Santhosh Srinivasan added a comment -

          Answer to Question 1: Pig 1.0 had that syntax and it was retained for backward compatibility. Paolo suggested that for uniformity, the 'AS' clause for the load statements should be extended to all relational operators. Gradually, the column aliasing in the foreach should be removed from the documentation and eventually removed from the language.

          Alan Gates added a comment -

          As originally defined UNION does allow two inputs to be of different schema, the result of which should have no schema. So the error here is really that the runtime system doesn't adapt to the unexpected type. I'm not sure how useful this functionality is, but it is in keeping with the original spirit of Pig's no schema required, so I'm not inclined to fix it. At some point in the future we should consider how general we want to be in these cases, as there is significant cost for it. But let's decide it as a whole for the language rather than piecemeal.

          I would support the addition of a conforming union (I have no idea what to call it) that requires that each input have either the same schema or that the two schemas be somehow compatible. It would then handle type promotion and adding nulls for missing columns.
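          The data-level behaviour of such a conforming union might look like this (a minimal Python sketch; `conforming_union` is a hypothetical name, and type promotion is deliberately omitted):

```python
def conforming_union(rows_a, rows_b, width):
    """Sketch of a conforming union at the tuple level: rows from both
    inputs are padded with nulls (None) up to a common width, so the
    result can carry one well-defined schema."""
    def pad(row):
        # append nulls for the columns this input is missing
        return tuple(row) + (None,) * (width - len(row))
    return [pad(r) for r in rows_a] + [pad(r) for r in rows_b]
```

          With the description's inputs, f2's one-column tuples would come out as `(1, None)` and `(2, None)` alongside f1's two-column tuples.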

          Finally, in response to Santhosh's comment on AS, I think whether we extend AS to statements beyond LOAD depends on choices we make above about schemas or lack thereof. But removing it from the foreach isn't easy. Consider the following:

          B = foreach A generate $0, flatten($1), $2, flatten($3) AS (a:int, b:int, c:float, d:long, e:chararray);
          

          If we don't have a schema for $1 and $3, we don't know whether $2 should end up being c:float or d:long.

          Santhosh Srinivasan added a comment -

          The schema will then correspond to the prefix, as it is implemented today. For example, if the AS clause is defined for flatten($1), $1 flattens to 10 columns, and the AS clause has 3 columns, then the prefix is used and the remaining columns are left undefined.
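          That prefix behaviour can be modelled roughly as follows (illustrative Python; `apply_as_prefix` is a hypothetical name, not a Pig API):

```python
def apply_as_prefix(num_columns, as_names):
    """Model of the prefix rule: the first len(as_names) columns take the
    names from the AS clause; any remaining columns stay undefined (None)."""
    names = list(as_names) + [None] * max(0, num_columns - len(as_names))
    return names[:num_columns]
```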

          Dmitriy V. Ryaboy added a comment -

          Aliasing inside foreach is hugely useful for readability. Are you suggesting removing the ability to assign aliases inside a foreach, or just to change/assign schemas?

          Pradeep Kamath added a comment -

          As originally defined UNION does allow two inputs to be of different schema, the result of which should have no schema.

          I think leaving the above definition of UNION as the status quo will always lead to an error situation. Is there a use case for the above definition? In my opinion, with the addition of types and schemas to the language, we should be strict when at least one input has a schema and require that all inputs have a schema (this would be consistent, for example, with https://issues.apache.org/jira/browse/PIG-1064?focusedCommentId=12775976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12775976). In almost all cases other than an immediate store, allowing a mix of schema and no schema for inputs will result in completely unrelated error messages in the back end. I am wondering whether it is better to fix these issues on a case-by-case basis than to delay them for an overall language fix, since we may lose track of the many such cases and cause more bug reports while we wait. Thoughts?

          Santhosh Srinivasan added a comment -

          Aliasing inside foreach is hugely useful for readability. Are you suggesting removing the ability to assign aliases inside a foreach, or just to change/assign schemas?

          For consistency, all relational operators should support the AS clause. Gradually, the aliasing on a per column basis in foreach should be removed from the documentation, deprecated and eventually removed. This is a long term recommendation.

          Olga Natkovich added a comment -

          I don't think we should add yet another UNION, especially if we later decide that the union as it is defined now has no real value.

          Pradeep Kamath added a comment -

          To elaborate a little on the errors that arise, consider the script in the description of this issue. We have incompatible schemas, and hence we set the schema of the union to null (unknown). However, the first field actually is a chararray, as specified by the schemas. Hence there is a type mismatch in the backend (an unknown schema implies all fields are bytearrays).

          Another example is something like:

          A = load 'foo' as (c:chararray);
          B = load 'bar';
          C = Union A, B;
          D = group C by $0;
          

          In the above script too, we currently assign an unknown schema to C. However, the key of the GROUP BY will be bytearray when it comes from B and chararray when it comes from A.

          If we have to honor schema of any of the inputs of the union, we can only do so if we can have a legitimate schema for the union - currently in all incompatible schema situations, we essentially disregard input schemas and assume all fields are bytearrays.
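          In other words, the current fallback can be modelled as (illustrative Python; the function names are hypothetical, not actual Pig code):

```python
def current_union_schema(left, right):
    """Model of the behaviour described above: on any mismatch the union's
    schema becomes None (unknown), silently discarding the declared types."""
    if left is not None and left == right:
        return left
    return None

def assumed_type(schema, index):
    """Downstream operators treat every field of an unknown schema as bytearray,
    even when one input declared a real type for that field."""
    return schema[index][1] if schema is not None else "bytearray"
```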

          Olga Natkovich added a comment -

          Delaying this fix till we address PIG-998

          Daniel Dai added a comment -

          Fixed along with PIG-1277.


            People

            • Assignee: Alan Gates
            • Reporter: Viraj Bhat
            • Votes: 0
            • Watchers: 1
