Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None

      Description

      If the schema for a field of type 'bag' is only partially defined, FLATTEN() incorrectly eliminates the field and throws an error.
      Consider the following example:

      A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, ladder:bag{});
      B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second;
      C = GROUP B by (first,third);

      This throws the error:
      ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: third in {first: chararray,second: chararray}

          Activity

          Pradeep Kamath added a comment -

          Pig doesn't handle partial schemas well - the fix for this issue will depend on how we want to treat unknown schemas. I did verify that this works when the schema specified is complete:

          A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, ladder:bag{t:tuple(x:int)});
          B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second;
          C = GROUP B by (first,third);
          describe C;
          

          Here's the output:
          C: {group: (first: chararray,third: int),B: {first: chararray,third: int,second: chararray}}

          Olga Natkovich added a comment -

          Moving out of 0.6.0 release. The right way to run this query is to specify the complete schema for the bag. We are not sure how we should be dealing with partial schemas and need to figure out the overall strategy before fixing individual issues.

          Alan Gates added a comment -

          In the example above, the user specified that he expects two fields to come out of the flatten of ladder. This seems equivalent to saying A = load 'ladder' as (third, second). So I propose that when users give field names (and possibly types) in an AS that is attached to a flatten, Pig takes that to be the schema of the flattened data.

          Daniel Dai added a comment -

          In current trunk, the schema for B becomes:
          B: {first: chararray,third: bytearray,second: chararray}

          The alias for FLATTEN(ladder) is right, but we need to decide whether to mandate bytearray as the type of "third" or to treat the entire schema for B as unknown.

          Alan Gates added a comment -

          Daniel, I don't understand the choice here. I think we agreed that if the user specifies (third, second) as the schema then we take that to mean there are two bytearray fields and we project them to guarantee this. So

          B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second;
          

          will now be equivalent to

          Bprime = FOREACH A GENERATE first,FLATTEN(ladder);
          B = FOREACH Bprime GENERATE first, $1 as third, $2 as second;
          
          Daniel Dai added a comment -

          In current trunk,

          Bprime = FOREACH A GENERATE first,FLATTEN(ladder);
          B = FOREACH Bprime GENERATE first, $1 as third, $2 as second;
          

          is equivalent to

          B = FOREACH A GENERATE first,FLATTEN(ladder) as (third,second);
          

          Which I think is right.
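
          For readers comparing the forms used in this thread, here is a minimal sketch of the two ways the AS clause can be attached to the flatten. It assumes a release that includes this fix and reuses the 'sample' input from the description; the aliases B1 and B2 are hypothetical names introduced only for illustration. Without parentheses, AS names just the flattened column and `second` is read as the field from A; with parentheses, AS lists the fields expected out of the flatten itself.

          ```pig
          -- 'sample' is the input from the description; ladder is a bag with no inner schema.
          A = LOAD 'sample' USING PigStorage() AS (first:chararray, second:chararray, ladder:bag{});

          -- Form used in the description: AS applies only to the flattened column,
          -- and `second` is a separate GENERATE item taken from A.
          B1 = FOREACH A GENERATE first, FLATTEN(ladder) AS third, second;

          -- Parenthesized form from the comment above: AS declares the fields
          -- expected to come out of the flatten (B2 is a hypothetical alias).
          B2 = FOREACH A GENERATE first, FLATTEN(ladder) AS (third, second);

          -- Inspect the schemas Pig infers for each form.
          DESCRIBE B1;
          DESCRIBE B2;
          ```

          Running DESCRIBE on both aliases is a quick way to check which interpretation a given Pig version applies.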

          Daniel Dai added a comment -

          It is fixed on the current trunk.


            People

            • Assignee: Daniel Dai
            • Reporter: Ankur
            • Votes: 0
            • Watchers: 0
