PIG-1693

support project-range expression. (was: There needs to be a way in foreach to indicate "and all the rest of the fields" )

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None
    • Release Note:

      Project-range ( '..' ) can be used to project a range of columns from the input.
      For example, the expressions -
      .. $x : projects columns $0 through $x, inclusive
      $x .. : projects columns $x through end, inclusive
      $x .. $y : projects columns $x through $y, inclusive
      If the input relation has a schema, you can also use column aliases instead of referring to columns by position. You can also combine aliases and column positions in a project-range expression (e.g., "col1 .. $5" is valid).


      This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a udf argument. Support for that use case will be added in PIG-1938.
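The three forms above can be sketched in Python as follows. This is only an illustration of the stated semantics, not Pig's implementation: the function name `expand_range` and the `num_columns` parameter are hypothetical, and the sketch assumes the number of input columns is known.

```python
def expand_range(start, end, num_columns):
    """Expand a project-range into a list of 0-based column positions.

    start / end are column positions, or None when that side of '..'
    is omitted. Both ends are inclusive, as in the release note.
    """
    first = 0 if start is None else start               # ".. $x" starts at $0
    last = num_columns - 1 if end is None else end      # "$x .." runs to the end
    if not 0 <= first <= last < num_columns:
        raise ValueError("range %s .. %s is outside 0..%d" % (start, end, num_columns - 1))
    return list(range(first, last + 1))


print(expand_range(None, 2, 6))  # .. $2    -> [0, 1, 2]
print(expand_range(3, None, 6))  # $3 ..    -> [3, 4, 5]
print(expand_range(1, 3, 6))     # $1 .. $3 -> [1, 2, 3]
```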

      It can be used in following statements -
      - foreach
      - join
      - order (also when it is within a nested foreach block)
      - group/cogroup

      Examples -
      {code}
      grunt> F = foreach IN generate (int)col0, col1 .. col3;
      grunt> describe F;
      F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
      {code}
      {code}
      grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
      {code}
      {code}
      J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
      {code}
      {code}
      g = group l1 by b .. c;
      {code}

      Limitations:
      There are some restrictions on the use of project-to-end form of project range (eg "x .. ") when input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

      1. In Cogroup/Group statements, project-to-end form of project-range is only allowed if the input has a schema

      2. In order-by statement, project-to-end form of project-range is supported only as last sort column, if input schema is null.
      example-
      {code}
      grunt> describe IN;
      Schema for IN unknown.

      -- Following statement is supported
      SORT = order IN by $2 .. $3, $6 ..;

      -- Following statement is NOT supported (project-to-end is not the last sort column)
      SORT = order IN by $2 .., $6;
      {code}
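The order-by restriction (limitation 2) can be sketched as a small check in Python. This is an assumption-laden illustration, not Pig's validation code: `check_order_by` is a hypothetical name, and sort columns are modelled as plain strings.

```python
def check_order_by(sort_columns, schema_known):
    """Return True if the sort columns satisfy the order-by restriction.

    sort_columns: list of strings such as "$2 .. $3" or "$6 ..".
    schema_known: False when 'describe' reports an unknown schema.
    """
    if schema_known:
        return True  # the restriction only applies when the schema is null
    last_index = len(sort_columns) - 1
    for i, col in enumerate(sort_columns):
        is_project_to_end = col.strip().endswith("..")
        if is_project_to_end and i != last_index:
            return False  # project-to-end must be the last sort column
    return True


print(check_order_by(["$2 .. $3", "$6 .."], schema_known=False))  # True
print(check_order_by(["$2 ..", "$6"], schema_known=False))        # False
```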


      Description

      A common use case we see in Pig is that people have many columns in their data and only want to operate on a few of them. Consider, for example, a user who wants to cast one column before storing data with ten columns:

      ...
      Z = foreach Y generate (int)firstcol, secondcol, thridcol, forthcol, fifthcol, sixthcol, seventhcol, eigthcol, ninethcol, tenthcol;
      store Z into 'output';
      

      Obviously this only gets worse as the user has more columns. Ideally the above could be transformed to something like:

      ...
      Z = foreach Y generate (int)firstcol, "and all the rest";
      store Z into 'output';
      
      1. PIG-1693.2.patch
        178 kB
        Thejas M Nair
      2. PIG-1693.1.patch
        178 kB
        Thejas M Nair

        Issue Links

          Activity

          Thejas M Nair added a comment -

          > Not sure how this will interact with JOIN and others (which was one of the rationales for forced project I guess ?),

          "$3.." (i.e., project-to-end) without a schema will work with join, but not with group or cogroup. (This limitation is documented in the release notes of this JIRA.)

          Mridul Muralidharan added a comment -

          Thanks for clarifying Thejas !
          Not sure how this will interact with JOIN and others (which was one of the rationales for forced project I guess ?), but this perfectly fits our usecases - along with a few in coke I guess.

          • Mridul
          Thejas M Nair added a comment -

          > a) $3.. works for an unspecified number of columns when there is no load schema ?

          Yes, "$3 .." works for an unspecified number of columns.
          This is similar to the way project-star ("*") works without an input schema. Since Pig does not know how many columns there will be, the expansion happens at runtime. In all other cases, the project-range expression is expanded before the query plan is generated.

          > b) or, $3..$MAX is required ? (so we should be schema aware).

          No, this is not required.
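The difference between up-front and runtime expansion can be sketched in Python. This is only a model of the behaviour described in this comment, not the actual Pig code: `compile_range` and `schema_size` are hypothetical names, and rows are modelled as tuples.

```python
def compile_range(start, end, schema_size=None):
    """Return a function that projects a range out of a row (a tuple).

    If the end of the range is known (an explicit end, or a known schema
    size), the column positions are fixed before any row is seen.
    Otherwise (project-to-end with no schema) the slice is taken per row.
    """
    first = 0 if start is None else start
    if end is not None or schema_size is not None:
        last = end if end is not None else schema_size - 1
        positions = list(range(first, last + 1))          # expanded up front
        return lambda row: tuple(row[i] for i in positions)
    return lambda row: tuple(row[first:])                 # expanded at runtime


to_end = compile_range(3, None)                 # models "$3 .." with no schema
print(to_end(("a", "b", "c", "d", "e")))        # ('d', 'e')
print(to_end(("a", "b", "c", "d", "e", "f")))   # ('d', 'e', 'f')
```

Rows of different widths are handled by the same runtime projection, which is why no "$3..$MAX" form is needed in this model.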

          Mridul Muralidharan added a comment -

          I am not sure what the comment means - do you mean (in the example above) :
          a) $3.. works for an unspecified number of columns when there is no load schema ?
          b) or, $3..$MAX is required ? (so we should be schema aware).

          Or do you simply mean '..' works when there is no loader schema (which I assumed it would anyway) without commenting on the actual usecase I refer to above ?

          Thanks,
          Mridul

          Daniel Dai added a comment -

          Yes, range projection works without schema as well.

          Mridul Muralidharan added a comment -

          This is a great feature addition.
          Hopefully, the mess created by forcefully projecting only the fields referenced in the schema/schema (when there is no schema specified) can be alleviated without needing a dummy schema with 10+ fields at times (at least, it will make it easier I hope) !

          Just curious about one aspect.
          If you do something like :

          A = LOAD '<path>' USING MyLoader();
          B = FOREACH A GENERATE $0, $3..;
          STORE B USING MyStore();

          Do we still need a schema to 'con' pig into projecting all the fields ? This is particularly relevant when the number of fields is high (or might be 'fuzzy' at times.)
          An earlier version of pig (still ?) introduced an implicit project which forced projection of only the referenced fields (in case the schema is not specified) or strict adherence to the specified schema, dropping the rest of the fields from the tuple.

          At least with this change, I hope, we can do something like this to alleviate the issue :

          A = LOAD '<path>' USING MyLoader();
          B = FOREACH A GENERATE $0, $3..$64;
          STORE B USING MyStore();

          Thanks for clarifying.

          Thejas M Nair added a comment -

          Patch committed to trunk.

          Thejas M Nair added a comment -

          PIG-1693.2.patch - addressing review comments.
          Unit tests pass.
          Test-patch results -
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 15 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] -1 javac. The applied patch generated 958 javac compiler warnings (more than the trunk's current 941 warnings).
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          The additional javac warnings are from code generated by antlr.

          Thejas M Nair added a comment -

          > 3. An observation. The ProjectExpression class seems to be getting a little overloaded. We might need to consider subclassing it to take care of STAR, RANGE, etc., though it doesn't have to happen now.

          I will re-examine the design when I work on PIG-1938, which adds support for project-range as udf argument.

          Daniel Dai added a comment -

          One minor comment: it is better to change ProjectExpression.toString to print in the format [x..y], [..y], [x..] for range, which is consistent with the grammar.

          Daniel Dai added a comment -

          +1 for the other part (non parser part) of the patch.

          Xuefu Zhang added a comment -

          I have reviewed the parser related changes:

          1. in LogicalPlanGenerator.g

          $expr = builder.buildRangeProjectExpr(
              loc, plan, $GScope::currentOp,
              $statement::inputIndex,
              startExpr == null ? null : startExpr.expr,
              endExpr == null ? null : endExpr.expr
          );

          Instead of "startExpr == null ? null : startExpr.expr", just use $startExpr.expr.

          2. LogicalPlanBuilder.java

          try {
              plan.removeAndReconnect(startExpr);
              plan.removeAndReconnect(endExpr);
          } catch (FrontendException e) {
              throw new ParserValidationException(intStream, loc, e);
          }

          It is probably better to check whether startExpr and endExpr are null.

          3. An observation. The ProjectExpression class seems to be getting a little overloaded. We might need to consider subclassing it to take care of STAR, RANGE, etc., though it doesn't have to happen now.

          Thejas M Nair added a comment -

          PIG-1693.1.patch
          Highlights -

          • ProjectExpression in logical plan now supports project-range
          • ProjectStarExpander is called from LogicalPlanBuilder while building foreach, group, join, or sort expression plans, to expand the project-range expression.
          • ProjectStarExpander expands all project-range expressions, except project-to-end (eg. $5 ..) when input schema is null. This is the only case when project-range expression is seen by logical optimizers or the physical plan.
          • Some of the logical optimizer rules have changed to consider project-to-end use cases.
          • POProject supports project-to-end expression, and project-star is a special case of project-to-end.
          • MRCompiler and some MR optimizer rules have changed to handle project-to-end case of POProject
          Thejas M Nair added a comment -

          > If this doesn't work with named aliases, it's almost useless for me. Numbered references are not maintainable,

          Alan's proposal in his comment dated '26/Oct/10 16:27' works with named aliases as well.
          I am planning to go work on that proposal.

          The use of "*" is supported in cogroup, order-by and join statements as well, so I am planning to keep it consistent and support this syntax in those statements as well.

          > *+ would mean "all columns not referenced"

          In this initial implementation I am planning to support only 'all columns in range'. If there is enough interest for 'all columns not referenced' feature that can be added later.

          Eric Yang added a comment -

          *+ and *- could have potential readability problems. It is easy for a user to confuse them with mathematical operations at first glance. I think using ".." would be a better choice.

          It should be possible to write as:

          Z = foreach Y generate myUDF(firstcol, secondcol, thirdcol) as result, forthcol .. tenthcol;
          Z = foreach Y generate firstcol, forthcol .. tenthcol;
          

          Another approach: it could be written in UDF style.

          Z = foreach Y generate myUDF(firstcol, secondcol, thirdcol) as result, mirror(forthcol, tenthcol);
          Z = foreach Y generate firstcol, mirror(forthcol, tenthcol);
          
          Scott Carey added a comment -

          If this doesn't work with named aliases, it's almost useless for me. Numbered references are not maintainable: what happens when you add a column to a complex flow? Or if you remove one? Suddenly you are adding numbers to statements or decrementing numbers all over the place.

          Y has 10 named columns, with full schemas.

          Use case 1, operate on subset:

          Z = foreach Y generate myUDF(firstcol, secondcol, thridcol) as result, forthcol, fifthcol, sixthcol, seventhcol, eigthcol, ninethcol, tenthcol;
          

          Use case 2, remove a subset:

          Z = foreach Y generate firstcol, forthcol, fifthcol, sixthcol, seventhcol, eigthcol, ninethcol, tenthcol;
          

          Why not just make the * operator have a few different forms or use a new operator?

          Use case 1 becomes:

          Z = foreach Y generate myUDF(firstcol, secondcol, thridcol) as result, *+;
          

          *+ would mean "all columns not referenced"

          Use case 2 becomes:

          Z = foreach Y generate  *- (secondcol, thirdcol);
          

          and *- generates all columns other than the set right after it.

          I'm not saying these are the best operators or syntax, but syntax that did not involve number ranges and simply 'works' for 'generate all that have not been referenced' and 'generate all excluding (set of aliases)' would be awesome. I definitely don't want to be counting aliases to discover that fieldFoo is the 23rd alias and fieldBar is the 29th.

          There are a lot of problems with ranges combined with names. And you still have to keep track of the count of columns, which isn't fun when there are 40. A "shared" alias uses names so that scripts that consume it never have to change if the alias adds columns; if it removes columns, only scripts that used that field have to change.

          Milind Bhandarkar added a comment -

          +1 to Alan's last comment.

          Alan Gates added a comment -

          > If we go with "..", then can we mandate that both the beginning and end indexes are mandatory ? That will avoid the ambiguity in your last example.

          As you suggested above, I think we should support 3 cases:

          ..$x – $0 through $x, inclusive
          $x.. – $x through end, inclusive
          $x..$y – $x through $y, inclusive

          The one change I made from your syntax is keeping the '$' attached to the positional variables, because this should be legal by alias too. So if one has a schema (alpha, beta, gamma, delta, epsilon)

          ..gamma
          gamma..
          beta..delta

          would all be legal too.
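A minimal Python sketch of how aliases and positions could be resolved against the example schema. It is an illustration only, not the parser from the patch: `resolve` and `expand` are hypothetical helper names, and the expression is handled as a plain string.

```python
def resolve(ref, schema):
    """Turn a positional reference such as '$2', or an alias, into a 0-based position."""
    if ref.startswith("$"):
        return int(ref[1:])
    return schema.index(ref)  # raises ValueError for an unknown alias


def expand(expr, schema):
    """Expand 'start..end' (either side optional) into the covered aliases, inclusive."""
    start, _, end = expr.partition("..")
    start, end = start.strip(), end.strip()
    first = resolve(start, schema) if start else 0
    last = resolve(end, schema) if end else len(schema) - 1
    return schema[first:last + 1]


schema = ["alpha", "beta", "gamma", "delta", "epsilon"]
print(expand("..gamma", schema))      # ['alpha', 'beta', 'gamma']
print(expand("gamma..", schema))      # ['gamma', 'delta', 'epsilon']
print(expand("beta..delta", schema))  # ['beta', 'gamma', 'delta']
print(expand("beta .. $3", schema))   # ['beta', 'gamma', 'delta']
```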

          Santhosh Srinivasan added a comment -

          Please ignore my comment. I was thinking about the use of handling 'n' columns in a record of size 'm' where m >> n

          Alan Gates added a comment -

          Santhosh, I don't see how drop meets the use case. I want to cast one column and leave all the rest the same. I don't want to drop it.

          Milind Bhandarkar added a comment -

          If we go with "..", then can we mandate that both the beginning and end indexes are mandatory ? That will avoid the ambiguity in your last example.

          Alan Gates added a comment -

          The point that '...' is used for varargs and thus may be confusing is a valid one. Perhaps '..' would be a better choice since it is used in both Perl and Ruby. I still don't like ':'.

          Whichever one we choose, syntax and semantics (as suggested by Olga and Milind) seem good.

          Milind Bhandarkar added a comment -

          Talked to Olga and Thejas offline. Told them my reservations about "...".
          Ranges are a well-established concept in scripting languages.
          For example, Perl array slicing uses "..", Python uses ":".
          ... is used for varargs, which means any number of arguments, and does not define a range.

          So, ".." (notice, two dots, not three) can be considered.

          Basically, a range is specified by a beginning and an end.
          If beginning is omitted, then 0 is assumed.
          If end is omitted, then max_index(range) is assumed.
          If we use ':', then omitting the beginning or end does not look as odd as it does with "..".

          To give you an example, if I want to specify all fields after 3, there are two choices.

          $4.., or $4:

          If I want to specify all the fields upto field 6,

          $..6, or $:6

          If I want to specify fields between 3 and 10,

          $3..10 or $3:10.

          Please choose between .. and :.

          Milind Bhandarkar added a comment -

          Is there a pig philosophy stated somewhere to make pig a "write-only" language ?

          Does anyone else feel that putting ... in the statements looks like you are omitting irrelevant stuff ?

          Santhosh Srinivasan added a comment -

          Why don't we add a drop columns feature? Then we could do the following for the use case stated in the ticket description.

          Z = foreach Y drop a, b, c;
          Z1 = foreach Z generate *; 
          
          Olga Natkovich added a comment -

          I like '...' as well. In the ambiguous foreach example, I would suggest that we require the user to provide start and end rather than making our own rules.

          Milind Bhandarkar added a comment -

          I prefer colon. (It's one keystroke, instead of the three you propose.) It can represent ranges very well, and without any ambiguity.

          e.g. $:4, $5:6, $7:

          $:n = 0..n
          $m:n = m..n
          $n: = n..end

          Alan Gates added a comment -

          I can see a couple of ways of approaching this.

          One would be something like the colon operator in Python, meaning everything in between. As colon is not widely used for this across programming languages, I propose '...' instead, since that is the natural language meaning of ellipses. If it was used before a certain field it would mean the beginning up to that field:

          B = foreach A generate ..., $10;
          

          would mean $0-$9

          If used between two fields, it would mean everything in between:

          B = foreach A generate $7, ..., $10;
          

          would mean $8 and $9.

          If used at the end of the line, it would mean everything after the last referenced field:

          B = foreach A generate $10, ...;
          

          would mean $11 to the end of the record.
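
The three cases above can be modelled with a short sketch. It is written in Python rather than Pig Latin because the '...' syntax is only a proposal here. The function name `expand_ellipsis` and the `num_fields` parameter are hypothetical, and the sketch assumes the record width is known and that no two '...' markers are adjacent.

```python
from typing import List, Union

ELLIPSIS = "..."


def expand_ellipsis(items: List[Union[int, str]], num_fields: int) -> List[int]:
    """Expand '...' markers in a generate list into explicit column positions.

    items holds column positions (ints) and the marker string "...".
    num_fields is the record width, needed to resolve a trailing '...'.
    """
    result: List[int] = []
    for i, item in enumerate(items):
        if item != ELLIPSIS:
            result.append(item)
            continue
        # First column after the previous explicit field, or 0 at the start.
        start = items[i - 1] + 1 if i > 0 else 0
        # Stop before the next explicit field, or at the record end.
        stop = items[i + 1] if i + 1 < len(items) else num_fields
        result.extend(range(start, stop))
    return result


# "generate $7, ..., $10" on a 13-field record
print(expand_ellipsis([7, ELLIPSIS, 10], 13))  # [7, 8, 9, 10]
```

A leading '...' fills in from $0 and a trailing one fills in up to the last column, which is why the trailing case needs the record width.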

          Another approach would be to define a symbol that means "all fields not referenced in this list of expressions". If, for example, we chose @ to mean this, then:

          B = foreach A generate $10, @;
          

          would mean $0-$9, and $11 to the end.

          Then does $10 keep its place as the eleventh field or become the first field?

          I like the '...' option better, as it allows more control of ordering and will be easier for users to understand.

          Whichever one we choose we have to answer what it means if an expression contains more than one field:

          B = foreach A generate udf($3, $5), ..., udf($8, $10);
          

          What range does '...' include? I propose it includes the highest column number on the left and the lowest on the right (thus in this example, $6 and $7).
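
As a minimal sketch of that rule (in Python, with the hypothetical helper name `columns_between`), the range runs from one past the highest column referenced on the left to one before the lowest column referenced on the right:

```python
from typing import Iterable, List


def columns_between(left_args: Iterable[int], right_args: Iterable[int]) -> List[int]:
    """Columns covered by '...' between two multi-field expressions."""
    start = max(left_args) + 1  # one past the highest column on the left
    stop = min(right_args)      # stop before the lowest column on the right
    return list(range(start, stop))


# udf($3, $5), ..., udf($8, $10)
print(columns_between([3, 5], [8, 10]))  # [6, 7]
```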

          In the @ case it's clear that @ would refer to $0, $1, $2, $4, $6, $7, $9, and anything past $10. But the ordering becomes even stickier. Where do $4 and $9 go?
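
The set of columns that @ would select can be sketched as a simple complement. This Python sketch uses the hypothetical name `unreferenced_columns`, assumes the record width is known, and returns the columns in their original order, which is only one possible answer to the ordering question raised above.

```python
from typing import Iterable, List


def unreferenced_columns(referenced: Iterable[int], num_fields: int) -> List[int]:
    """Columns not referenced by any expression, in original order."""
    used = set(referenced)
    return [col for col in range(num_fields) if col not in used]


# udf($3, $5), @, udf($8, $10) on a 13-field record
print(unreferenced_columns([3, 5, 8, 10], 13))  # [0, 1, 2, 4, 6, 7, 9, 11, 12]
```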

          In cases where Pig knows the schema, the '...' or '@' operator could be resolved at compile time. This will be more efficient. In cases where it does not, a new physical operator would be required to handle the @ or ellipsis end case "$1, ..." as we cannot construct a set of projections that knows exactly which columns to pass through.


            People

            • Assignee:
              Thejas M Nair
              Reporter:
              Alan Gates
            • Votes:
              2
              Watchers:
              7
