[PIG-1693] support project-range expression. (was: There needs to be a way in foreach to indicate "and all the rest of the fields" ) - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.9.0
Component/s: impl
Labels: None

Release Note:

Project-range ( '..' ) can be used to project a range of columns from input.
For example, the expressions -
.. $x : projects columns $0 through $x, inclusive
$x .. : projects columns $x through end, inclusive
$x .. $y : projects columns $x through $y, inclusive
If the input relation has a schema, you can also refer to columns by alias instead of by position. You can also combine aliases and column positions in a single project-range expression (i.e., "col1 .. $5" is valid).

This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a UDF argument. Support for that use case will be added in PIG-1938.

It can be used in the following statements -
- foreach
- join
- order (also when it is within a nested foreach block)
- group/cogroup

Examples -
{code}
grunt> F = foreach IN generate (int)col0, col1 .. col3;
grunt> describe F;
F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
{code}
{code}
grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
{code}
{code}
J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
{code}
{code}
g = group l1 by b .. c;
{code}
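
Two cases mentioned above have no example of their own: mixing an alias with a position in one range, and using project-range in an order statement nested inside a foreach block. A minimal sketch of both follows. The relation name, input path and column names are hypothetical placeholders, not taken from the patch.

```pig
-- Hypothetical input with a declared schema; path and names are placeholders.
IN = load 'input' as (col0, col1, col2, col3, col4, col5);

-- Alias and position mixed in one range: col1 through $3, i.e. col1, col2, col3.
F = foreach IN generate col0, col1 .. $3;

-- Project-range in an order statement nested inside a foreach block.
G = group IN by col0;
R = foreach G {
    sorted = order IN by col1 .. col2;
    generate group, sorted;
};
```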

Limitations:
There are some restrictions on the use of the project-to-end form of project-range (e.g., "x ..") when the input schema is null (unknown). These are also the cases where the use of project-star ('*') is restricted.

1. In cogroup/group statements, the project-to-end form of project-range is allowed only if the input has a schema.

2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
Example -
{code}
grunt> describe IN;
Schema for IN unknown.

-- Following statement is supported
SORT = order IN by $2 .. $3, $6 ..;

-- Following statement is NOT supported (project-to-end is not the last sort column)
SORT = order IN by $2 .., $6;
{code}
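
Limitation 1 can be illustrated the same way. This is a sketch under assumptions: the relation names, input path and schema are hypothetical, and the rejected statement is left commented out because the exact error it produces is not stated here.

```pig
-- Hypothetical input with a declared schema; path and names are placeholders.
WITH_SCHEMA = load 'input' as (a, b, c, d);

-- Project-to-end in group: allowed because the schema is known (keys b, c, d).
G1 = group WITH_SCHEMA by b ..;

-- Hypothetical input loaded without a schema.
NO_SCHEMA = load 'input';

-- Per limitation 1, the project-to-end form is expected to be rejected here:
-- G2 = group NO_SCHEMA by $1 ..;
```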


Description

A common use case we see in Pig is that people have many columns in their data and only want to operate on a few of them. Consider, for example, a user who wants to cast one column before storing data with ten columns:

...
Z = foreach Y generate (int)firstcol, secondcol, thirdcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
store Z into 'output';

Obviously this only gets worse as the user has more columns. Ideally the above could be transformed to something like:

...
Z = foreach Y generate (int)firstcol, "and all the rest";
store Z into 'output';
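
With the project-range expression described in the release note, the request above can presumably be written with the project-to-end form. The sketch below assumes a hypothetical load statement and schema in place of the elided steps; only the foreach line is the point.

```pig
-- Hypothetical load; the path and schema stand in for the elided steps above.
Y = load 'input' as (firstcol, secondcol, thirdcol, fourthcol, fifthcol,
                     sixthcol, seventhcol, eighthcol, ninthcol, tenthcol);

-- Cast the first column, then project "all the rest" with project-to-end.
Z = foreach Y generate (int)firstcol, secondcol ..;
store Z into 'output';
```

The statement stays the same length however many columns follow secondcol.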

Attachments

- PIG-1693.2.patch (28/Mar/11 12:07, 178 kB, Thejas Nair)
- PIG-1693.1.patch (25/Mar/11 18:18, 178 kB, Thejas Nair)

Issue Links

relates to

PIG-2511 (Open): Enable '*' to skip any fields that have already been generated and cast in other parts of the GENERATE, as in: foo = FOREACH my_relation GENERATE manipulate(foo1) as foo1, *;
