  1. Pig
  2. PIG-1693

support project-range expression. (was: There needs to be a way in foreach to indicate "and all the rest of the fields" )

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None
    • Release Note:

      Project-range ( '..' ) can be used to project a range of columns from input.
      For example, the expressions -
      .. $x : projects columns $0 through $x, inclusive
      $x .. : projects columns $x through the end, inclusive
      $x .. $y : projects columns $x through $y, inclusive
      If the input relation has a schema, you can also use column aliases instead of referring to columns by position. You can also combine aliases and column positions in a project-range expression (e.g., "col1 .. $5" is valid).


      This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a udf argument. Support for that use case will be added in PIG-1938.

      It can be used in the following statements -
      - foreach
      - join
      - order (also when it is within a nested foreach block)
      - group/cogroup

      Examples -
      {code}
      grunt> F = foreach IN generate (int)col0, col1 .. col3;
      grunt> describe F;
      F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
      {code}
      {code}
      grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
      {code}
      {code}
      J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
      {code}
      {code}
      g = group l1 by b .. c;
      {code}
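
      Two cases mentioned above have no example: combining an alias with a column position, and using project-range in an order statement within a nested foreach block. A minimal sketch, assuming IN has a schema with columns named col0 through col6 (these names are illustrative and not taken from the patch):
      {code}
      -- Assumption: IN has a schema with columns col0 .. col6.
      -- Alias combined with a column position in one range.
      M = foreach IN generate col1 .. $5;

      -- Project-range as the sort key of an order inside a nested foreach block.
      G = group IN by col0;
      N = foreach G {
          S = order IN by col1 .. col2;
          generate group, S;
      };
      {code}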

      Limitations:
      There are some restrictions on the use of the project-to-end form of project-range (e.g. "x ..") when the input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

      1. In cogroup/group statements, the project-to-end form of project-range is allowed only if the input has a schema.

      2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
      Example -
      {code}
      grunt> describe IN;
      Schema for IN unknown.

      -- Following statement is supported
      SORT = order IN by $2 .. $3, $6 ..;

      -- Following statement is NOT supported (project-to-end is not the last sort column)
      SORT = order IN by $6 .., $2 .. $3;
      {code}


      Description

      A common use case we see in Pig is that users have many columns in their data but want to operate on only a few of them. For example, suppose that before storing data with ten columns, the user wants to cast one column:

      ...
      Z = foreach Y generate (int)firstcol, secondcol, thirdcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
      store Z into 'output';
      

      Obviously this only gets worse as the user has more columns. Ideally the above could be transformed to something like:

      ...
      Z = foreach Y generate (int)firstcol, "and all the rest";
      store Z into 'output';
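
      With the project-range syntax described in the release note, the cast example could be sketched as follows. This assumes Y has a schema in which firstcol is the first column and secondcol immediately follows it:

      {code}
      -- Assumption: Y has the ten-column schema of the example above.
      -- "secondcol .." projects secondcol through the last column, inclusive.
      Z = foreach Y generate (int)firstcol, secondcol ..;
      store Z into 'output';
      {code}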
      

        Attachments

        1. PIG-1693.1.patch
          178 kB
          Thejas M Nair
        2. PIG-1693.2.patch
          178 kB
          Thejas M Nair

              People

              • Assignee:
                Thejas M Nair (thejas)
              • Reporter:
                Alan Gates (alangates)