Pig / PIG-1693

support project-range expression. (was: There needs to be a way in foreach to indicate "and all the rest of the fields" )

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None
    • Release Note:

      Project-range ( '..' ) can be used to project a range of columns from input.
      For example, the expressions -
      .. $x : projects columns $0 through $x, inclusive
      $x .. : projects columns $x through end, inclusive
      $x .. $y : projects columns $x through $y, inclusive
      If the input relation has a schema, you can use column aliases instead of column positions. You can also combine aliases and column positions in one project-range expression (e.g., "col1 .. $5" is valid).
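
      As a rough illustration of the three forms, the sketch below assumes a hypothetical input with six declared columns (the file name and the aliases col0 through col5 are made up for this example). The trailing comments state what each range should expand to under the rules above.
      {code}
      IN = load 'data' as (col0, col1, col2, col3, col4, col5);
      A = foreach IN generate .. $2;        -- col0 through col2
      B = foreach IN generate col3 ..;      -- col3 through col5
      C = foreach IN generate col1 .. $3;   -- alias mixed with position: col1 through col3
      {code}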


      This expression can be used wherever '*' (project-star) is allowed, except as a UDF argument. Support for that use case will be added in PIG-1938.

      It can be used in the following statements -
      - foreach
      - join
      - order (also when it is within a nested foreach block)
      - group/cogroup

      Examples -
      {code}
      grunt> F = foreach IN generate (int)col0, col1 .. col3;
      grunt> describe F;
      F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
      {code}
      {code}
      grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
      {code}
      {code}
      J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
      {code}
      {code}
      g = group l1 by b .. c;
      {code}
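
      The examples above do not cover order inside a nested foreach block. A sketch of that case follows, assuming IN is loaded with a schema declaring the aliases col0 through col2 (the file name and aliases are hypothetical):
      {code}
      IN = load 'data' as (col0, col1, col2);
      G = group IN by col0;
      R = foreach G {
          -- sort each group's bag on the range col1 .. col2
          S = order IN by col1 .. col2;
          generate group, S;
      };
      {code}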

      Limitations:
      There are some restrictions on the use of the project-to-end form of project-range (e.g., "x .. ") when the input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

      1. In cogroup/group statements, the project-to-end form of project-range is allowed only if the input has a schema.

      2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
      Example -
      {code}
      grunt> describe IN;
      Schema for IN unknown.

      -- Following statement is supported
      SORT = order IN by $2 .. $3, $6 ..;

      -- Following statement is NOT supported (project-to-end is not the last sort column)
      SORT = order IN by $2 .., $6;
      {code}
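
      For limitation 1, here is a sketch of the allowed case. It assumes the load statement declares a schema, so the project-to-end form should be accepted as the group key. The file name and the aliases a through d are assumptions for illustration.
      {code}
      -- schema is declared, so "b .." can be resolved to b, c, d
      l1 = load 'data' as (a, b, c, d);
      g = group l1 by b ..;
      {code}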


      Description

      A common use case in Pig is that users have many columns in their data but want to operate on only a few of them. For example, suppose that before storing data with ten columns, the user wants to cast one column:

      ...
      Z = foreach Y generate (int)firstcol, secondcol, thirdcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
      store Z into 'output';
      

      Obviously this only gets worse as the user has more columns. Ideally the above could be transformed to something like:

      ...
      Z = foreach Y generate (int)firstcol, "and all the rest";
      store Z into 'output';
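
      With the project-range syntax described in the release note, the cast example could plausibly be written as below. This is a sketch, assuming Y has a schema in which secondcol directly follows firstcol, so that "secondcol .." expands to every remaining column. The load statement and file name are hypothetical.
      {code}
      Y = load 'input' as (firstcol, secondcol, thirdcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol);
      -- cast the first column, then project all the rest with the project-to-end form
      Z = foreach Y generate (int)firstcol, secondcol ..;
      store Z into 'output';
      {code}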
      
      1. PIG-1693.2.patch
        178 kB
        Thejas M Nair
      2. PIG-1693.1.patch
        178 kB
        Thejas M Nair

        Issue Links

          Activity

          Alan Gates created issue -
          Olga Natkovich made changes -
          Field Original Value New Value
          Fix Version/s 0.9.0 [ 12315191 ]
          Assignee Daniel Dai [ daijy ]
          Olga Natkovich made changes -
          Assignee Daniel Dai [ daijy ] Thejas M Nair [ thejas ]
          Thejas M Nair made changes -
          Summary There needs to be a way in foreach to indicate "and all the rest of the fields" support project-range expression. (was: There needs to be a way in foreach to indicate "and all the rest of the fields" )
          Thejas M Nair made changes -
          Attachment PIG-1693.1.patch [ 12474642 ]
          Thejas M Nair made changes -
          Attachment PIG-1693.2.patch [ 12474769 ]
          Thejas M Nair made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Release Note
          [Release note text as shown under Details, written with unspaced ranges ('..$x', '$x..', '$x..$y') and with this additional line under limitation 2:]
          Note: there is a bug PIG-1939, because of which the use is restricted when schema is present. That should be fixed soon.

          Resolution Fixed [ 1 ]
          Thejas M Nair made changes -
          Release Note
          [Same text as in the previous entry; the only change is added spacing around '..' in the range expressions ('..$x' became '.. $x').]

          Thejas M Nair made changes -
          Release Note
          [Same text as in the previous entry; the only change is removal of the line "Note: there is a bug PIG-1939, because of which the use is restricted when schema is present. That should be fixed soon." The result is the release note shown under Details.]

          Olga Natkovich made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Thejas M Nair made changes -
          Link This issue relates to PIG-2511 [ PIG-2511 ]

            People

            • Assignee:
              Thejas M Nair
              Reporter:
              Alan Gates
            • Votes:
              2
              Watchers:
              7
