PIG-1434

Allow casting relations to scalars

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      PIG-1434 adds functionality that allows casting elements of a single-tuple relation to a scalar value. The primary use case is using the values of global aggregates in follow-up computations. For instance,

      A = load 'mydata' as (userid, clicks);
      B = group A all;
      C = foreach B generate SUM(A.clicks) as total;
      D = foreach A generate userid, clicks/(double)C.total;
      dump D;

      This example computes each user's share of the total clicks. If the SUM is not given a name, its position can be used instead (userid, clicks/(double)C.$0). If no explicit cast is used, an implicit cast is inserted according to the regular Pig rules. When the schema cannot be inferred, bytearray is used.

      The scalar can be used anywhere an expression of its type makes sense, including FOREACH, FILTER, and SPLIT.
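
      As an illustration of the FILTER and SPLIT cases, here is a minimal sketch. The input path 'mydata', the declared field types, the use of AVG, and the alias names above, heavy, and light are assumptions made for this example and are not part of the release note.

```pig
-- Sketch only: 'mydata' and the field types are assumptions.
A = load 'mydata' as (userid:chararray, clicks:long);
B = group A all;
-- C has exactly one tuple, so C.avg_clicks can be read as a scalar.
C = foreach B generate AVG(A.clicks) as avg_clicks;

-- FILTER: keep users whose clicks exceed the global average.
above = filter A by clicks > C.avg_clicks;

-- SPLIT: route each tuple by comparing it with the same scalar.
split A into heavy if clicks > C.avg_clicks, light if clicks <= C.avg_clicks;

dump above;
dump heavy;
dump light;
```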

       

      A multi-field tuple can also be used:

      A = load 'mydata' as (userid, clicks);
      B = group A all;
      C = foreach B generate SUM(A.clicks) as total, COUNT(A) as cnt;
      D = FILTER A by clicks > C.total/3;
      E = foreach D generate userid, clicks/(double)C.total, C.cnt;
      Dump E;

      If a relation contains more than a single tuple, a runtime error is generated: "Scalar has more than one row in the output"
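
      The single-tuple requirement can be sketched as follows. The input path, the field types, and the commented-out per-user variant are assumptions for illustration. The point is only that "group ... all" produces one group, while "group ... by" generally produces many.

```pig
-- Sketch only: 'mydata' and the field types are assumptions.
A = load 'mydata' as (userid:chararray, clicks:long);

-- Grouping by userid yields one tuple per user. Using such a relation
-- as a scalar would be expected to raise the runtime error quoted above:
--   B_bad = group A by userid;
--   C_bad = foreach B_bad generate SUM(A.clicks) as total;
--   D_bad = foreach A generate userid, clicks/(double)C_bad.total;

-- 'group ... all' puts every input tuple into a single group, so C has one tuple.
B = group A all;
C = foreach B generate SUM(A.clicks) as total;
D = foreach A generate userid, clicks/(double)C.total;
dump D;
```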

    • Tags:
      documentation

      Description

      This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.

      The proposal is to allow casting relations to scalar types in foreach.

      Example:

      A = load 'data' as (x, y, z);
      B = group A all;
      C = foreach B generate COUNT(A);
      .....
      X = ....
      Y = foreach X generate $1/(long) C;

      A couple of additional comments:

      (1) Only a relation containing a single value can be cast; otherwise an error is reported.
      (2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
      (3) Y will look for the C closest to it.
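
      Comment (2) can be illustrated with a small sketch. The field types, the alias cnt, and the choice to rename x to C are assumptions made up for this example. It shows the precedence rule as proposed here, not necessarily the behavior of the committed patch.

```pig
-- Sketch only: the field types and the renaming of x to C are assumptions.
A = load 'data' as (x:long, y:long, z:long);
B = group A all;
C = foreach B generate COUNT(A) as cnt;

-- X deliberately carries a field that shares its name with relation C.
X = foreach A generate x as C, y;

-- Under comment (2), the bare name C below resolves to X's field,
-- not to relation C, because the field takes precedence.
Y = foreach X generate y / C;
dump Y;
```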

      Implementation thoughts:

      The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
      (1) Store C
      (2) convert the cast to the UDF

      1. scalarImpl.patch
        44 kB
        Aniket Mokashi
      2. ScalarImpl1.patch
        43 kB
        Aniket Mokashi
      3. ScalarImpl5.patch
        57 kB
        Aniket Mokashi
      4. ScalarImplFinale.patch
        58 kB
        Aniket Mokashi
      5. ScalarImplFinale1.patch
        59 kB
        Aniket Mokashi
      6. ScalarImplFinaleRebase.patch
        56 kB
        Aniket Mokashi


            People

            • Assignee:
              Aniket Mokashi
              Reporter:
              Olga Natkovich