Pig
PIG-1434

Allow casting relations to scalars

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      PIG-1434 adds functionality that allows casting elements of a single-tuple relation to a scalar value. The primary use case for this is using values of global aggregates in follow-up computations. For instance,

      A = load 'mydata' as (userid, clicks);

      B = group A all;

      C = foreach B generate SUM(A.clicks) as total;

      D = foreach A generate userid, clicks/(double)C.total;

      dump D;

       

      This example computes the percentage of clicks belonging to a particular user. Note that if the SUM is not given a name, a positional reference can be used as well: (userid, clicks/(double)C.$0). Also note that if an explicit cast is not used, an implicit cast is inserted according to the regular Pig rules, and that when the schema can't be inferred, bytearray is used.

       

      The relation can be used anywhere an expression of its type would make sense. This includes FOREACH, FILTER, and SPLIT; a SPLIT sketch follows.
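
      For instance, a minimal sketch of the scalar used inside SPLIT, assuming the A and C relations from the example above (the heavy/light aliases are illustrative):

      split A into heavy if clicks > (double)C.total/2, light if clicks <= (double)C.total/2;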

       

      A multi field tuple can also be used:

       

      A = load 'mydata' as (userid, clicks);

      B = group A all;

      C = foreach B generate SUM(A.clicks) as total, COUNT(A) as cnt;

      D = FILTER A by clicks > C.total/3;

      E = foreach D generate userid, clicks/(double)C.total, cnt;

      dump E;

       

      If a relation contains more than a single tuple, a runtime error is generated: "Scalar has more than one row in the output".
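
      For example, a hypothetical script that triggers this error, because grouping by userid leaves C with one row per user rather than a single row:

      A = load 'mydata' as (userid, clicks);
      B = group A by userid;
      C = foreach B generate SUM(A.clicks) as total;
      D = foreach A generate userid, clicks/(double)C.total;
      dump D;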

    • Tags:
      documentation

      Description

      This JIRA is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.

      The proposal is to allow casting relations to scalar types in foreach.

      Example:

      A = load 'data' as (x, y, z);
      B = group A all;
      C = foreach B generate COUNT(A);
      .....
      X = ....
      Y = foreach X generate $1/(long) C;

      A couple of additional comments:

      (1) You can only cast relations containing a single value; otherwise an error is reported.
      (2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence (a sketch follows below).
      (3) Y will resolve C to the closest preceding definition.
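
      A hypothetical sketch of the name resolution in (2), assuming the A and B relations from the example above: here X carries its own field named C, which shadows the relation C:

      C = foreach B generate COUNT(A);
      X = load 'other' as (w, C);
      Y = foreach X generate w, C;  -- C resolves to X's field, not to the relation C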

      Implementation thoughts:

      The idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
      (1) store C
      (2) convert the cast to the UDF
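
      Conceptually, the rewrite amounts to the following sketch (the temporary path and the scalar-reading UDF name are illustrative; the actual change happens on the logical plan):

      store C into 'tmp_scalar_C';                                   -- (1) store C
      Y = foreach X generate $1/(long) ReadScalar('tmp_scalar_C');   -- (2) cast replaced by a UDF call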

      1. ScalarImplFinaleRebase.patch
        56 kB
        Aniket Mokashi
      2. ScalarImplFinale1.patch
        59 kB
        Aniket Mokashi
      3. ScalarImplFinale.patch
        58 kB
        Aniket Mokashi
      4. ScalarImpl5.patch
        57 kB
        Aniket Mokashi
      5. ScalarImpl1.patch
        43 kB
        Aniket Mokashi
      6. scalarImpl.patch
        44 kB
        Aniket Mokashi


          Activity

          Daniel Dai added a comment -

           We decided to change some of the implementation to solve the following problems:
           1. Deciding when to add the store. Currently, we parse statement by statement; when we see a store, we merge that branch into the integrated logical plan. If we add the store too late, the merge algorithm cannot see it and discards the branch. If we add it too early (during the parsing of Y, in the example) and we never store/dump Y afterwards, we get a redundant store for C (see the sketch below).
           2. The implicit dependency C -> Y. C will create a side file and Y will use it. However, this is not the normal data flow and should not be represented as a connection in the logical plan.
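
           A hypothetical script illustrating problem 1: an eager store added while parsing Y would be wasted, since Y is never stored or dumped:

           A = load 'data' as (x, y, z);
           B = group A all;
           C = foreach B generate COUNT(A);
           Y = foreach A generate $1/(long) C;  -- parsed, but no store/dump of Y follows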

          Now we are exploring the following implementation:
          1. Add LOScalar, POScalar to represent scalar expression
          2. When parsing Y, we put LOScalar as a placeholder in the ForEach inner plan
           3. When parsing the store of Y, we know we need to merge the store branch. In the meantime, we check whether the branch (Y) contains a scalar; if so, we find what the scalar refers to (C), add a store to that branch, and merge it into the integrated logical plan
           4. Add a map-reduce layer optimizer, ScalarOptimizer. It finds the map-reduce job that contains POScalar and the map-reduce job that contains the operator POScalar refers to, and creates a dependency between these two map-reduce jobs. ScalarOptimizer should run before MultiQueryOptimizer

          Aniket Mokashi added a comment -

           The proposal for scalars is as follows:

          A = load '1.txt' as (a1, a2);
          B = group A all;
          C = foreach B generate COUNT(A);
          Y = foreach A generate C;
          store Y into 'Ystore';
          

           Based on the schema of C, we detect that Y means to use C as a scalar and internally track it as such. Thus, operations like C * C are also allowed (a sketch follows). The limitation is that C must have a long-convertible value (when stored into the file). Also, (int) C is allowed and will succeed if the cast operation succeeds.
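
           A minimal sketch of these allowed forms, assuming the relations above:

           Z1 = foreach A generate a1, C * C;    -- scalar arithmetic with itself
           Z2 = foreach A generate a1, (int) C;  -- explicit cast; succeeds if the cast does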

           As mentioned by Daniel earlier, there are two challenges in introducing scalars:
           1. Addition of the implicit store. We cannot do it too early (at parsing), as we would get a redundant (implicit) store operation for the rest of the commands in the script. If we do it too late, the merge algorithm doesn't find the store and discards the branch that compiles and executes the store.
           To solve this, whenever we process a store plan after the parsing stage, we detect the existence of scalars in the plan and add the required branches that have those scalars to the current plan. We also attach LOStores for the scalars and merge the required plan.
           2. Tracking of the implicit dependency. The scalar C needs to be converted into an implicit ReadScalar operation, but beyond this it also needs a dependency on the map-reduce job that generates the scalar value. We track this dependency by adding LOScalar and POScalar operators that carry the reference to the scalar they depend upon. When we compile the map-reduce plan, we replace POScalar with POUserFunc to load the scalar value and mark the dependency between the two map-reduce jobs.

          I am attaching the patch with above mentioned changes.

           A few known issues:
           To track the dependencies of scalars, we need access to a map of operators from one type of plan to the other, but this map is generated by visitors. The same visitors are responsible for converting LOScalar -> POScalar -> POUserFunc. So, if a visitor visits LOScalar before the LO associated with the scalar (C in the example), we do not find the PO associated with C.

          Aniket Mokashi added a comment -

           Initial implementation

          Aniket Mokashi added a comment -

           Submitting to Hudson to check for test failures.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12448098/scalarImpl.patch
          against trunk revision 958053.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/351/console

          This message is automatically generated.

          Thejas M Nair added a comment -

           Should we have a special syntax when a relational alias is used as a scalar? Something like "[C]" (instead of just "C").
           I think the special syntax is useful because:
           1. Users might accidentally use a relation-op alias in this scalar context and not discover the problem until runtime (when the evaluation of the relation results in more than one row).
           2. It will prevent surprises if a column with the same name as the alias is introduced in a new version of the input data, assuming the load function implements LoadMetadata.getSchema() and the load statement does not specify a new schema. With the current plan, if there is a new column by the same name, that column gets used instead of the relation-op value. This will give different results, not what the user expects.

          Dmitriy V. Ryaboy added a comment -

           A couple of thoughts that came out of the Pig contributor meeting:

           1) Rather than a scalar, we should make this work for single-tuple relations. That way a user can do something like this:

          A = load 'data' as (x, y, z);
          B = group A all;
          C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
          .....
          X = ....
          Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
          

           2) Writing the intermediate relation to a file can cause hotspots. We should push this into the distributed cache. In cases where the distributed cache is turned off, we can at least increase the replication factor to some largish number (10, maybe, like the jobs?).

          Aniket Mokashi added a comment -

           I agree with Thejas that we should have a way for the user to specify that they mean to use C as a scalar. This will avoid errors in Pig code.
           Thus, we have:

           Y = foreach X generate $1/(int) [C].count, $2 - [C].max, ([C].$1+2);
          

           Do we fail if we find C to have more than one row, or do we just ignore it? Should we try to detect that C has one row in the frontend?

          Thejas M Nair added a comment -

          Do we fail if we find C to have more than one row, or do we just ignore it?

          I think we should fail in that case, rather than have a surprising/unpredictable behavior.

          Should we try to detect, that C has one row, in frontend?

           This will not be possible in all cases, e.g. if the input has only one row, or if a filter is filtering out all except one row; see the sketch below.
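
           A hypothetical example of the data-dependent case, where whether C has exactly one row depends entirely on the input:

           A = load '1.txt' as (a1, a2);
           C = filter A by a1 == 'unique_key';  -- may yield zero, one, or many rows
           Y = foreach A generate a2/(long) C.$1;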

          Dmitriy V. Ryaboy added a comment -

           SQL fails at runtime when executing queries that require a single row to be returned. So, Oracle won't complain if you do this, for example:

          
          SELECT foo.a, (SELECT c 
                         FROM bar 
                         WHERE foo.a = bar.a) 
          from foo
          
          

          unless the inner select produces more than one row. I think we should adopt the same approach – assume the query is innocent until proven guilty.

          -D

          Aniket Mokashi added a comment -

           Should we try to detect that C has one row in the frontend?

           We can try to detect the patterns that make something a scalar (by marking B (group by all, limit 1) as a scalar and then C as a scalar, etc.) and fail upfront otherwise...

          Thejas M Nair added a comment -

           We can try to detect the patterns that make something a scalar (by marking B (group by all, limit 1) as a scalar and then C as a scalar, etc.) and fail upfront otherwise...

          I don't think we should limit this feature to the query patterns where we can detect that in frontend. In some cases it will depend on the data.

          Richard Ding added a comment -

          How about a "replicated" cross?

          A = load 'data' as (x, y, z);
          B = group A all;
          C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
          .....
          X = ....
          Y = cross X, C using 'repl';
          Z = foreach Y generate X::$1/(long) C.count, X::$2-(long) C.max;
          
          Dmitriy V. Ryaboy added a comment -

          Richard,
           In my opinion: yes in principle, but not as a replacement for this.
           Cross is dangerous. I would rather have a constrained "implicit scalar tuple" that people use for the common case, and leave something like a replicated cross for power users (and then not limit it to the number of rows in the replicated relation).

          Agreed with not constraining this feature on the frontend, failing at runtime instead.

          Richard Ding added a comment -

           I agree that we should use the right syntax. What I meant was that it can be implemented as a 'replicated' cross, which seems to solve the problems of the implicit dependency and of using the distributed cache.

          Thejas M Nair added a comment -

           I think the replicated cross is a good alternative to this feature, though this feature is probably more friendly for a beginner Pig user. But if this feature makes the Pig code very complicated/hacky (the dependency order and such), I think it might not be a bad idea to encourage the use of replicated join instead.

           As a side note, we can actually get a 'replicated cross' working using a replicated join, e.g.:

           j = join l1 by 1, l2 by 1 using 'replicated';
          
          Richard Ding added a comment -

          So all one needs to do is internally replace the line:

          Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
          

          with

          Z = join X by 1, C by 1 using 'replicated';
          Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
          
          Dmitriy V. Ryaboy added a comment -

          I think it's important in this particular case to ensure that C only contains one tuple, since multiple tuples will lead to very confusing output.
          Maybe

          C = LIMIT C 1;
          Z = join X by 1, C by 1 using 'replicated';
          Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
          

           (This is where an optimization noting that you don't need an M/R boundary, if the relation you are limiting has only one partition, would come in handy.)

          Daniel Dai added a comment -

           We may also add a sanity check, instead of just doing a limit.

          C = foreach C generate CheckSingular(*);
          Z = join X by 1, C by 1 using 'replicated';
          Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
          

           CheckSingular will check that C has only one record.

          Daniel Dai added a comment -

           We also need to enforce that C has only one part file in order to do the check (use limit to achieve it).

          C = limit C 2;
          C = foreach C generate CheckSingular(*);
          Z = join X by 1, C by 1 using 'replicated';
          Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
          
          Aniket Mokashi added a comment -

           Adding this support makes the Pig code complicated/hacky, because we treat any unparsed alias (AliasFieldOrSpec) as a scalar and try to resolve it as a scalar at runtime.

           To simplify, the square-bracketed syntax is a better idea, for example:

          Y = foreach Z generate X::$1/(long) [C].count, X::$2-(long) [C].max;
          

           Otherwise, such queries (if typed by mistake) can result in non-intuitive errors for users.

          Alan Gates added a comment -

           Adding this support makes the Pig code complicated/hacky, because we treat any unparsed alias (AliasFieldOrSpec) as a scalar and try to resolve it as a scalar at runtime.

           I don't understand when we'll see this issue. Could you give me examples of errors users might make that we'll miss? I definitely don't like the brackets.

          Aniket Mokashi added a comment -

           I mean the cases where users type in some alias that is currently not allowed inside a foreach statement, accidentally get it treated as a scalar by Pig, and then have their scripts fail at runtime (they may not fail in one-liner sample cases).
           For example:
           Y = foreach Z generate C.$0; where C is not a scalar. Currently, this throws an error upfront for erroneous usage of C (whether a logical mistake, such as not knowing the restrictions on the foreach statement, or a typing mistake). But after we add support for scalars, Pig may conclude that C is used as a scalar and generate the plans accordingly.
           By introducing the square-bracketed syntax, we can make sure that the user intended to use C as a scalar and that it wasn't introduced by mistake. A cast would also work for this, but since we have introduced scalar projections (C.count, C.max, etc.), we already have cases wherein the user may mean to cast the fields (count, max) rather than the scalars themselves.

          Alan Gates added a comment -

          Alright, I finally understand. I think the potential confusion for the user and the Pig parser is caused by the proposed way to handle multi-columned input. Rather than

          Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
          

          if we instead do

          Y = foreach Z generate X::$1/((tuple)C).count, X::$2 - ((tuple)C).max;
          

          then I believe it is clear for both user and parser what is happening.

           In each case C is being cast to a tuple and then fields are read out of it. C is not being cast to a long. Then the feature remains basically as originally proposed. The relation being cast must have one record and one field. That one field can be a tuple to handle the case where the record has multiple fields. But Pig will still read it as a single column, which is a tuple, and the user will need to cast it accordingly.

          This should also avoid accidental usage. In the example above:

          Y = foreach Z generate X::$1/C.count, X::$2 - C.max;
          

          should still be an error because the type checker should not be able to find C as a tuple anywhere in its symbol table.

          Dmitriy V. Ryaboy added a comment -

          +1 for casting as tuple. Though it may have to look like

          Y = foreach Z generate X::$1/(long) ((tuple)C).count, X::$2 - (long) ((tuple)C).max;
          

           Definitely -1 on the bracket syntax; it seems very non-intuitive.

          Aniket Mokashi added a comment -

          LOScalar keeps track of the scalars in the logical plan along with the reference to the scalar alias.
           During compilation, we add LOStores to the respective scalars and merge plans as needed.
          POScalar is later replaced by POUserFunc and appropriate dependency is added between the MROpers.
          Tested with store, dump, explain.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12449903/ScalarImpl1.patch
          against trunk revision 965559.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/347/testReport/
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/347/console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12450872/ScalarImplFinale.patch
          against trunk revision 980276.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          -1 javac. The applied patch generated 146 javac compiler warnings (more than the trunk's current 145 warnings).

          -1 findbugs. The patch appears to introduce 5 new Findbugs warnings.

          -1 release audit. The applied patch generated 406 release audit warnings (more than the trunk's current 400 warnings).

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/console

          This message is automatically generated.

          Richard Ding added a comment -

          It looks really good. Additional corner cases need to be considered:

          • missing field in scalar file
          • empty scalar file
          • empty input directory

          Some minor issues:

           • In PigServer, FileLocalizer.getTemporaryPath should be changed to use a non-deprecated method.
          • In ScalarFinder, method isScalarPresent is not used.
          • In MRCompiler, variable scalarPhyFinder should be local so that ScalarPhyFinder can be simplified.

           Also, adding a test case for using a scalar in a multi-query would be good.

          Aniket Mokashi added a comment -

           A missing field in the scalar file is handled by returning null. Empty scalar file and empty scalar directory are tested.
           ScalarPhyFinder is moved to a local variable. Removed redundant comments and APIs inside visitors.
           Added a new test case for multi-query.
           Fixed the Findbugs, javac, and javadoc warnings (needs a Findbugs exclusion, since we throw an error when a second line is found (not_null) in the UDF).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12451096/ScalarImplFinale1.patch
          against trunk revision 980930.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          -1 release audit. The applied patch generated 409 release audit warnings (more than the trunk's current 403 warnings).

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/console

          This message is automatically generated.

          Aniket Mokashi added a comment -

           Removed the unused variable (Findbugs warning).

          Aniket Mokashi added a comment -

          Attaching rebased version of the patch...

          Richard Ding added a comment -

          The patch is committed to trunk. Thanks Aniket for contributing this feature.

          Aniket Mokashi added a comment -

           Comments on the finalized syntax:

           With the above changes, Pig now supports:

          Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
          

           1. Casts are optional, and the datatype of the scalar depends on the schema of C (i.e., based on the schema of C, we add the casts implicitly; so, typically, count is a long and max is a double). In the case of an undeclared (null) schema for C, the default type of the scalar is chararray. A sketch follows after this list.

           2. Projections are mandatory. For example:

          Y = foreach X generate C; // is an *error*
          

           We need to use:

          Y = foreach X generate C.$0; 
          

           3. The check of whether C is a scalar is not performed until runtime; thus it will fail at UDF execution time with ExecException("Scalar has more than one row in the output").
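
           A minimal sketch of point 1, assuming C has the declared schema (count: long, max: double), so the casts can simply be dropped:

           Y = foreach X generate $1/C.count, $2 - C.max;  -- implicit casts derived from C's schema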

          Thejas M Nair added a comment -

           1. Casts are optional, and the datatype of the scalar depends on the schema of C (i.e., based on the schema of C, we add the casts implicitly; so, typically, count is a long and max is a double). In the case of an undeclared (null) schema for C, the default type of the scalar is chararray.

           The default type everywhere else in Pig is bytearray. I think we should follow that convention here as well. Any reason not to do that? (The change can be part of a separate JIRA.)

          Thejas M Nair added a comment -

           Changed the release note to incorporate the change of the default datatype to bytearray in PIG-1572.


            People

            • Assignee:
              Aniket Mokashi
              Reporter:
              Olga Natkovich
            • Votes:
              0
            • Watchers:
              2
