Pig
  1. Pig
  2. PIG-3581

Incorrect scope resolution with nested foreach

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      Consider the following script:

      A = LOAD 'test_data' AS (a: int, b: int);
      C = FOREACH A GENERATE *;
      B = FOREACH (GROUP A BY a) {
      	C = FILTER A BY b % 2 == 0;
      	D = FILTER A BY b % 2 == 1;
      	GENERATE group AS a, A.b AS every, C.b AS even, D.b AS odd;
      };
      DESCRIBE B;
      

      Notice that C is defined both inside the nested foreach as well as outside. I would expect that in the GENERATE inside the nested FOREACH, the C that is used will be the one that is defined inside. If that is not so, I think at least a warning is due.

      However, currently Pig silently assumes that the C you mean one is the one that is defined outside the nested FOREACH.

      Hence, the result of "DESCRIBE B" looks as follows:

      B: {
          a: int,
          every: {
              (
                  b: int
              )
          },
          even: int,
          odd: {
              (
                  b: int
              )
          }
      }
      

      If I remove the definition of C that is outside the foreach, then I get the following for "DESCRIBE B":

      B: {
          a: int,
          every: {
              (
                  b: int
              )
          },
          even: {
              (
                  b: int
              )
          },
          odd: {
              (
                  b: int
              )
          }
      }
      
      1. PIG-3581-2.patch
        4 kB
        Aniket Mokashi
      2. PIG-3581-1.patch
        2 kB
        Aniket Mokashi

        Issue Links

          Activity

          Hide
          Xuefu Zhang added a comment -

          Aniket Bhatnagar Could you please put a review board entry for this? Also, do you think if a test case would be helpful? Thanks.

          Show
          Xuefu Zhang added a comment - Aniket Bhatnagar Could you please put a review board entry for this? Also, do you think if a test case would be helpful? Thanks.
          Hide
          Aniket Mokashi added a comment -

          Xuefu Zhang, Thanks for looking into this. I will submit a review shortly.

          It's a minor change, I'm just changing the order of name resolution to give priority to nested variable over scalar. Essentially following code -

          // From --
          if(1) doScalar
          else if(2) doNested
          else if(3) other
          
          // To--
          if(2) doNested
          else if(1) doScalar
          else if(3) other
          

          If you insist, I can add a test for this, but I think we won't need it.

          Show
          Aniket Mokashi added a comment - Xuefu Zhang , Thanks for looking into this. I will submit a review shortly. It's a minor change, I'm just changing the order of name resolution to give priority to nested variable over scalar. Essentially following code - // From -- if (1) doScalar else if (2) doNested else if (3) other // To-- if (2) doNested else if (1) doScalar else if (3) other If you insist, I can add a test for this, but I think we won't need it.
          Hide
          Xuefu Zhang added a comment -

          Aniket Bhatnagar Thanks for your explanation. I agree that not every patch needs a test. While the change might be minor, it changed or corrected the behaviour. Unless we find it very difficult to construct a test, I'd think a test case is helpful in the sense that it not only provides a way to verify the fix but also prevents future break of this. As a fyi, there are some example of test cases in TestPigServer which might be followed.

          Show
          Xuefu Zhang added a comment - Aniket Bhatnagar Thanks for your explanation. I agree that not every patch needs a test. While the change might be minor, it changed or corrected the behaviour. Unless we find it very difficult to construct a test, I'd think a test case is helpful in the sense that it not only provides a way to verify the fix but also prevents future break of this. As a fyi, there are some example of test cases in TestPigServer which might be followed.
          Hide
          Aniket Mokashi added a comment -

          Thanks Xuefu Zhang. Attached new patch with tests. RB: https://reviews.apache.org/r/15852/

          Show
          Aniket Mokashi added a comment - Thanks Xuefu Zhang . Attached new patch with tests. RB: https://reviews.apache.org/r/15852/
          Hide
          Xuefu Zhang added a comment -

          Aniket Bhatnagar Thanks for spending time on the test case. I posted a couple of comments on rb.

          Show
          Xuefu Zhang added a comment - Aniket Bhatnagar Thanks for spending time on the test case. I posted a couple of comments on rb.
          Hide
          Aniket Mokashi added a comment -

          Thanks Xuefu Zhang, I do not understand your concern, scalars are allowed anywhere in the script and that change doesn't seem to affect that anyway.

          [JFYI, you are looping in the wrong Aniket above.]

          Show
          Aniket Mokashi added a comment - Thanks Xuefu Zhang , I do not understand your concern, scalars are allowed anywhere in the script and that change doesn't seem to affect that anyway. [JFYI, you are looping in the wrong Aniket above.]
          Hide
          Xuefu Zhang added a comment -

          Okay. Then I was mistaken indeed. Patch looks good. +1.

          My apology to another Aniket also.

          Show
          Xuefu Zhang added a comment - Okay. Then I was mistaken indeed. Patch looks good. +1. My apology to another Aniket also.
          Hide
          Aniket Mokashi added a comment -

          Thanks for the review, Xuefu Zhang!
          Committed to trunk..

          Show
          Aniket Mokashi added a comment - Thanks for the review, Xuefu Zhang ! Committed to trunk..
          Hide
          Cheolsoo Park added a comment -

          Aniket Mokashi, I think this patch introduced a regression. Consider the following query-

          a = LOAD 'foo' AS (x:int, y:chararray);
          b = GROUP a BY x;
          c = FOREACH b {
              expr = 'bar';
              filtered = FILTER a BY y == expr;
              GENERATE COUNT(filtered);
          }
          DESCRIBE c;
          

          This used to work in 0.11 but no longer works in trunk. It looks like 'expr' used to be resolved to a scalar expression ('bar'), but it's not the case anymore.

          My question are,
          1. Is it supported to define a local scalar expression inside a nested foreach? e.g. expr = 'bar';
          2. If so, can you fix the regression?

          Show
          Cheolsoo Park added a comment - Aniket Mokashi , I think this patch introduced a regression. Consider the following query- a = LOAD 'foo' AS (x: int , y:chararray); b = GROUP a BY x; c = FOREACH b { expr = 'bar'; filtered = FILTER a BY y == expr; GENERATE COUNT(filtered); } DESCRIBE c; This used to work in 0.11 but no longer works in trunk. It looks like 'expr' used to be resolved to a scalar expression ('bar'), but it's not the case anymore. My question are, 1. Is it supported to define a local scalar expression inside a nested foreach? e.g. expr = 'bar'; 2. If so, can you fix the regression?
          Hide
          Aniket Mokashi added a comment -

          Let me look into this in detail.

          In general, expr = 'bar'; filtered = FILTER a BY y == expr; - using scalar is very bad (crazy). It is going to store the word 'bar' in a file on HDFS and read it back.

          Show
          Aniket Mokashi added a comment - Let me look into this in detail. In general, expr = 'bar'; filtered = FILTER a BY y == expr; - using scalar is very bad (crazy). It is going to store the word 'bar' in a file on HDFS and read it back.
          Hide
          Cheolsoo Park added a comment -

          I am not saying it's a good thing to do. I was answering a user questions on the mailing list and curious whether it's supported or not. For the context, see here-
          http://www.mail-archive.com/user@pig.apache.org/msg08977.html

          Show
          Cheolsoo Park added a comment - I am not saying it's a good thing to do. I was answering a user questions on the mailing list and curious whether it's supported or not. For the context, see here- http://www.mail-archive.com/user@pig.apache.org/msg08977.html
          Hide
          Cheolsoo Park added a comment -

          It is going to store the word 'bar' in a file on HDFS and read it back.

          Plus, I don't think this is true. See the explain output in 0.11. There is no hdfs reading for 'bar'. It's a constant-

          c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-20
          |
          |---c: New For Each(false)[bag] - scope-19
              |   |
              |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-13
              |   |
              |   |---RelationToExpressionProject[bag][*] - scope-12
              |       |
              |       |---filtered: Filter[bag] - scope-15
              |           |   |
              |           |   Equal To[boolean] - scope-18
              |           |   |
              |           |   |---Project[chararray][1] - scope-16
              |           |   |
              |           |   |---Constant(bar) - scope-17
              |           |
              |           |---Project[bag][1] - scope-14
              |
              |---b: Package[tuple]{int} - scope-9--------
          
          Show
          Cheolsoo Park added a comment - It is going to store the word 'bar' in a file on HDFS and read it back. Plus, I don't think this is true. See the explain output in 0.11. There is no hdfs reading for 'bar'. It's a constant- c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-20 | |---c: New For Each( false )[bag] - scope-19 | | | POUserFunc(org.apache.pig.builtin.COUNT)[ long ] - scope-13 | | | |---RelationToExpressionProject[bag][*] - scope-12 | | | |---filtered: Filter[bag] - scope-15 | | | | | Equal To[ boolean ] - scope-18 | | | | | |---Project[chararray][1] - scope-16 | | | | | |---Constant(bar) - scope-17 | | | |---Project[bag][1] - scope-14 | |---b: Package[tuple]{ int } - scope-9--------
          Hide
          Aniket Mokashi added a comment -

          Understood. So, its a problem with nested_command resolution and not scalar resolution. I will submit a patch for this.

          Show
          Aniket Mokashi added a comment - Understood. So, its a problem with nested_command resolution and not scalar resolution. I will submit a patch for this.
          Hide
          Aniket Mokashi added a comment -

          Is it supported to define a local scalar expression inside a nested foreach? e.g. expr = 'bar';

          Yes. https://github.com/apache/pig/blob/branch-0.12/src/org/apache/pig/parser/QueryParser.g#L923

          If so, can you fix the regression?

          Yes

          Show
          Aniket Mokashi added a comment - Is it supported to define a local scalar expression inside a nested foreach? e.g. expr = 'bar'; Yes. https://github.com/apache/pig/blob/branch-0.12/src/org/apache/pig/parser/QueryParser.g#L923 If so, can you fix the regression? Yes
          Hide
          Cheolsoo Park added a comment -

          Aniket Mokashi, I put up a fix in PIG-3643.

          Show
          Cheolsoo Park added a comment - Aniket Mokashi , I put up a fix in PIG-3643 .

            People

            • Assignee:
              Aniket Mokashi
              Reporter:
              Venu Satuluri
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development