Pig / PIG-3379

Alias reuse in nested foreach causes PIG script to fail

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.0
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The following script fails:

      temp.pig
      Events = LOAD 'x' AS (eventTime:long, deviceId:chararray, eventName:chararray);
      Events = FOREACH Events GENERATE eventTime, deviceId, eventName;
      EventsPerMinute = GROUP Events BY (eventTime / 60000);
      EventsPerMinute = FOREACH EventsPerMinute {
        DistinctDevices = DISTINCT Events.deviceId;
        nbDevices = SIZE(DistinctDevices);
      
        DistinctDevices = FILTER Events BY eventName == 'xuaHeartBeat';
        nbDevicesWatching = SIZE(DistinctDevices);
      
        GENERATE $0*60000 as timeStamp, nbDevices as nbDevices, nbDevicesWatching as nbDevicesWatching;
      }
      EventsPerMinute = FILTER EventsPerMinute BY timeStamp >= 0  AND timeStamp < 100000;
      A = FOREACH EventsPerMinute GENERATE timeStamp;
      describe A;
      

      With the error:

      2013-07-16 11:31:20,450 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: 
      <file /home/xzhang/Documents/temp.pig, line 14, column 37> Invalid field projection. Projected field [timeStamp] does not exist in schema: deviceId:chararray.
      

      Using a distinct alias name for the second "DistinctDevices" fixes the problem. Removing the last FILTER statement also makes the error go away.
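
      For reference, the renaming workaround mentioned above could look like the sketch below. The alias name WatchingEvents is an arbitrary choice for illustration and does not come from the original report; the rest of the script is unchanged.

```pig
Events = LOAD 'x' AS (eventTime:long, deviceId:chararray, eventName:chararray);
Events = FOREACH Events GENERATE eventTime, deviceId, eventName;
EventsPerMinute = GROUP Events BY (eventTime / 60000);
EventsPerMinute = FOREACH EventsPerMinute {
  DistinctDevices = DISTINCT Events.deviceId;
  nbDevices = SIZE(DistinctDevices);

  -- Hypothetical alias name: renamed from DistinctDevices so that no alias
  -- is reused inside the nested block.
  WatchingEvents = FILTER Events BY eventName == 'xuaHeartBeat';
  nbDevicesWatching = SIZE(WatchingEvents);

  GENERATE $0*60000 as timeStamp, nbDevices as nbDevices, nbDevicesWatching as nbDevicesWatching;
}
EventsPerMinute = FILTER EventsPerMinute BY timeStamp >= 0 AND timeStamp < 100000;
A = FOREACH EventsPerMinute GENERATE timeStamp;
describe A;
```

      With each nested alias defined only once, every SIZE call refers to an unambiguous relation, which is presumably why the error does not appear on 0.11.1.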

      1. PIG-3379.patch
        13 kB
        Xuefu Zhang
      2. PIG-3379-draft.patch
        1 kB
        Daniel Dai

        Activity

        Xuefu Zhang added a comment -

        It seems related to PIG-1271 and PIG-2530, but both were marked as fixed.

        Xuefu Zhang added a comment -

        Correction: I meant PIG-1721 instead of PIG-1271 in the above comment.

        Daniel Dai added a comment -

        Xuefu Zhang, it seems we can have a simpler fix. Attaching PIG-3379-draft.patch.

        What do you think?

        Xuefu Zhang added a comment -

        Daniel Dai, thanks for your suggestion. While your patch does make "describe A" work, it generates the wrong result with the new test case in my patch. Further, the following logical plan for "EventsPerMinute" contains only one "DistinctDevices" operator, which is incorrect. My original patch fixes this by making sure that each project expression points to the right operator. Please let me know your further thoughts.

        Xuefu Zhang added a comment -

        Repost the logical plan snippet.

            |---EventsPerMinute: (Name: LOForEach Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)
                |   |
                |   (Name: LOGenerate[false,false,false] Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)ColumnPrune:InputUids=[135, 134]ColumnPrune:OutputUids=[141, 143, 142]
                |   |   |
                |   |   (Name: Multiply Type: long Uid: 141)
                |   |   |
                |   |   |---group:(Name: Project Type: long Uid: 134 Input: 0 Column: (*))
                |   |   |
                |   |   |---(Name: Cast Type: long Uid: 139)
                |   |       |
                |   |       |---(Name: Constant Type: int Uid: 139)
                |   |   |
                |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 142)
                |   |   |
                |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
                |   |   |
                |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 143)
                |   |   |
                |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
                |   |
                |   |---(Name: LOInnerLoad[0] Schema: group#134:long)
                |   |
                |   |---DistinctDevices: (Name: LOFilter Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
                |       |   |
                |       |   (Name: Equal Type: boolean Uid: 138)
                |       |   |
                |       |   |---eventName:(Name: Project Type: chararray Uid: 108 Input: 0 Column: 2)
                |       |   |
                |       |   |---(Name: Constant Type: chararray Uid: 137)
                |       |
                |       |---Events: (Name: LOInnerLoad[1] Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
                |
        
        
        
        Daniel Dai added a comment -

        Yes, you are right, it's not the dangling branch, it's the incorrect inner plan. Let me take a look again.

        Daniel Dai added a comment -

        The LODistinct operator is missing from the posted logical plan. It should be:

            |---EventsPerMinute: (Name: LOForEach Schema: timeStamp#56:long,nbDevices#57:long,nbDevicesWatching#58:long)
                |   |
                |   (Name: LOGenerate[false,false,false] Schema: timeStamp#56:long,nbDevices#57:long,nbDevicesWatching#58:long)ColumnPrune:InputUids=[50, 49]ColumnPrune:OutputUids=[58, 57, 56]
                |   |   |
                |   |   (Name: Multiply Type: long Uid: 56)
                |   |   |
                |   |   |---group:(Name: Project Type: long Uid: 49 Input: 0 Column: (*))
                |   |   |
                |   |   |---(Name: Cast Type: long Uid: 54)
                |   |       |
                |   |       |---(Name: Constant Type: int Uid: 54)
                |   |   |
                |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 57)
                |   |   |
                |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 50 Input: 1 Column: (*))
                |   |   |
                |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 58)
                |   |   |
                |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 50 Input: 2 Column: (*))
                |   |
                |   |---(Name: LOInnerLoad[0] Schema: group#49:long)
                |   |
                |   |---DistinctDevices: (Name: LODistinct Schema: deviceId#22:chararray)
                |   |   |
                |   |   |---1-3: (Name: LOForEach Schema: deviceId#22:chararray)
                |   |       |   |
                |   |       |   (Name: LOGenerate[false] Schema: deviceId#22:chararray)
                |   |       |   |   |
                |   |       |   |   deviceId:(Name: Project Type: chararray Uid: 22 Input: 0 Column: (*))
                |   |       |   |
                |   |       |   |---(Name: LOInnerLoad[1] Schema: deviceId#22:chararray)
                |   |       |
                |   |       |---Events: (Name: LOInnerLoad[1] Schema: eventTime#21:long,deviceId#22:chararray,eventName#23:chararray)
        

        The plan looks right.

        I talked with Xuefu Zhang. The idea is to use projectedOperator instead of the alias at the time we convert an alias to a position. The newly introduced projectedOperator is only used in alias translation. After that, input# and col# will be used as the coordinates of ProjectExpression. The patch looks good. I will commit it once tests pass.

        Daniel Dai added a comment -

        Patch committed to trunk. Thanks Xuefu!


          People

          • Assignee: Xuefu Zhang
          • Reporter: Xuefu Zhang