Pig / PIG-3144

Erroneous map entry alias resolution leading to "Duplicate schema alias" errors

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11, 0.10.1
    • Fix Version/s: 0.12.0, 0.11.1
    • Component/s: None
    • Labels:
      None

      Description

      The following code illustrates a problem with alias resolution in Pig.

      The schema of D2 is incorrectly described as containing two "age" fields, and the last step of the script fails with a "Duplicate schema alias" error.

      I only encountered this bug when using aliases for map fields.

      DATA = LOAD 'file:///whatever' as (a:map[chararray], b:chararray);
      
      D1 = FOREACH DATA GENERATE a#'name' as name, a#'age' as age, b;
      
      D2 = FOREACH D1 GENERATE name, age, b;
      
      DESCRIBE D2;
      
      

      Output:

      D2: {
          age: chararray,
          age: chararray,
          b: chararray
      }
      
      
      D3 = FOREACH D2 GENERATE *;
      
      DESCRIBE D3;
      

      Output:

      <file file:///.../pig-bug-example.pig, line 20, column 16> Duplicate schema alias: age
      

      This error occurs in this form in Apache Pig version 0.11.0-SNAPSHOT (r6408). A less severe variant of the bug is also present in Pig 0.10.1: there the "Duplicate schema alias" error does not occur, but the schema of D2 (see above) still contains the wrong duplicate alias entry.

      1. PIG-3144-0.patch
        8 kB
        Jonathan Coveney
      2. PIG-3144-1.patch
        8 kB
        Jonathan Coveney
      3. PIG-3144-1-branch-0.11.patch
        8 kB
        Cheolsoo Park

          Activity

          Cheolsoo Park added a comment -

          Sounds good. Thanks Koji Noguchi for cleaning up the mess!

          Koji Noguchi added a comment -

          FYI, I'm trying to revert the change from this jira at PIG-3492.

          Cheolsoo Park added a comment -

          Attaching the 0.11 patch for the record.

          Cheolsoo Park added a comment -

          Committed to trunk and 0.11.

          Note that in the 0.11 version of the new test case I replaced the @'s with relation names, because @ isn't supported in 0.11.

          Cheolsoo Park added a comment -

          +1.

          The unit tests pass. I will commit it soon.

          Jonathan Coveney added a comment -

          Updated. Let me know how the tests come back. Thanks, Cheolsoo!

          Cheolsoo Park added a comment -

          Hi Jonathan,
          Can you update the comment in LogicalRelationalOperator.fixDuplicateUids()?

          /**
           * In the case of a join it is possible for multiple columns to have been derived from the same
           * column and thus have duplicate UID's. This detects that case and resets the uid.
           * See PIG-3022 and PIG-3093 for more information.
           * @param fss a list of LogicalFieldSchemas to check the uids of
           */
          
          1. This is not a join-specific issue, so "in the case of a join" should be removed.
          2. PIG-3022 should be replaced with PIG-3020.

          Otherwise, the patch looks good to me. I will run unit tests.
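
          The uid-reset idea described in the Javadoc quoted above can be illustrated with a toy model. This is only a sketch in Python, not Pig's Java implementation: `Field`, `fix_duplicate_uids`, and the uid values are hypothetical names and numbers, and the assumption that both map lookups inherit the uid of column `a` is made purely for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Field:
    alias: str
    uid: int  # hypothetical identifier for the source column a field came from


def fix_duplicate_uids(fields, fresh_uids):
    """Give every field whose uid was already seen a fresh uid (toy model)."""
    seen = set()
    for field in fields:
        if field.uid in seen:
            # Duplicate detected: reset it so later lookups by uid are unambiguous.
            field.uid = next(fresh_uids)
        seen.add(field.uid)
    return fields


# Assumption: a#'name' and a#'age' both start with the uid of column `a`.
d1 = [Field("name", 1), Field("age", 1), Field("b", 2)]
fixed = fix_duplicate_uids(d1, count(start=100))
uids = [f.uid for f in fixed]
aliases = [f.alias for f in fixed]
```

          In this model, once each field has its own uid, a later step that resolves aliases by uid can no longer map two different columns to the same alias.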

          Jonathan Coveney added a comment -

          Someone should review this

          Jonathan Coveney added a comment -

          Thanks for reporting this Kai. I had seen similar issues before, and this should be a generic fix for any case like this in a foreach.

          Jonathan Coveney added a comment -

          I agree that this is an issue, and it isn't too hard to fix. It uses a method that was developed for PIG-3020.


            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Kai Londenberg
            • Votes:
              0
              Watchers:
              5