Pig / PIG-767

Schema reported from DESCRIBE and actual schema of inner bags are different.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The following script:

      urlContents = LOAD 'inputdir' USING BinStorage() AS (url:bytearray, pg:bytearray);
      -- describe and dump are in-sync
      DESCRIBE urlContents;
      DUMP urlContents;

      urlContentsG = GROUP urlContents BY url;
      DESCRIBE urlContentsG;

      urlContentsF = FOREACH urlContentsG GENERATE group,urlContents.pg;

      DESCRIBE urlContentsF;
      DUMP urlContentsF;

      Prints for the DESCRIBE commands:

      urlContents: {url: chararray,pg: chararray}

      urlContentsG: {group: chararray,urlContents: {url: chararray,pg: chararray}}
      urlContentsF: {group: chararray,pg: {pg: chararray}}

      The reported schemas for urlContentsG and urlContentsF are wrong. They also contradict the section "Schemas for Complex Data Types" in http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_Schemas.

      As expected, the actual data observed from DUMP urlContentsG and DUMP urlContentsF does contain the tuple inside the inner bags.

      The correct schema for urlContentsG is: {group: chararray,urlContents: {t1:(url: chararray,pg: chararray)}}

      This may sound like a technicality, but it isn't. For instance, a UDF that assumes an inner bag of {chararray} will not work with {(chararray)}.
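
      To make the UDF concern concrete, here is a minimal sketch in Python rather than Pig. It models a bag as a plain list and shows a UDF written against the reported shape {chararray} failing on data of the actual shape {(chararray)}. The function names and sample values are hypothetical and are not part of Pig's API.

```python
# Toy model (not Pig's API): a bag is a list; real Pig bags hold tuples.

def concat_bag(bag):
    """Hypothetical UDF written against the reported schema {chararray}:
    it assumes every bag element is a bare string."""
    return ",".join(bag)  # str.join raises TypeError on non-string items


def concat_bag_of_tuples(bag):
    """The same UDF written against the actual schema {(chararray)}:
    each element is a one-field tuple that must be unwrapped first."""
    return ",".join(t[0] for t in bag)


reported = ["page1", "page2"]          # shape implied by {chararray}
actual = [("page1",), ("page2",)]      # shape that DUMP shows: {(chararray)}

try:
    concat_bag(actual)
    failed = False
except TypeError:
    failed = True  # the schema-trusting UDF breaks on the real data

result = concat_bag_of_tuples(actual)
```

      The point of the sketch is only that code trusting the printed schema and code trusting the dumped data need different element handling.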

      1. PIG-767-4.patch
        22 kB
        Daniel Dai
      2. PIG-767-3.patch
        21 kB
        Daniel Dai
      3. PIG-767-2.patch
        18 kB
        Daniel Dai
      4. PIG-767-1.patch
        3 kB
        Daniel Dai

          Activity

          Olga Natkovich made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Daniel Dai added a comment -

          Patch committed to trunk.

          Daniel Dai made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Daniel Dai added a comment -

          Review notes: https://reviews.apache.org/r/278/

          Daniel Dai made changes -
          Attachment PIG-767-4.patch [ 12469118 ]
          Daniel Dai added a comment -

          PIG-767-4.patch fixes the unit test failure in PIG-767-3.patch.

          Daniel Dai made changes -
          Attachment PIG-767-3.patch [ 12468932 ]
          Daniel Dai added a comment -

          Richard is right. I missed the change in Dereference in the original patch. Attaching PIG-767-3.patch.

          Daniel Dai added a comment -

          This depends on PIG-1786, which moves describe to the new logical plan. Once PIG-1786 is committed, we will see the right schema.

          Richard Ding added a comment -

          Together with PIG-1786, the output of the above describe commands is:

          urlContents: {url: bytearray,pg: bytearray}
          urlContentsG: {group: bytearray,urlContents: {(url: bytearray,pg: bytearray)}}
          urlContentsF: {group: bytearray,{pg: bytearray}}
          

          Should the output for urlContentsF be the following instead?

          urlContentsF: {group: bytearray,{(pg: bytearray)}}
          
          Daniel Dai made changes -
          Attachment PIG-767-2.patch [ 12468179 ]
          Daniel Dai added a comment -

          PIG-767-2.patch fixes all unit test failures.

          Daniel Dai made changes -
          Link This issue is blocked by PIG-1786 [ PIG-1786 ]
          Daniel Dai made changes -
          Attachment PIG-767-1.patch [ 12467916 ]
          Daniel Dai added a comment -

          The only place where we generate a bag schema without a tuple inside is LOGroup. Attaching PIG-767-1.patch, which changes LOGroup schema generation in the new logical plan. Once describe migrates to the new logical plan, this issue will be fixed.
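
          As a rough illustration of the difference under discussion, the following Python sketch renders a GROUP output schema with and without the inner tuple. It is a toy model under stated assumptions: the helper names (render_fields, render_bag, describe_group) are hypothetical and do not correspond to LOGroup or to any class in Pig.

```python
# Toy schema renderer (hypothetical; not Pig's Schema or LOGroup code).
# A field is a (name, type) pair; a bag schema wraps a list of fields.

def render_fields(fields):
    """Render fields the way describe prints them: 'name: type,name: type'."""
    return ",".join(name + ": " + type_ for name, type_ in fields)


def render_bag(fields, with_tuple):
    """Render a bag schema, optionally showing the tuple inside the bag."""
    inner = render_fields(fields)
    if with_tuple:
        return "{(" + inner + ")}"
    return "{" + inner + "}"


def describe_group(alias, key_type, input_alias, input_fields, with_tuple):
    """Render the schema of 'alias = GROUP input_alias BY <key>'."""
    bag = render_bag(input_fields, with_tuple)
    return alias + ": {group: " + key_type + "," + input_alias + ": " + bag + "}"


fields = [("url", "bytearray"), ("pg", "bytearray")]
without_tuple = describe_group("urlContentsG", "bytearray", "urlContents", fields, False)
with_tuple = describe_group("urlContentsG", "bytearray", "urlContents", fields, True)
```

          With with_tuple set to True, the rendered string has the bag-of-tuples form {( ... )} that the issue asks describe to report.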

          Alan Gates made changes -
          Assignee Alan Gates [ alangates ] Daniel Dai [ daijy ]
          Alan Gates made changes -
          Assignee Alan Gates [ alangates ]
          Olga Natkovich made changes -
          Fix Version/s 0.9.0 [ 12315191 ]
          Olga Natkovich made changes -
          Field Original Value New Value
          Fix Version/s 0.2.0 [ 12313783 ]
          Santhosh Srinivasan added a comment -

          As a reference, please look at PIG-449

          The tuple inside a bag cannot be accessed by name or position. There are no semantics that support accessing a tuple inside a bag. However, the contents of the tuple inside a bag are accessible. As such, the presence or absence of a tuple inside a bag (as part of the schema) in the describe output does not matter.

          E.g.: urlContentsG: {group: chararray,urlContents: {t1:(url: chararray,pg: chararray)}}
          In the above schema, you can access urlContents.url, but you cannot access urlContents.t1.

          An example to illustrate this point follows:

          grunt> a = load 'input' as (bagColumn: bag{t: tuple(i: int, f: float)});
          grunt> b = foreach a generate bagColumn.t;
          2009-04-16 13:23:43,324 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only access to the elements of the tuple in the bag is allowed.
          
          Details at logfile: /homes/sms/src_pig/pig/trunk_optimizer_phase1/pig_1239908562989.log
          
          grunt> c = foreach a generate bagColumn.i;
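
          The access rule shown in this session can be sketched in Python as follows. This is a toy model, not Pig's implementation: BAG_SCHEMA, project_field, and the sample rows are hypothetical, and the error text only paraphrases the grunt message above.

```python
# Toy model of projecting from a bag (hypothetical; not Pig's implementation).
# The bag schema names its inner tuple 't' and the tuple's fields 'i' and 'f'.
BAG_SCHEMA = {"tuple_alias": "t", "fields": ["i", "f"]}


def project_field(bag, schema, name):
    """Project one field out of every tuple in the bag.

    Projecting the tuple alias itself is rejected; projecting a field of
    the tuple yields a new bag of one-field tuples."""
    if name == schema["tuple_alias"]:
        raise ValueError("Access to the tuple (" + name + ") of the bag is disallowed.")
    idx = schema["fields"].index(name)
    return [(row[idx],) for row in bag]


bag = [(1, 1.5), (2, 2.5)]                 # sample rows, invented for the sketch
only_i = project_field(bag, BAG_SCHEMA, "i")  # analogous to bagColumn.i

try:
    project_field(bag, BAG_SCHEMA, "t")    # analogous to bagColumn.t
    raised = False
except ValueError:
    raised = True
```

          Note that the result of the allowed projection is still a bag of tuples, which is the shape in question for urlContentsF.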
          
          
          George Mavromatis added a comment -

          Santhosh: You are saying the exact same thing I said, i.e. that inner bags contain tuples and that describe does not report the tuples in the schema it prints. That's precisely the bug.

          Our examples are in fact similar too! I hope that my description shows that I understand that inner bags contain tuples ("As expected, actual data observed from DUMP... contain the tuple inside the inner bags").

          Santhosh Srinivasan added a comment -

          Firstly, the describe output is broken for bags in some cases. You will not see the inner tuple (t1 in your example). This can be fixed. It will not cause any problems for the runtime execution.

          Bags are containers of tuples. There are no bags that do not contain tuples unless the bags are empty. As a result, UDFs that assume an inner bag of chararray will always get a bag with chararray.

          I am pasting the output of similar queries and you should see the inner tuples in the output. Notice that you see the tuples in the bags. Also notice that the bags in the describe output do not have the inner tuples.

          
          grunt> a = load '/user/sms/data/student_tab.data' as (name: chararray, age:int, gpa: float);
          grunt> b = group a by age; 
          
          grunt> describe b;
          b: {group: int,a: {name: chararray,age: int,gpa: float}}
          
          grunt> dump b;
          (19,{(John,19,3.8F),(Jack,19,3.1F)})
          (20,{(Joe,20,3.5F),(Harry,20,3.2F),(Govinda,20,4.0F)})
          
          grunt> c = foreach b generate group, a.gpa;
          
          grunt> describe c;
          c: {group: int,gpa: {gpa: float}}
          
          grunt> dump c;
          (19,{(3.8F),(3.1F)})
          (20,{(3.5F),(3.2F),(4.0F)})
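
          The same grouping and projection can be mimicked in Python to show where the inner tuples come from. This is a sketch under assumptions: group_by and project are hypothetical helpers, bags are modeled as lists (so their order here is an artifact of the model), and the rows are the ones visible in the dump output above.

```python
from collections import defaultdict

# Rows taken from the dump output above: (name, age, gpa).
students = [
    ("John", 19, 3.8), ("Jack", 19, 3.1),
    ("Joe", 20, 3.5), ("Harry", 20, 3.2), ("Govinda", 20, 4.0),
]


def group_by(rows, key_index):
    """Mimic 'b = group a by age': each key maps to a bag of whole tuples."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return sorted(bags.items())


def project(bag, index):
    """Mimic 'a.gpa': keep one field, but each element stays a tuple."""
    return [(row[index],) for row in bag]


b = group_by(students, 1)
c = [(key, project(bag, 2)) for key, bag in b]
```

          In this model, every element of every bag in b and c is a tuple, matching the dump output rather than the describe output.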
          
          George Mavromatis created issue -

            People

            • Assignee:
              Daniel Dai
              Reporter:
              George Mavromatis