PIG-2271: PIG regression in BinStorage/PigStorage in 0.9.1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.1, 0.10.0
    • Fix Version/s: 0.9.2, 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      patch committed to 0.9 branch and trunk

      Description

      I'm using the 0.9.1 official release.

      My input data are read from a text file 'activity' (provided as an attachment):

      00,1239698069000, <- this is the line that is not correctly handled
      01,1239698505000,b
      01,1239698369000,a
      02,1239698413000,b
      02,1239698553000,c
      02,1239698313000,a
      03,1239698316000,a
      03,1239698516000,c
      03,1239698416000,b
      03,1239698621000,d
      04,1239698417000,c
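
Note that the first record ends with an empty field, which PigStorage loads as null. The following is only a rough Python sketch of that parsing, not Pig code; `parse_line` is a hypothetical helper introduced for illustration.

```python
from typing import List, Optional


def parse_line(line: str, delimiter: str = ",") -> List[Optional[str]]:
    """Split one text record; an empty field becomes None (standing in for Pig's null)."""
    return [field if field != "" else None for field in line.split(delimiter)]


# Two records taken from the 'activity' file above.
rows = [parse_line(line) for line in ["00,1239698069000,", "01,1239698505000,b"]]

# rows[0] carries a null name field; rows[1] is fully populated.
```

So the grouped bag for sid 00 contains a single tuple whose second field is null.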
      

      The following script works correctly:

      -- load input data
      activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
      
      -- group input data
      activities = GROUP activities BY sid;
      activities = FOREACH activities GENERATE group, activities.(timestamp, name);
      
      -- store grouped activities in a temporary file
      STORE activities INTO 'tmp' USING PigStorage();
      
      -- reload grouped activities from the temporary file
      activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
      
      -- store grouped activities again in an output file
      STORE activities INTO 'output' USING PigStorage();
      

      After running this script, the 'output' file contains a correct result:

      00	{(1239698069000,)}
      01	{(1239698505000,b),(1239698369000,a)}
      02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
      03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
      04	{(1239698417000,c)}
      

      The issue occurs when I use BinStorage() instead of PigStorage() to store and reload the temporary file. In that case the 'output' file is incomplete (the bag for sid 00 is missing):

      00	
      01	{(1239698505000,b),(1239698369000,a)}
      02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
      03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
      04	{(1239698417000,c)}
      

      The failing script is the following:

      -- load input data
      activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
      
      -- group input data
      activities = GROUP activities BY sid;
      activities = FOREACH activities GENERATE group, activities.(timestamp, name);
      
      -- store grouped activities in a temporary file
      STORE activities INTO 'tmp' USING BinStorage();
      
      -- reload grouped activities from the temporary file
      activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
      
      -- store grouped activities again in an output file
      STORE activities INTO 'output' USING PigStorage();
      

      So the issue seems to lie in the way BinStorage() stores or loads bags.

      1. activity
        0.2 kB
        Vincent BARAT
      2. PIG-2271.0.patch
        2 kB
        Thejas M Nair
      3. PIG-2271.1.patch
        10 kB
        Thejas M Nair

        Activity

        Daniel Dai added a comment -

        +1

        Thejas M Nair added a comment -

        I would like to clarify that the type conversion fails only when the user casts a complex type (tuple/bag/map) using a schema with an inner schema, and one of the values inside is null.

        Thejas M Nair added a comment -

        PIG-2271.1.patch - patch with test cases.

        Thejas M Nair added a comment -

        PIG-2271.0.patch - initial patch. Test cases need to be added.

        When a user-specified schema was present, the type conversion did not handle nulls correctly and resulted in a cast failure. As a result, the type conversion for a tuple that contained a null was not successful.
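
As a rough illustration of that failure mode (a Python sketch, not the code in the patch and not Pig's actual implementation), the example below casts each field of a tuple with a per-field converter. The strict variant assumes no field is null and fails on the first null it meets; the null-safe variant passes nulls through. All names here (`to_long`, `to_chararray`, `cast_tuple_strict`, `cast_tuple_null_safe`, `cast_bag`) and the use of byte strings as raw values are assumptions made for illustration.

```python
def to_long(raw: bytes) -> int:
    """Convert raw bytes to an integer (stands in for a cast to long)."""
    return int(raw.decode("utf-8"))


def to_chararray(raw: bytes) -> str:
    """Convert raw bytes to a string (stands in for a cast to chararray)."""
    return raw.decode("utf-8")


def cast_tuple_strict(fields, converters):
    """Buggy variant: applies every converter, even to a null field."""
    return tuple(conv(f) for conv, f in zip(converters, fields))


def cast_tuple_null_safe(fields, converters):
    """Fixed variant: a null field stays null instead of being converted."""
    return tuple(None if f is None else conv(f)
                 for conv, f in zip(converters, fields))


def cast_bag(bag, converters, cast_tuple):
    """Cast every tuple in a bag; a failed cast yields None for the whole bag,
    mimicking the missing bag seen in the 'output' file."""
    try:
        return [cast_tuple(t, converters) for t in bag]
    except (AttributeError, TypeError, ValueError):
        return None


schema = (to_long, to_chararray)
bag = [(b"1239698069000", None)]  # the tuple for sid 00, with a null name

broken = cast_bag(bag, schema, cast_tuple_strict)    # None: the cast failed
fixed = cast_bag(bag, schema, cast_tuple_null_safe)  # [(1239698069000, None)]
```

Bags without nulls convert identically under both variants, which matches the report: only the row containing a null field was affected.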

        Vincent BARAT added a comment -

        Hi Daniel,

        I investigated further and fully reformulated the issue. No UDF is involved any more, and I can reproduce it with the official 0.9.1 release.

        The issue is related to BinStorage (but I can also reproduce it using PigStorage(',')).

        This is a blocking issue for me, as I need to use BinStorage() to load some binary data. It prevents me from using Pig 0.9.1.

        Thanks a lot for your time.

        Daniel Dai added a comment -

        Can you do the following:
        1. Get the output schema for MyUDF. (describe activities)
        2. Use a different constructor for BinStorage: BinStorage('org.apache.pig.builtin.Utf8StorageConverter')

          People

          • Assignee:
            Thejas M Nair
            Reporter:
            Vincent BARAT
          • Votes:
            0
            Watchers:
            1
