PIG-2875

Add recursive record support to AvroStorage

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Currently, AvroStorage does not allow recursive records in an Avro schema because it is not possible to define a Pig schema for them: a record with a self-referencing field causes an infinite loop, so such records are not supported.

      Even though there is no natural way of handling recursive records in a Pig schema, I'd like to propose the following workaround: mapping recursive records to bytearray.

      Take for example the following Avro schema:

      {
        "type" : "record",
        "name" : "RECURSIVE_RECORD",
        "fields" : [ {
          "name" : "value",
          "type" : [ "null", "int" ]
        }, {
          "name" : "next",
          "type" : [ "null", "RECURSIVE_RECORD" ]
        } ]
      }
      

      and the following data:

      {"value":1,"next":{"RECURSIVE_RECORD":{"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}}} 
      {"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}} 
      {"value":3,"next":null}
      

      Then, we can define Pig schema as follows:

      {value: int,next: bytearray}
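
The mapping above can be sketched outside Pig as follows. This is only an illustration in Python, not the AvroSchema2Pig code: the helper names (`pig_schema`, `pig_fields`, `pig_type`) and the small primitive table are assumptions made for this example. A reference back to a record that is still being expanded is emitted as bytearray instead of being followed.

```python
import json

# Avro primitive -> Pig type; only a few types, enough for this sketch.
PRIMITIVES = {"int": "int", "long": "long", "float": "float",
              "double": "double", "string": "chararray", "bytes": "bytearray"}

AVRO_SCHEMA = """
{ "type": "record", "name": "RECURSIVE_RECORD",
  "fields": [ {"name": "value", "type": ["null", "int"]},
              {"name": "next",  "type": ["null", "RECURSIVE_RECORD"]} ] }
"""

def pig_fields(record, open_records):
    """Render the fields of a record; open_records names the enclosing records."""
    inner = open_records | {record["name"]}
    return ",".join(f'{f["name"]}: {pig_type(f["type"], inner)}'
                    for f in record["fields"])

def pig_type(avro_type, open_records):
    if isinstance(avro_type, str):
        if avro_type in open_records:
            return "bytearray"  # self-reference: stop expanding here
        if avro_type in PRIMITIVES:
            return PRIMITIVES[avro_type]
        raise ValueError(f"type not covered by this sketch: {avro_type}")
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError("only [null, T] unions are covered by this sketch")
        return pig_type(non_null[0], open_records)
    if isinstance(avro_type, dict) and avro_type.get("type") == "record":
        return "(" + pig_fields(avro_type, open_records) + ")"
    raise ValueError("type not covered by this sketch")

def pig_schema(avro_schema_json):
    return "{" + pig_fields(json.loads(avro_schema_json), frozenset()) + "}"

print(pig_schema(AVRO_SCHEMA))  # {value: int,next: bytearray}
```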
      

      Even though Pig thinks that the "next" fields are bytearray, they're actually loaded as tuples since AvroStorage uses Avro schema when loading files.

      grunt> in = LOAD 'test_recursive_schema.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage ();
      grunt> dump in;
      (1,(2,(3,)))
      (2,(3,))
      (3,)
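
To see how the sample data turns into the nested tuples shown above, here is a rough Python sketch. It is not how AvroStorage reads files; it only assumes the JSON form of the data given earlier, where a non-null "next" is wrapped as {"RECURSIVE_RECORD": ...}. The names `to_tuple` and `pig_repr` are made up for this example.

```python
import json

LINES = [
    '{"value":1,"next":{"RECURSIVE_RECORD":{"value":2,"next":'
    '{"RECURSIVE_RECORD":{"value":3,"next":null}}}}}',
    '{"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}',
    '{"value":3,"next":null}',
]

def to_tuple(record):
    """Convert one record into a (value, next) tuple, recursing into next."""
    nxt = record["next"]
    if nxt is not None:
        nxt = to_tuple(nxt["RECURSIVE_RECORD"])  # unwrap the union branch
    return (record["value"], nxt)

def pig_repr(item):
    """Format a value the way dump prints it: nulls appear as empty fields."""
    if item is None:
        return ""
    if isinstance(item, tuple):
        return "(" + ",".join(pig_repr(x) for x in item) + ")"
    return str(item)

for line in LINES:
    print(pig_repr(to_tuple(json.loads(line))))
```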
      

      At this point, there is a discrepancy between the Avro schema and the Pig schema; nevertheless, we can still refer to each field of the tuples as follows:

      grunt> first = FOREACH in GENERATE $0;
      grunt> dump first;
      (1)
      (2)
      (3)
      
      or
      
      grunt> second = FOREACH in GENERATE $1.$0;
      grunt> dump second;
      (2)
      (3)
      ()
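
The two projections can be mimicked on plain Python tuples. This is just a sketch of positional access with nulls propagating, which is consistent with the outputs above; `project` is a hypothetical helper, not a Pig API.

```python
def project(row, *positions):
    """Follow positional references such as $1.$0 through nested tuples."""
    current = row
    for pos in positions:
        if current is None:
            return None  # projecting out of a null yields null
        current = current[pos]
    return current

rows = [(1, (2, (3, None))), (2, (3, None)), (3, None)]

first = [(project(r, 0),) for r in rows]      # like GENERATE $0
second = [(project(r, 1, 0),) for r in rows]  # like GENERATE $1.$0

print(first)
print(second)
```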
      

      Lastly, we can store these tuples as Avro files by specifying a schema. Since the Avro schema can no longer be constructed from the Pig schema, the user must provide it via the 'schema' parameter of the STORE function.

      grunt> STORE first INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage ( 'schema', '[ "null", "int" ]' );
      
      or
      
      grunt> STORE in INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage ( 'schema', '
      {
        "type" : "record",
        "name" : "recursive_schema",
        "fields" : [ { 
          "name" : "value",
          "type" : [ "null", "int" ]
        }, {
          "name" : "next",
          "type" : [ "null", "recursive_schema" ]
        } ] 
      }
      ' );
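
The reason the schema has to come from the user can be illustrated with a small Python sketch: the tuples carry only positions, so field names and the recursive structure must be taken from the supplied schema. This is not the AvroStorage writer; `to_record` and `to_datum` are hypothetical helpers, and only int fields, [null, T] unions and records are handled.

```python
import json

SCHEMA = json.loads("""
{ "type": "record", "name": "recursive_schema",
  "fields": [ {"name": "value", "type": ["null", "int"]},
              {"name": "next",  "type": ["null", "recursive_schema"]} ] }
""")

def to_record(tup, schema, named):
    """Pair positional tuple fields with the field names of a record schema."""
    named = {**named, schema["name"]: schema}  # make the name resolvable below
    return {field["name"]: to_datum(value, field["type"], named)
            for value, field in zip(tup, schema["fields"])}

def to_datum(value, avro_type, named):
    if isinstance(avro_type, list):  # union: null or the single other branch
        if value is None:
            if "null" not in avro_type:
                raise ValueError("null is not allowed by this union")
            return None
        branch = next(t for t in avro_type if t != "null")
        return to_datum(value, branch, named)
    if isinstance(avro_type, str) and avro_type in named:
        return to_record(value, named[avro_type], named)  # named reference
    if isinstance(avro_type, dict) and avro_type.get("type") == "record":
        return to_record(value, avro_type, named)
    if avro_type == "int":
        if not isinstance(value, int):
            raise TypeError(f"expected int, got {value!r}")
        return value
    raise ValueError("type not covered by this sketch")

record = to_record((1, (2, (3, None))), SCHEMA, {})
print(json.dumps(record))
```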
      

      To implement this workaround, the following work is required:

      • Update the current generic union check so that it can handle recursive records. Currently, AvroStorage checks whether the Avro schema contains 1) recursive records or 2) generic unions, and fails if it finds either. Since I am going to remove the first check, the second check must be able to traverse recursive records without a stack overflow.
      • Update AvroSchema2Pig so that recursive records are detected and mapped to bytearrays in the Pig schema.
      • Add a 'no_schema_check' parameter to the STORE function so that results can be stored despite the discrepancy between the Avro schema and the Pig schema. Since the Avro schema for the STORE function cannot be constructed from the Pig schema, the user has to specify it via the 'schema' parameter and disable the schema check with 'no_schema_check'.
      • Update AvroStorage wiki.
      • Add unit tests.
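
The first item in the list above (a union check that survives recursion) could look roughly like the following Python sketch. It is not the patch itself. It assumes that a "generic union" means any union with more than one non-null branch, and it first resolves name references so that the schema becomes a cyclic object graph, similar to a parsed schema object. `resolve` and `contains_generic_union` are names chosen for this example, and only records and unions are traversed.

```python
import json

def resolve(schema, named=None):
    """Replace name references with the referenced record, creating cycles."""
    named = {} if named is None else named
    if isinstance(schema, str):
        return named.get(schema, schema)
    if isinstance(schema, list):
        return [resolve(t, named) for t in schema]
    if schema.get("type") == "record":
        named[schema["name"]] = schema
        for field in schema["fields"]:
            field["type"] = resolve(field["type"], named)
    return schema

def contains_generic_union(schema, visited=None):
    """Return True if any union has more than one non-null branch."""
    visited = set() if visited is None else visited
    if isinstance(schema, str):
        return False
    if isinstance(schema, list):
        non_null = [t for t in schema if t != "null"]
        return len(non_null) > 1 or any(
            contains_generic_union(t, visited) for t in non_null)
    if schema.get("type") == "record":
        if id(schema) in visited:
            return False  # already being checked: do not recurse again
        visited.add(id(schema))
        return any(contains_generic_union(f["type"], visited)
                   for f in schema["fields"])
    return False

recursive = resolve(json.loads("""
{ "type": "record", "name": "RECURSIVE_RECORD",
  "fields": [ {"name": "value", "type": ["null", "int"]},
              {"name": "next",  "type": ["null", "RECURSIVE_RECORD"]} ] }
"""))
generic = resolve(json.loads("""
{ "type": "record", "name": "R",
  "fields": [ {"name": "next", "type": ["null", "R"]},
              {"name": "v",    "type": ["null", "int", "string"]} ] }
"""))
```

Without the visited set, the walk over `recursive` would follow "next" back into the same record indefinitely.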

      I do not think that any incompatibility issues will be introduced by this.

      P.S. I chose to map recursive records to bytearray instead of an empty tuple because I cannot refer to any field if I use an empty tuple. For example, if the Pig schema is defined as follows:

      {value: int,next: ()}
      

      I get an exception when I attempt to refer to any field in loaded tuples since their schema is not defined (i.e. empty tuple).

      ERROR 1127: Index 0 out of range in schema
      

      This is all based on what I found by trial and error, so there might be something that I am missing here. If so, please let me know.

      Thanks!

      1. avro_test_files.tar.gz (1 kB, Cheolsoo Park)
      2. PIG-2869.patch (34 kB, Cheolsoo Park)
      3. PIG-2875.patch (35 kB, Cheolsoo Park)
      4. PIG-2875-2.patch (59 kB, Cheolsoo Park)
      5. PIG-2875-3.patch (59 kB, Cheolsoo Park)
      6. PIG-2875-4.patch (59 kB, Cheolsoo Park)

        Activity

        Santhosh Srinivasan added a comment -

        All unit test cases in piggybank passed. Patch has been committed. Thanks Cheolsoo.

        Cheolsoo Park added a comment -

        Fixed a typo.

        Cheolsoo Park added a comment -

        There was a typo in the code:

        - String expected = basedir + "expected_testRecursiveRecordReference1";
        + String expected = basedir + "expected_testRecursiveRecordReference1.avro";
        

        Uploaded a new patch with the fix. I verified that all the tests pass with both Hadoop 20 and 23.

        Thanks!

        Santhosh Srinivasan added a comment -

        Cheolsoo Park, I see one test failure when I run the following commands:

        
        $ ant clean compile-test jar-withouthadoop -Dhadoopversion=20
        $ cd contrib/piggybank/java/
        $ ant clean test -Dhadoopversion=20
        ...
            [junit] Tests run: 33, Failures: 1, Errors: 0, Time elapsed: 107.203 sec
            [junit] Test org.apache.pig.piggybank.test.storage.avro.TestAvroStorage FAILED
        ...
        
        
        Testcase: testRecursiveRecordReference1 took 3.621 sec
                FAILED
        Expected output does not exists!
        junit.framework.AssertionFailedError: Expected output does not exists!
                at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.getExpected(TestAvroStorage.java:933)
                at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.verifyResults(TestAvroStorage.java:891)
                at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.verifyResults(TestAvroStorage.java:880)
                at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testRecursiveRecordReference1(TestAvroStorage.java:292)
        
        Cheolsoo Park added a comment -

        This was originally PIG-2869 and re-created due to INFRA-5131.

        Cheolsoo Park added a comment -

        Review board: https://reviews.apache.org/r/6536/

          People

          • Assignee: Cheolsoo Park
          • Reporter: Cheolsoo Park
          • Votes: 0
          • Watchers: 5
