PIG-2579

Support for multiple input schemas in AvroStorage

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.2, 0.11
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Hadoop Flags:
      Reviewed

      Description

      This is a barebones patch for AvroStorage which enables support of multiple input schemas. The assumption is that the input consists of avro files having different schemas that can be unioned, e.g., flat records.

      A simple illustrative example is attached (avro_storage_union_schema_test.tar.gz): run create_avro1.pig, followed by create_avro2.pig, followed by read_avro.pig.

      1. avro_storage_union_schema.patch
        50 kB
        Stan Rosenberg
      2. avro_storage_union_schema_test.tar.gz
        0.6 kB
        Stan Rosenberg
      3. PIG-2579-2.patch
        51 kB
        Cheolsoo Park
      4. PIG-2579-2-avro_test_files.tar.gz
        1 kB
        Cheolsoo Park
      5. PIG-2579-3.patch
        51 kB
        Cheolsoo Park
      6. PIG-2579-4.patch
        58 kB
        Cheolsoo Park
      7. PIG-2579-5.patch
        58 kB
        Cheolsoo Park
      8. PIG-2579-6.patch
        58 kB
        Cheolsoo Park

        Activity

        Cheolsoo Park added a comment -

        I updated Stan's original patch, rebasing it onto trunk. I kept the core logic unchanged but made the following modifications:

        1. Removed the glob-pattern-related code, since that is handled in PIG-2492.
        2. Added an option 'multiple_schema' to AvroStorage. By default, AvroStorage assumes that all the input files have the same schema, but if 'multiple_schema' is passed to the load function, it tries to merge all the input schemas.
        3. Allowed multiple schemas with the same name. Schemas are identified by their paths instead of their names.
        4. Refactored code.
        5. Added unit tests.

        I think that the most arguable part is how to merge two different schemas into one. In short, the rules are as follows:

        1. Different primitive types can be merged if certain conditions are met. Please see AvroStorageUtils.mergeType() for more details.
        2. Only complex types of the same kind can be merged, e.g. record + record => ok, but record + array => error.
        3. For records, the union of fields is returned.
        4. For arrays/maps, their element types/value types are merged.
        5. For unions, the union of unions is returned.
        6. For fixed types, only those of the same size can be merged.

        The unit test TestAvroStorageUtils shows what is expected when two schemas are merged.
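For readers who want a concrete picture of the six merge rules, here is a minimal sketch in plain Java. It is not the code in AvroStorageUtils and does not use Avro's Schema class: ToySchema and all of its methods are hypothetical names invented for this illustration. Two behaviours are assumptions rather than facts from the patch: numeric primitives are widened to the wider of the two kinds (the real conditions are in AvroStorageUtils.mergeType()), and record fields that share a name are merged recursively.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy schema model for illustration; not Avro's Schema and not the piggybank code. */
final class ToySchema {
    // The four numeric kinds come first, ordered from narrowest to widest.
    enum Kind { INT, LONG, FLOAT, DOUBLE, STRING, RECORD, ARRAY, MAP, UNION, FIXED }

    final Kind kind;
    final Map<String, ToySchema> fields = new LinkedHashMap<>(); // RECORD only
    final List<ToySchema> branches = new ArrayList<>();          // UNION only
    ToySchema element;                                           // ARRAY element or MAP value
    int size;                                                    // FIXED only

    private ToySchema(Kind kind) { this.kind = kind; }

    static ToySchema primitive(Kind kind) { return new ToySchema(kind); }
    static ToySchema record() { return new ToySchema(Kind.RECORD); }
    static ToySchema array(ToySchema e) { ToySchema s = new ToySchema(Kind.ARRAY); s.element = e; return s; }
    static ToySchema map(ToySchema v) { ToySchema s = new ToySchema(Kind.MAP); s.element = v; return s; }
    static ToySchema fixed(int size) { ToySchema s = new ToySchema(Kind.FIXED); s.size = size; return s; }
    static ToySchema union(ToySchema... bs) {
        ToySchema s = new ToySchema(Kind.UNION);
        for (ToySchema b : bs) s.branches.add(b);
        return s;
    }
    ToySchema field(String name, ToySchema type) { fields.put(name, type); return this; }

    private boolean isNumeric() { return kind.ordinal() <= Kind.DOUBLE.ordinal(); }

    static ToySchema merge(ToySchema a, ToySchema b) {
        // Rule 1. ASSUMPTION: numeric kinds widen to the wider of the two.
        if (a.isNumeric() && b.isNumeric()) {
            return primitive(a.kind.ordinal() >= b.kind.ordinal() ? a.kind : b.kind);
        }
        // Rule 2: otherwise only schemas of the same kind can be merged.
        if (a.kind != b.kind) {
            throw new IllegalArgumentException("cannot merge " + a + " with " + b);
        }
        switch (a.kind) {
            case RECORD: { // Rule 3: union of fields.
                ToySchema r = record();
                r.fields.putAll(a.fields);
                // ASSUMPTION: fields with the same name are merged recursively.
                for (Map.Entry<String, ToySchema> e : b.fields.entrySet()) {
                    r.fields.merge(e.getKey(), e.getValue(), ToySchema::merge);
                }
                return r;
            }
            case ARRAY: return array(merge(a.element, b.element)); // Rule 4
            case MAP:   return map(merge(a.element, b.element));   // Rule 4
            case UNION: { // Rule 5: union of unions, duplicates dropped.
                ToySchema u = union();
                Set<String> seen = new HashSet<>();
                List<ToySchema> all = new ArrayList<>(a.branches);
                all.addAll(b.branches);
                for (ToySchema s : all) {
                    if (seen.add(s.toString())) u.branches.add(s);
                }
                return u;
            }
            case FIXED: // Rule 6: sizes must match.
                if (a.size != b.size) {
                    throw new IllegalArgumentException("fixed sizes differ: " + a.size + " vs " + b.size);
                }
                return fixed(a.size);
            default:
                return primitive(a.kind); // identical non-numeric primitive, e.g. STRING
        }
    }

    @Override public String toString() {
        switch (kind) {
            case RECORD: return "record" + fields;
            case ARRAY:  return "array<" + element + ">";
            case MAP:    return "map<" + element + ">";
            case UNION:  return "union" + branches;
            case FIXED:  return "fixed(" + size + ")";
            default:     return kind.name().toLowerCase(Locale.ROOT);
        }
    }
}
```

In this sketch, merging a record with an int field "id" and a record with a long field "id" plus a string field "name" gives record{id=long, name=string}, while merging a record with an array throws IllegalArgumentException. TestAvroStorageUtils remains the authority on what the real merge produces.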

        Please let me know if you have any questions/concerns.

        Thanks!

        Cheolsoo Park added a comment -

        Review board: https://reviews.apache.org/r/6884/diff/

        Cheolsoo Park added a comment -

        I rebased the patch to trunk.

        Cheolsoo Park added a comment -

        Updated the patch based on Santhosh's comments in review board.

        Cheolsoo Park added a comment -

        Updating the patch.

        @Santhosh,
        Can you please also remove the following files when committing the patch? They are no longer used by tests so should be deleted.

        #	deleted:    contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_generic_union_schema.avro
        #	deleted:    contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_recursive_schema.avro
        
        Santhosh Srinivasan added a comment -

        I ran all the unit tests. On Hadoop23 there are 2 failures and 1 error; I verified that they are unrelated to this patch by reproducing them on the latest source from trunk.

        ~/src/apache/pig/trunk/contrib/piggybank/java/build/test/logs $ grep Failures TEST-org.apache.pig.piggybank.test.* | grep -v "Failures: 0"
        TEST-org.apache.pig.piggybank.test.storage.TestDBStorage.txt:Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 8.462 sec
        TEST-org.apache.pig.piggybank.test.storage.TestMultiStorage.txt:Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 7.989 sec
        
        ~/src/apache/pig/trunk/contrib/piggybank/java/build/test/logs $ grep Errors TEST-org.apache.pig.piggybank.test.* | grep -v "Errors: 0"
        TEST-org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles.txt:Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 8.041 sec
        
        

        The patch and the updated binaries for unit tests along with the deletions are now committed.

        Thanks Cheolsoo.

        Santhosh Srinivasan added a comment -

        Patch reviewed and committed.

        Cheolsoo Park added a comment -

        Thanks Santhosh.

        I opened PIG-2966 for the piggybank test failures.

        Cheolsoo Park added a comment -

        @Santhosh,
        I think that you omitted two files:
        contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testMultipleSchemas1.avro
        contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testMultipleSchemas2.avro

        TestAvroStorage is failing because these files are missing. Can you please commit them to trunk and branch-0.11?

        Thanks!

        Santhosh Srinivasan added a comment -

        My apologies for omitting these files. I have committed both of them to trunk and branch-0.11. Cheolsoo, thanks for pointing it out.

        Cheolsoo Park added a comment -

        @Santhosh,
        Thank you very much.

        Btw, regarding the other test failures, I realized that I was hitting MAPREDUCE-3933, and I was able to fix them by setting MALLOC_ARENA_MAX to 4 on my CentOS 6 VM. Please see PIG-2966.

        But I cannot reproduce the same failures on my Mac, which seems correct as this issue is CentOS-6-specific. Are you setting JAVA_HOME=`/usr/libexec/java_home` on your Mac? As far as I understand, those 3 tests are the only ones that use MiniCluster, and not setting JAVA_HOME will make them fail.


          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Stan Rosenberg
          • Votes:
            3
            Watchers:
            6
