Pig
  1. Pig
  2. PIG-2195

AvroStorage fails to STORE when LOADing via PigStorage

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      AvroStorage support for using a schema from an Avro schema file.

      Description

      Reading data via PigStorage and writing it via AvroStorage fails with an exception like this

      java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.avro.generic.IndexedRecord

      The Pig script in this section of the documentation shows an example like this that fails:

      http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data#AvroStorage-PigsupportforAvrodata-A.Howtostoredataindifferentways.

      A workaround currently exists to produce avro from TSVs like this:

      avro = LOAD 'inputPath/' AS (foo);
      STORE avro INTO 'outputPath/' USING oap.piggybank.storage.avro.AvroStorage(
        '{"data":"data_file.avro",
          "same":"data_file.avro", "field0":"def:bar"}');
      

      This is redundant though and data and same seem to indicate the same thing. This approach also requires an existing avro data file to exist. This patch will make the following alternate constructor syntax's work as well.

      1. Read schema from an existing data file:
          '{"data":"data_file.avro", "field0":"def:bar"}');
        
      2. Read schema from an existing schema file:
          '{"schema_file":"data_file.avsc", "field0":"def:bar"}');
        
      1. expected_testRecordSplitFromText1.avro
        0.2 kB
        Bill Graham
      2. expected_testRecordSplitFromText2.avro
        0.3 kB
        Bill Graham
      3. PIG-2195_1.patch
        20 kB
        Bill Graham

        Activity

        Hide
        Bill Graham added a comment -

        After spending many hours getting my head around what AvroStorage does, I've learned a few things some of which lead me to amend my initial assessment of the issue:

        1. Reading from PigStorage and writing via AvroStorage works as advertised when using Avro 1.4.1. The ClassCastException I initially referenced happens when using Avro 1.5.1. Also the current trunk doesn't build against 1.5.1. We can address this in a separate JIRA.
        2. There is a nasty hidden bug that causes the unit tests to not run in isolation. This can cause newly added unit tests or re-ordered tests to fail. The issue is that PigSchema2Avro has a static int tupleIndex that it increments after each call to getRecordName() and the tests expect record names to have a certain tupleIndex value included. As a temporary hack I've added a public static void setTupleIndex(int index) method to that class to allow unit tests to reset it, but this static int approach should really be revisited.
        3. I've added additional unit tests for reading from a text file and producing Avro.
        4. I've added support for passing a schema_file value instead of a data file as shown above.

        I'll upload a patch shortly.

        Show
        Bill Graham added a comment - After spending many hours getting my head around what AvroStorage does, I've learned a few things some of which lead me to amend my initial assessment of the issue: Reading from PigStorage and writing via AvroStorage works as advertised when using Avro 1.4.1. The ClassCastException I initially referenced happens when using Avro 1.5.1. Also the current trunk doesn't build against 1.5.1. We can address this in a separate JIRA. There is a nasty hidden bug that causes the unit tests to not run in isolation. This can cause newly added unit tests or re-ordered tests to fail. The issue is that PigSchema2Avro has a static int tupleIndex that it increments after each call to getRecordName() and the tests expect record names to have a certain tupleIndex value included. As a temporary hack I've added a public static void setTupleIndex(int index) method to that class to allow unit tests to reset it, but this static int approach should really be revisited. I've added additional unit tests for reading from a text file and producing Avro. I've added support for passing a schema_file value instead of a data file as shown above. I'll upload a patch shortly.
        Hide
        Bill Graham added a comment -

        Attached is a first patch that has:

        • More unit tests reading from text files.
        • A fix to how unit tests are run as described above.
        • Support for specifying a JSON schema_file.

        Also included are 2 new expected test result files that are needed for one of the new tests. The should live here:

        contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files

        Show
        Bill Graham added a comment - Attached is a first patch that has: More unit tests reading from text files. A fix to how unit tests are run as described above. Support for specifying a JSON schema_file . Also included are 2 new expected test result files that are needed for one of the new tests. The should live here: contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files
        Hide
        Bill Graham added a comment -

        This first patch is ready for review FYI. Also FYI, I've created PIG-2202 to track upgrading AvroStorage to work with Avro 1.5.1 and PIG-2201 to track a separate issue with the unit tests.

        Show
        Bill Graham added a comment - This first patch is ready for review FYI. Also FYI, I've created PIG-2202 to track upgrading AvroStorage to work with Avro 1.5.1 and PIG-2201 to track a separate issue with the unit tests.
        Hide
        Alan Gates added a comment -

        I'll be reviewing this patch.

        Show
        Alan Gates added a comment - I'll be reviewing this patch.
        Hide
        Bill Graham added a comment -

        Thanks Alan. Regarding the naming of the new schema_file field, with the recent commit of AVRO-872, Avro will supports parsing multiple schema files in 1.6.0. I'd like to contribute a patch to AvroStorage to support that at some point.

        With that in mind, should schema_file instead be schema_files so it can at some point accept a comma-separated list perhaps? Or we could allow multiple files to be passed to schema_file at some point. Or we can later introduce schema_files with an s. Many options, but something to plan for now.

        Show
        Bill Graham added a comment - Thanks Alan. Regarding the naming of the new schema_file field, with the recent commit of AVRO-872 , Avro will supports parsing multiple schema files in 1.6.0. I'd like to contribute a patch to AvroStorage to support that at some point. With that in mind, should schema_file instead be schema_files so it can at some point accept a comma-separated list perhaps? Or we could allow multiple files to be passed to schema_file at some point. Or we can later introduce schema_files with an s . Many options, but something to plan for now.
        Hide
        Alan Gates added a comment -

        Patch checked in. Thanks Bill.

        Show
        Alan Gates added a comment - Patch checked in. Thanks Bill.

          People

          • Assignee:
            Bill Graham
            Reporter:
            Bill Graham
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development