Pig / PIG-2909

Add a new option for ignoring corrupted files to AvroStorage load func

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Currently, AvroStorage load fails with AvroRuntimeException when encountering corrupted input files. For example,

      ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
      	at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:283)
      

      But it is not always desirable to fail the Pig job for bad files. It is sometimes more useful to skip them and continue.
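      The skip-and-continue behavior requested here can be illustrated with a minimal, self-contained Java sketch. It is not the code from the attached patch: the RecordSource interface, the load method, and the file names are hypothetical stand-ins, and the boolean flag is only named after the option proposed in this issue. The sketch also reads each file in one call, so a corrupted file contributes no records; a real load func streams records and may behave differently.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipBadFilesSketch {

    /** Hypothetical stand-in for a per-file reader; not part of AvroStorage. */
    interface RecordSource {
        List<String> readAll(String path) throws IOException;
    }

    /**
     * Reads every path in order. When ignoreBadFiles is true, a file whose
     * read throws is logged and skipped; otherwise the failure is rethrown
     * and the whole load fails.
     */
    static List<String> load(List<String> paths, RecordSource source,
                             boolean ignoreBadFiles) throws IOException {
        List<String> records = new ArrayList<>();
        for (String path : paths) {
            try {
                // Records are added only if the whole file was read successfully.
                records.addAll(source.readAll(path));
            } catch (IOException | RuntimeException e) {
                if (!ignoreBadFiles) {
                    throw new IOException("Failed to read " + path, e);
                }
                System.err.println("Skipping bad file " + path + ": " + e.getMessage());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("good-1.avro", "corrupt.avro", "good-2.avro");

        // Fake reader: any path containing "corrupt" simulates a broken file.
        RecordSource source = path -> {
            if (path.contains("corrupt")) {
                throw new RuntimeException("Invalid sync!");
            }
            return Arrays.asList(path + "#0", path + "#1");
        };

        try {
            // With the flag set, the corrupted file is skipped.
            System.out.println(load(paths, source, true));
            // Without the flag, the first bad file fails the load.
            load(paths, source, false);
        } catch (IOException e) {
            System.out.println("Load failed: " + e.getMessage());
        }
    }
}
```

      Catching RuntimeException alongside IOException is a deliberate choice in this sketch, because the failure quoted above surfaces as an AvroRuntimeException wrapped in an IOException.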

      1. PIG-2909.patch
        10 kB
        Cheolsoo Park
      2. PIG-2909-2.patch
        10 kB
        Cheolsoo Park
      3. PIG-2909-avro_test_files.tar.gz
        0.4 kB
        Cheolsoo Park

          Activity

          Cheolsoo Park added a comment -

          Attached is a patch that adds a new option 'ignoreBadFiles' to AvroStorage load. When it's enabled, AvroStorage load func skips bad files instead of failing.

          I am also attaching two test cases:

          • Load a corrupted avro file with 'ignoreBadFiles'.
          • Load a corrupted avro file without 'ignoreBadFiles'.

          Cheolsoo Park added a comment -

          Review board: https://reviews.apache.org/r/6940/

          Alan Gates added a comment -

          A couple of small comments posted on review board. They're both suggestions that I won't insist on. Let me know if you want to make any modifications per my comments or go with what's here and I'll run the tests and check it in.

          Cheolsoo Park added a comment -

          Thank you very much sir! I agree with your suggestions, so please let me update my patch.

          Cheolsoo Park added a comment -

          I updated my patch. I also added cleanupOnSuccess() to AvroStorage for PIG-1891.

          Alan Gates added a comment -

          Patch 2 plus new tests checked in. Thanks Cheolsoo.


            People

            • Assignee:
              Cheolsoo Park
              Reporter:
              Cheolsoo Park
            • Votes:
              0
              Watchers:
              4
