Pig / PIG-2614

AvroStorage crashes on LOADing a single bad record

    Details

      Description

      AvroStorage dies when a single bad record exists, such as one with missing fields. This is very bad on 'big data,' where bad records are inevitable. See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

      Attachments

      1. PIG-2614_0.patch
        29 kB
        Jonathan Coveney
      2. PIG-2614_1.patch
        29 kB
        Jonathan Coveney
      3. PIG-2614_2.patch
        19 kB
        Cheolsoo Park
      4. test_avro_files.tar.gz
        0.6 kB
        Cheolsoo Park

          Activity

          Jonathan Coveney added a comment -

          Russell,

          In Elephant-bird, there is a key elephantbird.mapred.input.bad.record.threshold. For whatever reason I felt like doing this, so find attached a patch that adds the functionality you want (note that it includes PIG-2551, which is more or less good to go... only because that patch brings in a Counter helper).

          The default functionality does not change. On an error, it will die. However, there are now two keys that can be set:
          pig.piggybank.storage.avro.bad.record.threshold
          pig.piggybank.storage.avro.bad.record.min

          The former sets the acceptable error ratio threshold. The latter sets the minimum number of errors before the job can fail.
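
          (To make the semantics concrete, here is a minimal sketch of what such a tracker might look like, modeled loosely on elephant-bird's InputErrorTracker. The class and member names are assumptions for illustration, not the patch's actual API:)

          import java.io.IOException;

          // Hypothetical tracker: fails only once BOTH the minimum error count
          // and the error/record ratio have been exceeded.
          public class BadRecordTracker {
              private final double threshold; // ...avro.bad.record.threshold
              private final long min;         // ...avro.bad.record.min
              private long records = 0;
              private long errors = 0;

              public BadRecordTracker(double threshold, long min) {
                  this.threshold = threshold;
                  this.min = min;
              }

              public void incRecords() { records++; }

              public void incErrors(Throwable cause) throws IOException {
                  errors++;
                  if (errors > min && (double) errors / records > threshold) {
                      throw new IOException("bad record rate exceeded threshold", cause);
                  }
              }
          }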

          Here is where you come in:

          Currently, the only error I log is on "reader.next()." Are there any other cases where errors (at least, errors indicating a bad row) can be thrown? And on an error, what do you want to happen? Skip the row, or return null? It seems to make sense to me to skip the record (also, the number of records processed and the number of errors thrown is logged in a Hadoop counter now).

          Secondly, someone needs to write tests. It currently passes the tests, but that's because the default threshold and min are 0. I don't know what is and isn't a bad Avro file, though, so yeah. Hopefully the fact that I did the implementation work will motivate someone to add tests.

          Jonathan Coveney added a comment -

          Works in both 0.10 and 0.11.

          Russell Jurney added a comment -

          reader.next() is where I've had the problem, and I'm not aware of any other cases where this gets thrown.

          It makes sense to skip the record and adjust counters. I assume the skipped/dropped record count will be output prominently at the end?

          I can add tests. When are we releasing an RC?

          Jonathan Coveney added a comment -

          Right now there is no plan to make it any more prominent than the fact that it is a Hadoop counter. I'm not sure how hard it is to print various stuff at the end... it's an interesting thought (to make it so functions can register information they want printed at the end). Either way, the counters plus the fact that any thresholds are set by the user feel sufficient to me.

          There's a thread going around about the RC. There are a handful open... if you need this patch to be in the RC, just post in the thread (with any others). My guess would be sometime next week, to give you a chance to write tests, Bill to finish up some of his fixes, Julien a chance to look over the JRuby stuff, etc. Right now I am trying to get a final list of what will go into the RC so we stop stuffing the ballot box, as it were.

          Russell Jurney added a comment -

          Thanks a lot for helping out. This feels like ONERROR. Ok, I asked the LinkedIn guys to try 0.10/JRuby, fwiw.

          pig.piggybank.storage.avro.bad.record.threshold=0.99
          pig.piggybank.storage.avro.bad.record.min=100

          So I read these as 'don't die unless max(more than 1% of the records fail, 100 records fail)'.

          Correct?

          Russell Jurney added a comment -

          Currently, it still dies after I apply the patch and set thresholds.

          java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64
          at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
          at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
          at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
          at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
          at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
          at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
          at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
          at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
          at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
          at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
          at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
          at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
          at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:273)
          ... 7 more

          Russell Jurney added a comment -

          Ah, I see. The code is like this:

          @Override
          public Tuple getNext() throws IOException {
              try {
                  if (!reader.nextKeyValue()) {
                      return null;
                  }
                  return (Tuple) this.reader.getCurrentValue();
              } catch (Exception e) {
                  throw new IOException(e);
              }
          }

          So I need to remove these conditions.
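
          (For illustration, a skip-on-error version of this method might take the following shape; hypothetical code, assuming a tracker like the one sketched above and assuming the reader can actually advance past a bad record:)

          @Override
          public Tuple getNext() throws IOException {
              // Hypothetical variant: loop and skip bad records instead of
              // rethrowing, so one bad record doesn't kill the task.
              while (true) {
                  try {
                      if (!reader.nextKeyValue()) {
                          return null; // end of input
                      }
                      Tuple t = (Tuple) reader.getCurrentValue();
                      tracker.incRecords();
                      return t;
                  } catch (Exception e) {
                      tracker.incErrors(e); // rethrows once the configured limits are exceeded
                  }
              }
          }

          (As the later comments show, this only helps if a failed read leaves the reader able to keep going.)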

          Russell Jurney added a comment -

          Followup comment... it looks like you also need to do the check in reader.nextKeyValue()

          Russell Jurney added a comment -

          Followup^2 comment: I wish I could undo comments. Retesting with the actual jar I get this...

          Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
          ... 7 more

          The missing org.slf4j.LoggerFactory class tells me a jar is missing.

          Russell Jurney added a comment -

          I added that class to the PigAvroRecordReader class and registered all these jars, and it still doesn't work:

          register /me/pig/build/ivy/lib/Pig/slf4j-api-1.6.1.jar
          register /me/pig/build/ivy/lib/Pig/slf4j-log4j12-1.6.1.jar
          register /me/pig/build/ivy/lib/Pig/log4j-1.2.16.jar
          register /me/pig/build/ivy/lib/Pig/commons-logging-1.1.1.jar

          Russell Jurney added a comment -

          I've tried adding slf4j to PigRecordReader and whatnot, but I'm stumped and am gonna kick this back to you, having tried and failed to use it.

          Jonathan Coveney added a comment -

          Thanks for taking a look. I'll probably just remove the dependence on the logger and rely on the counters. As for the bounds, it should be:

          pig.piggybank.storage.avro.bad.record.threshold=0.01
          pig.piggybank.storage.avro.bad.record.min=100

          The threshold is the error/record ratio you're willing to tolerate.
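
          (Worked through under those semantics: a job that reads 1,000,000 records would tolerate up to 10,000 bad records, i.e. 1%, and the ratio check can never kill a job before the 100th error.)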

          Do you think there is any chance that you could submit a dataset with some known bad rows and some known good rows? That would let me troubleshoot this. I'll fix the logging issue though.

          Jonathan Coveney added a comment -

          Also, how are you setting these values? The proper way is, in a script:

          set pig.piggybank.storage.avro.bad.record.threshold 0.01
          set pig.piggybank.storage.avro.bad.record.min 100

          Jonathan Coveney added a comment -

          Russell,

          This patch gets rid of the logging dependency. Let me know if you have any issues with it!

          Russell Jurney added a comment -

          Thank you sir, testing now!

          Russell Jurney added a comment -

          I will write unit tests, but I'm having trouble getting this to work in my actual error case. In my case, when one record is bad, all subsequent records are bad... the offset screws up or something in the read.
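
          (A plausible explanation: Avro data files are block-based, with sync markers between blocks, so a failed decode can leave the decoder desynchronized, and every subsequent read in that block fails until the reader skips ahead to the next sync marker.)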

          I'll make unit tests, though!

          Russell Jurney added a comment -

          Hmmmm, I'm not able to test this on real data. I don't know how to write bad Avro files that don't screw up all reads thereafter.

          Daniel Dai added a comment -

          If a unit test is too hard, you can describe your manual test steps and attach the dataset you are using.

          BTW, I also see some code in the patch that doesn't belong to this issue; can you clean it up?

          Jonathan Coveney added a comment -

          Daniel,

          The files you are referencing are from https://issues.apache.org/jira/browse/PIG-2551. I used PigCounterHelper. Those files will also be checked in with the JRuby patch (or we can just commit that one already, your call). I included the whole thing to be unambiguous, but can remove the other files.

          Russell,

          Perhaps an e2e test is better here? Is it that you can't get the function to work, or that you can't get data to work it on?

          Dmitriy V. Ryaboy added a comment -

          Looks like Jon's all over this... assigning to him.

          Would be nice to make this a global setting (and punt per-loader settings to ONERROR implementation)

          But that's outside the scope of this ticket, let's at least confirm this works.

          Russell, if you can post input that knocks you to the wrong offset, that could help with reproducing the error you are getting with the current patch.

          Jonathan Coveney added a comment -

          Yeah, some bad data would be key... if the case with avro is that one bad row throws off the offset, I can deal with that, but it'll help immensely to have something to test against.

          Cheolsoo Park added a comment -

          This is related to PIG-2909, where I introduced a new option for skipping corrupted Avro files. If an AvroRuntimeException is thrown in nextKeyValue(), the input file can be skipped. I also added a corrupted Avro file that causes AvroRuntimeException to a unit test case. PIG-2909 is already in 0.11/trunk.

          But I like this patch more since it increments a counter and lets the user configure the number of errors, etc.

          How about we incorporate this patch in PIG-3015? Because Joseph's new AvroStorage also has the -ignoreerrors option, we can do the same there. If everyone agrees, we can unlink this jira from 0.11.

          Thanks!

          Joseph Adler added a comment -

          Just taking a look at this patch now.

          If I'm reading the code correctly, it should not be necessary for the author of a UDF (specifically a LoadFunc) to do anything special to take advantage of this functionality. Is that correct?

          Joseph Adler added a comment -

          Repeating an old question: is there any reason that this patch is only for Avro? I think this could work for all storage types.

          Cheolsoo Park added a comment -

          Hi all,

          I rebased the patch to trunk. Hopefully, this will make things more clear:

          • Removed PIG-2551 code since it's already committed to trunk.
          • Replaced the ignore_bad_file option that was committed in PIG-2909 with the bad.record.threshold and bad.record.min properties.
          • Added unit test cases
            testCorruptedFile1,2,3.

          @Joe,
          I am not sure if I fully understand your question. Please correct me if I am wrong.

          You're right that InputErrorTracker can be used by any LoadFunc. What storages need to do is create an InputErrorTracker and increment the counters. Do you have a better suggestion?

          Thanks!

          Cheolsoo Park added a comment -

          In addition to applying the patch, the following commands should also be executed to run the unit test cases:

          wget https://issues.apache.org/jira/secure/attachment/12555546/test_avro_files.tar.gz
          tar -xf test_avro_files.tar.gz
          svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file.avro
          svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file
          svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile.avro
          svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile2.avro
          svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile3.avro
          

          Thanks!

          Joseph Adler added a comment -

          Could I propose an alternative?

          I like this functionality, but I don't think it should be specific to Avro records. I think it should be straightforward to modify org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader to implement this functionality for ALL LoadFunc types. Specifically, it should be possible to count the number of Exceptions thrown by the getNext method in the underlying load function (inside PigRecordReader.nextKeyValue).
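
          (Sketched for illustration, the generic version might wrap the underlying reader like this; hypothetical code that assumes a curReader field for the wrapped RecordReader and an error tracker like the one sketched earlier:)

          @Override
          public boolean nextKeyValue() throws IOException, InterruptedException {
              while (true) {
                  try {
                      if (!curReader.nextKeyValue()) {
                          return false; // end of split
                      }
                      errorTracker.incRecords();
                      return true;
                  } catch (InterruptedException e) {
                      throw e; // infrastructure failure, not a data error
                  } catch (Exception e) {
                      // Hypothetical: treat exceptions from the loader's read path
                      // as bad records; incErrors rethrows past the configured limits.
                      errorTracker.incErrors(e);
                  }
              }
          }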

          Jonathan Coveney added a comment -

          I imagine we could make this work for any RecordReader, yes. I think the difference is just that different RecordReaders fail differently...for some, you can recover, for others, you cannot. I have no problem making this more generic, though I do not have the bandwidth for it atm. I think we should have these tests, decide whether to commit this to Avro, then have another JIRA which will make it more general.

          Joseph Adler added a comment -

          I'd argue that it's not just true for different RecordReaders, but for different business problems. There are cases where it's OK to run a job even if 50% of your records are bad, and cases where an error on 1 in 1,000,000,000,000 records is unacceptable...

          How about this: I'll take the existing patch and modify it to apply to all record readers (not just AvroStorage). I'll see if I can get this to work more generally. I think it will be about the same amount of work to modify org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader as it would be to modify the rewritten AvroStorage, so why not just tackle the general problem?

          Russell Jurney added a comment -

          See PIG-3059

          Cheolsoo Park added a comment -

          Can we resolve this in a more general way in PIG-3059 then? Joe, can you please attach your work to PIG-3059?

          As of now, AvroStorage has the ignoreBadFiles option, so I think we can move this out of 0.11. (Let's get Pig 0.11 out as soon as possible!)

          Please let me know if anyone has objections.

          Thanks!

          Julien Le Dem added a comment -

          Moving this to the next release so that we can converge on Pig 0.11.

          Cheolsoo Park added a comment -

          Closing it as PIG-3059 incorporates this.


            People

            • Assignee:
              Jonathan Coveney
            • Reporter:
              Russell Jurney
            • Votes:
              0
            • Watchers:
              5
