Elephant-bird has a key for this: elephantbird.mapred.input.bad.record.threshold. For whatever reason I felt like doing this, so find attached a patch that adds the functionality you want. Note that it includes
PIG-2551, which is more or less good to go, only because that patch brings in a Counter helper.
The default behavior does not change: on an error, it will die. However, there are now two keys that can be set:
The former sets the acceptable error ratio threshold. The latter sets the minimum number of errors that must occur before the job can error out.
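To make the interaction between the two settings concrete, here is a minimal sketch of one way such a check could work. The class and method names are hypothetical and not taken from the patch, and the exact comparisons (strictly greater than the ratio, at least the minimum count) are assumptions.

```java
/** Hypothetical helper; names and comparisons are illustrative, not from the patch. */
final class BadRecordPolicy {
    private final double maxErrorRatio; // acceptable errors / records attempted
    private final long minErrors;       // errors required before failing is allowed
    private long records;
    private long errors;

    BadRecordPolicy(double maxErrorRatio, long minErrors) {
        this.maxErrorRatio = maxErrorRatio;
        this.minErrors = minErrors;
    }

    /** Call once for every record the reader attempts, good or bad. */
    void recordSeen() {
        records++;
    }

    /** Registers one bad record and returns true if the job should fail. */
    boolean errorSeen() {
        errors++;
        return shouldFail();
    }

    boolean shouldFail() {
        // Never fail without an error, or before the minimum error count is reached.
        if (errors == 0 || errors < minErrors) {
            return false;
        }
        double ratio = records == 0 ? 1.0 : (double) errors / records;
        return ratio > maxErrorRatio;
    }
}
```

With both values at 0, the first error already exceeds the ratio, so the sketch fails immediately, which matches the unchanged default of dying on an error.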
Here is where you come in:
Currently, the only error I log is on reader.next(). Are there any other cases where errors (at least, errors indicating a bad row) can be thrown? And on an error, what do you want to happen: skip the row, or return null? Skipping the record seems to make the most sense to me. Also, the number of records processed and the number of errors thrown are now logged in Hadoop counters.
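For reference, the skip-the-record option could look roughly like the sketch below. It is not the patch's code: SkippingReader, readAll, and the decode function are hypothetical names, the two long fields stand in for the Hadoop counters, and catching RuntimeException is an assumption about what a bad row throws.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of skipping bad records instead of returning null. */
final class SkippingReader {
    long records; // stand-in for a "records processed" counter
    long errors;  // stand-in for an "errors thrown" counter

    <I, O> List<O> readAll(Iterator<I> raw, Function<I, O> decode) {
        List<O> good = new ArrayList<>();
        while (raw.hasNext()) {
            I item = raw.next();
            records++;
            try {
                good.add(decode.apply(item));
            } catch (RuntimeException e) {
                // Bad row: count it and move on to the next record.
                errors++;
            }
        }
        return good;
    }
}
```

Skipping keeps nulls out of the downstream data, so a bad row is visible only through the counters.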
Second, someone needs to write tests. The patch currently passes the existing tests, but only because the default threshold and min are both 0. I don't know what is and isn't a bad Avro file, though. Hopefully the fact that I did the implementation work will motivate someone to add tests.