Here's an updated patch with a unit test and javadoc.
There's some code duplication with AvroInputFormat and AvroRecordReader, which it would be good to eliminate. This could be achieved by introducing a common superclass. Do we need to do this?
The eagle-eyed reviewer will notice that the test trims output lines, because the job writes a trailing tab character on each line. I couldn't find a way of avoiding this. Here is what I considered:
- Changing the value type to NullWritable would fix the test, but it makes the input format less useful for Streaming: the input then appears with a trailing "(null)", which is the toString representation of NullWritable instances. (Arguably, Streaming should be fixed to special-case NullWritables and ignore them.)
- I thought setting "mapred.textoutputformat.separator" to the empty string would be a workaround, but I found that this is interpreted as null, so the default value (a tab) is used. (When a Configuration is written to a file and then read back, empty properties are read as null, not as empty strings. This is probably a bug; I haven't investigated further.)
- I thought changing the key to NullWritable and the value to Text might help, by using the ignore-key feature in Streaming (MAPREDUCE-1785). However, this is not desirable for two reasons: it's not available pre-0.22, and you lose the sort by key, which is generally expected when using the identity map and reduce.
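
To make the trailing-tab behaviour concrete, here is a minimal plain-Java model of it. This is not Hadoop code: the class and method names (TrailingTabSketch, effectiveSeparator, formatLine) are hypothetical. It assumes the output format omits the separator only when the key or value is null, and that an empty separator property read back as null falls through to the default tab, as described above.

```java
public class TrailingTabSketch {
    // Default separator, matching the tab mentioned above.
    static final String DEFAULT_SEPARATOR = "\t";

    // Models the fallback: a property that was set to "" but read back
    // as null is indistinguishable from an unset one, so the default wins.
    static String effectiveSeparator(String configured) {
        return configured == null ? DEFAULT_SEPARATOR : configured;
    }

    // Hypothetical stand-in for a key/value line writer. A null key or
    // value is skipped together with the separator; an empty but
    // non-null value still gets the separator written before it.
    static String formatLine(String key, String value, String separator) {
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key);
        }
        if (key != null && value != null) {
            sb.append(separator);
        }
        if (value != null) {
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The empty-string separator has been lost, so we get a tab.
        String sep = effectiveSeparator(null);
        // Key carries the record; the value is an empty string.
        String line = formatLine("{\"a\":1}", "", sep);
        System.out.println(line.endsWith("\t"));
        // Trimming, as the test does, recovers the bare record.
        System.out.println(line.trim().equals("{\"a\":1}"));
    }
}
```

In this model only a null value suppresses the separator, which corresponds to the NullWritable option in the first bullet. With an empty value the only remaining fix is to trim on the reading side, which is what the test does.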