This patch adds a 'JustBytesWritable' and supporting InputFormat, OutputFormat, InputWriter, and OutputReader to support passing raw, unmodified, unaugmented bytes through Hadoop streaming. The purpose is to be able to run arbitrary Unix filters on entire binary files stored in HDFS as map-only jobs, taking advantage of locality and reliability offered by Hadoop.
The code is very straightforward; most methods are only one line.
A few design notes:
1. Data is stored in a JustBytesWritable, which is the simplest possible Writable wrapper around a byte array. It simply reads until the buffer is full or EOF is reached, and remembers the number of bytes read.
2. Data is read by JustBytesInputFormat in 64K chunks by default and stored in a JustBytesWritable key; the value is a NullWritable, but no value is ever read or written. The key is used instead of the value to allow the possibility of using it in a reduce.
3. Input files are never split, as most programs are not able to handle splits.
4. Input files are not decompressed, for three reasons: the purpose is to get raw data to a program; people may want to operate on compressed data (e.g., md5sum on archives); and most tools do not expect automatic decompression, so this is the "least surprising" option. It's also trivial to put a "zcat" in front of your filter.
5. Output is even simpler than input: it just writes the bytes of a JustBytesWritable key to the output stream. Output is never compressed, for reasons similar to those above.
6. The code uses the old mapred API, as that is what streaming uses.
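As a rough illustration of notes 1, 2, and 5, here is a minimal sketch of the "fill a buffer, remember the count, write the bytes back out" idea. It is not the patch's code: the class name RawChunkSketch and the methods fill, drain, and copy are hypothetical, and it uses plain java.io streams rather than Hadoop's Writable interface so that it runs standalone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical stand-in for JustBytesWritable: a fixed buffer plus a byte count.
class RawChunkSketch {
    // 64K, matching the default chunk size mentioned in note 2.
    static final int DEFAULT_CAPACITY = 64 * 1024;

    private final byte[] buffer;
    private int length; // number of valid bytes currently held

    RawChunkSketch(int capacity) {
        this.buffer = new byte[capacity];
    }

    // Read until the buffer is full or the stream hits EOF; return the byte count.
    int fill(InputStream in) throws IOException {
        length = 0;
        while (length < buffer.length) {
            int n = in.read(buffer, length, buffer.length - length);
            if (n < 0) {
                break; // EOF
            }
            length += n;
        }
        return length;
    }

    // Write exactly the bytes held, with no framing, separators, or newlines.
    void drain(OutputStream out) throws IOException {
        out.write(buffer, 0, length);
    }

    // Copy a whole stream chunk by chunk, the way a pass-through stage would.
    static long copy(InputStream in, OutputStream out, int capacity) throws IOException {
        RawChunkSketch chunk = new RawChunkSketch(capacity);
        long total = 0;
        while (chunk.fill(in) > 0) {
            chunk.drain(out);
            total += chunk.length;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Binary input including NUL, 0xFF, and newline bytes.
        byte[] input = {0, (byte) 0xFF, 10, 13, 9, 0, 1};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(input), out, 4);
        System.out.println(copied + " bytes copied, identical: "
                + Arrays.equals(input, out.toByteArray()));
    }
}
```

The small capacity of 4 in main is only there to force several chunks; the point is that the output is byte-for-byte identical to the input regardless of chunk size.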
Streaming inserts an InputWriter between the InputFormat and the map executable, and an OutputReader between the map executable and the OutputFormat; the JustBytes versions simply pass the key bytes straight through.
I've augmented IdentifierResolver to recognize "-io justbytes" on the command line and set the input/output classes appropriately.
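To make the "-io justbytes" resolution concrete, here is a small standalone sketch of mapping an identifier to a set of class names. It does not reproduce the real IdentifierResolver: the class IoResolverSketch and its fields are hypothetical, the names JustBytesInputWriter and JustBytesOutputReader are assumed from the description above, and the case-insensitive match and fallback to text handling are assumptions of this sketch.

```java
// Hypothetical stand-in for the resolver: maps an -io identifier to class names.
class IoResolverSketch {
    static final String JUST_BYTES_ID = "justbytes";

    final String inputWriter;
    final String outputReader;
    final String outputKey;
    final String outputValue;

    IoResolverSketch(String inputWriter, String outputReader,
                     String outputKey, String outputValue) {
        this.inputWriter = inputWriter;
        this.outputReader = outputReader;
        this.outputKey = outputKey;
        this.outputValue = outputValue;
    }

    // Pick the I/O classes for an identifier; unknown identifiers fall back to text.
    static IoResolverSketch resolve(String identifier) {
        if (JUST_BYTES_ID.equalsIgnoreCase(identifier)) {
            // Assumed names for the JustBytes classes; the value is never used.
            return new IoResolverSketch("JustBytesInputWriter", "JustBytesOutputReader",
                    "JustBytesWritable", "NullWritable");
        }
        // Assumed default: plain text keys and values.
        return new IoResolverSketch("TextInputWriter", "TextOutputReader", "Text", "Text");
    }

    public static void main(String[] args) {
        IoResolverSketch spec = resolve("justbytes");
        System.out.println(spec.inputWriter + " / " + spec.outputReader
                + " / " + spec.outputKey + " / " + spec.outputValue);
    }
}
```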
I've included a shell script called "mapstream" to run streaming with all required command line parameters; it makes running a binary map-only job as easy as:
mapstream indir command outdir
which runs "command" on every file in indir and writes the results to outdir.
I welcome feedback, especially if there is an even simpler way to do this. I'm not hung up on the JustBytes name and would be happy to switch to a better one. If people like the general approach, I will add unit tests and resubmit. Please also let me know if I should break this into separate patches for common and mapreduce.