Scott, thanks for the careful review!
> The above looks odd.
Yes, you're right, it was a buggy equals implementation. I replaced it with the one you provided. hashCode() is required to support hash-based MapReduce partitioning (the default); equals() is provided only to stay consistent with hashCode() and isn't otherwise required here. Good catch.
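For illustration, here's a minimal sketch of the kind of consistent equals()/hashCode() pair I mean, using a hypothetical datum wrapper (the class and field names are illustrative, not the exact ones in the patch):

```java
// Illustrative wrapper: hashCode() is what Hadoop's default HashPartitioner
// uses to assign keys to reduces, and equals() is kept consistent with it.
public class DatumWrapper<T> {
  private T datum;

  public DatumWrapper(T datum) { this.datum = datum; }

  public T datum() { return datum; }

  @Override
  public int hashCode() {
    // Basis for hash partitioning of intermediate keys.
    return datum == null ? 0 : datum.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    // Provided only for consistency with hashCode(); not otherwise needed here.
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    DatumWrapper<?> that = (DatumWrapper<?>) o;
    return datum == null ? that.datum == null : datum.equals(that.datum);
  }
}
```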
> It would be nice if the compression level was configurable.
Yes, I meant to get to that but forgot. I've now added it. Thanks.
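A rough sketch of the sort of configuration hook I have in mind follows; the property name, default, and helper methods below are illustrative and may not match what's in the patch:

```java
import org.apache.hadoop.mapred.JobConf;

public class DeflateLevelConfig {
  // Illustrative property name and default; the actual key in the patch may differ.
  private static final String DEFLATE_LEVEL_KEY = "example.avro.output.deflate.level";
  private static final int DEFAULT_DEFLATE_LEVEL = 1;

  /** Lets users pick a deflate level for Avro output files. */
  public static void setDeflateLevel(JobConf job, int level) {
    job.setInt(DEFLATE_LEVEL_KEY, level);
  }

  /** Reads the configured level back when the output file is created. */
  public static int getDeflateLevel(JobConf job) {
    return job.getInt(DEFLATE_LEVEL_KEY, DEFAULT_DEFLATE_LEVEL);
  }
}
```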
> This creates a new AvroWrapper for each output.collect().
Oops. I originally wrote it that way, but changed it while debugging to rule out a possible cause of the bug and forgot to change it back. I've now restored the reuse of a single wrapper instance.
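To illustrate the pattern (not the actual mapper in the patch): the wrapper is allocated once as a field and refilled for each record, rather than constructed inside every output.collect() call. This sketch assumes the wrapper ends up with a no-arg constructor and a datum(T) setter, and it ignores the rest of the job setup (e.g. registering the Avro serialization).

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReusingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, AvroWrapper<CharSequence>, NullWritable> {

  // Allocated once and reused for every record.
  private final AvroWrapper<CharSequence> outKey = new AvroWrapper<CharSequence>();

  public void map(LongWritable offset, Text line,
                  OutputCollector<AvroWrapper<CharSequence>, NullWritable> output,
                  Reporter reporter) throws IOException {
    outKey.datum(line.toString());              // refill the reused wrapper
    output.collect(outKey, NullWritable.get()); // no per-record allocation
  }
}
```

Reuse is safe because collect() serializes the key and value before returning, so the same instance can be refilled for the next record.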
> AvroKeySerialization: I am a bit confused about this class.
It's used to serialize map outputs and deserialize reduce inputs. The MapReduce framework uses the job's declared map output key class to look up the serialization implementation it uses to read and write intermediate keys and values.
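Concretely, the framework hands the declared map output key class to a SerializationFactory, which scans the implementations registered under "io.serializations" and uses the first one that accepts that class. The sketch below shows that lookup using Text and the built-in WritableSerialization only so it runs standalone; for Avro intermediate data the job would register the Avro key serialization and declare the Avro wrapper class instead.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;
import org.apache.hadoop.mapred.JobConf;

public class SerializationLookup {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf();

    // The job declares its intermediate (map output) key class...
    job.setMapOutputKeyClass(Text.class);

    // ...and the framework resolves the matching serialization roughly like this
    // when writing map outputs and reading reduce inputs.
    SerializationFactory factory = new SerializationFactory(job);
    Serializer<Text> serializer = factory.getSerializer(Text.class);
    Deserializer<Text> deserializer = factory.getDeserializer(Text.class);

    System.out.println("serializer:   " + serializer.getClass().getName());
    System.out.println("deserializer: " + deserializer.getClass().getName());
  }
}
```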
> Deprecated APIs are used - are the replacements not appropriate or insufficient?
Good question. Hadoop 0.20 deprecated the "old" org.apache.hadoop.mapred APIs to encourage folks to try the new org.apache.hadoop.mapreduce APIs. However, the org.apache.hadoop.mapreduce APIs are not fully functional in 0.20, so most folks continue to use org.apache.hadoop.mapred. I compile against 0.20 here because it's in the Maven repos, but this code should also work against 0.19 and perhaps even 0.18, and I'd compile against one of those instead if it were in a Maven repo.