Description
Crunch has this great Pair class (https://crunch.apache.org/apidocs/0.14.0/org/apache/crunch/Pair.html) that saves one from constantly implementing composite writables. It seems silly that we still don't have an equivalent in MR.
I would like to see a new class with the following API:
package org.apache.hadoop.io;

public class CompositeWritable<P extends WritableComparable, S extends WritableComparable>
    implements WritableComparable<CompositeWritable<P, S>> {

  public CompositeWritable(P primary, S secondary);

  public P getPrimary();
  public void setPrimary(P primary);

  public S getSecondary();
  public void setSecondary(S secondary);

  // Return true if both primaries and both secondaries are equal
  public boolean equals(Object o);

  // Return the primary's hash code
  public int hashCode();

  // Sort first by primary and then by secondary
  public int compareTo(CompositeWritable<P, S> o);

  public void readFields(DataInput in);
  public void write(DataOutput out);
}
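To make the intended semantics concrete, here is a rough sketch of how the class body might look. It is not a proposed patch. So that it compiles without Hadoop on the classpath, it declares a local stand-in interface named WritableComparable with the same three methods as the Hadoop one, and IntKey is a hypothetical leaf key used only to exercise the code. The `? super P` bounds are my assumption; the API above uses raw bounds.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Local stand-in for org.apache.hadoop.io.WritableComparable (assumption: same
// three methods), so the sketch builds without Hadoop.
interface WritableComparable<T> extends Comparable<T> {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical leaf key, present only to exercise the sketch.
final class IntKey implements WritableComparable<IntKey> {
  int value;
  IntKey(int value) { this.value = value; }
  @Override public void write(DataOutput out) throws IOException { out.writeInt(value); }
  @Override public void readFields(DataInput in) throws IOException { value = in.readInt(); }
  @Override public int compareTo(IntKey o) { return Integer.compare(value, o.value); }
  @Override public boolean equals(Object o) {
    return o instanceof IntKey && ((IntKey) o).value == value;
  }
  @Override public int hashCode() { return value; }
}

class CompositeWritable<P extends WritableComparable<? super P>,
                        S extends WritableComparable<? super S>>
    implements WritableComparable<CompositeWritable<P, S>> {

  private P primary;
  private S secondary;

  CompositeWritable(P primary, S secondary) {
    this.primary = primary;
    this.secondary = secondary;
  }

  public P getPrimary() { return primary; }
  public void setPrimary(P primary) { this.primary = primary; }
  public S getSecondary() { return secondary; }
  public void setSecondary(S secondary) { this.secondary = secondary; }

  // Equal only when both primaries and both secondaries are equal.
  @Override public boolean equals(Object o) {
    if (!(o instanceof CompositeWritable)) {
      return false;
    }
    CompositeWritable<?, ?> other = (CompositeWritable<?, ?>) o;
    return primary.equals(other.primary) && secondary.equals(other.secondary);
  }

  // Hash on the primary alone, so keys sharing a primary hash identically.
  @Override public int hashCode() { return primary.hashCode(); }

  // Order by primary first, then break ties on secondary.
  @Override public int compareTo(CompositeWritable<P, S> o) {
    int c = primary.compareTo(o.primary);
    return c != 0 ? c : secondary.compareTo(o.secondary);
  }

  // Serialize primary then secondary; readFields must read in the same order.
  @Override public void write(DataOutput out) throws IOException {
    primary.write(out);
    secondary.write(out);
  }

  @Override public void readFields(DataInput in) throws IOException {
    primary.readFields(in);
    secondary.readFields(in);
  }
}
```

Note that readFields delegates to the existing primary and secondary instances, so both must be non-null before deserialization. A real implementation would need to decide how to create them, which this sketch leaves open.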
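With such a class, implementing a secondary sort would just mean supplying a custom grouping comparator that compares primaries only. Because hashCode() depends only on the primary, the default HashPartitioner already routes every record with the same primary to the same reducer. That comparator could also be implemented as part of this JIRA: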
package org.apache.hadoop.io;

public class CompositeGroupingComparator extends WritableComparator {
  ...
}
Or some such.
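To show what grouping on the primary buys us, here is a small plain-Java simulation of the reduce side. It deliberately does not use Hadoop's WritableComparator. Key, SecondarySortSketch, SORT, GROUP and groups are all hypothetical names made up for this illustration. The idea: sort on the full key, then start a new group only when the primary changes, so each group sees its secondaries in sorted order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a composite key: a primary plus a value to sort by.
final class Key {
  final String primary;
  final int secondary;
  Key(String primary, int secondary) {
    this.primary = primary;
    this.secondary = secondary;
  }
}

final class SecondarySortSketch {
  // Sort order: primary first, then secondary (the role of compareTo).
  static final Comparator<Key> SORT =
      Comparator.comparing((Key k) -> k.primary).thenComparingInt(k -> k.secondary);

  // Grouping order: primary only, so keys differing only in secondary share a group.
  static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.primary);

  // Sort all keys, then open a new group whenever GROUP reports a change.
  static List<List<Integer>> groups(List<Key> keys) {
    List<Key> sorted = new ArrayList<>(keys);
    sorted.sort(SORT);
    List<List<Integer>> out = new ArrayList<>();
    Key previous = null;
    for (Key k : sorted) {
      if (previous == null || GROUP.compare(previous, k) != 0) {
        out.add(new ArrayList<>());
      }
      out.get(out.size() - 1).add(k.secondary);
      previous = k;
    }
    return out;
  }
}
```

In an actual job, the real comparator would be registered with Job.setGroupingComparatorClass, and the framework would do the sorting and grouping that groups() imitates here.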
Crunch also provides Tuple3, Tuple4, and TupleN classes, but I don't think we need to add equivalents. If someone really wants that capability, they can nest composite keys.
Don't forget to add unit tests!