> the users aren't likely to change the serialization format from the default for the types they are using
There is no default for Long, ByteBuffer, Map<URL,byte>, unions, etc. These are reasonable values for mapreduce processing. Moreover, a given class might be serialized and compared in different ways by different jobs.
> we end up with 3 configuration knobs per a class instead of the current 1. There are 6 classes in the MR pipeline, that means before this is done we have 18 methods to set the class/schema/serializer for the various types.
I don't follow your math here. An application need call no more methods than it did before to configure a job.
> I thought that you were going to create a marker interface for AvroRecord that has a getSchema method.
That only works for the case when a single record is the top-level type. It does not work for arrays, maps, enums, unions or primitives, all reasonable values for mapreduce. Nor does it work where one might have a legacy Writable that one also sometimes wishes to process with Avro reflection. In the general case, how you serialize something is independent of its in-memory representation.
We use the marker interface when we can, but we cannot always. When a marker interface is appropriate, job configuration looks much as it did before: one can still set key and value classes.
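To make the limitation concrete, here is a minimal sketch (the interface and class names are illustrative, not the actual Avro API): a marker interface works when the top-level type is a user-defined record class, but a plain `Map` or primitive cannot implement it, so the framework cannot discover a schema from such an instance alone.

```java
import java.util.Map;

// Hypothetical marker interface; a stand-in for an AvroRecord-style contract.
interface SchemaAware {
    String getSchema();  // simplified: a real API would return a schema object
}

// Works: a user-defined record class can carry its own schema.
class UserRecord implements SchemaAware {
    public String getSchema() {
        return "{\"type\":\"record\",\"name\":\"UserRecord\"}";
    }
}

public class MarkerDemo {
    public static void main(String[] args) {
        SchemaAware rec = new UserRecord();
        System.out.println(rec.getSchema());

        // But a library type such as java.util.Map is not ours to modify,
        // so it can never implement the marker interface; the framework
        // has no way to recover a schema from the instance itself.
        Map<String, Long> counts = Map.of("a", 1L);
        System.out.println(counts instanceof SchemaAware);  // false
    }
}
```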
> Notice that this adds another 5 types that we are going to want serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per a job. With the current proposal that means that we get 11 * 3 = 33 serialization methods.
I do not understand what you're counting here. Currently we specify job parameters with a configuration. In the future, we might change this to instead use serialized instances of five interfaces. If we do that, then we would declare a single mechanism by which these instances are serialized, just as we now declare the single mechanism by which a configuration is serialized.
> we need to use the global serialization/deserialization factory that we already have.
That's precisely what this patch does. It updates the shuffle to no longer use the serialization factory in a now-deprecated manner.
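For readers unfamiliar with the pattern, a global serialization factory can be sketched roughly as follows; the names here are illustrative only and do not reproduce Hadoop's actual `SerializationFactory` API. The first registered serialization that accepts a class is chosen.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative serialization contract: each implementation declares which
// classes it can handle.
interface Serialization {
    boolean accept(Class<?> c);
    String name();
}

public class FactoryDemo {
    // The "global" registry consulted by the framework (including the shuffle).
    static final List<Serialization> registry = new ArrayList<>();

    // Dispatch: pick the first serialization that accepts the class.
    static Serialization getSerialization(Class<?> c) {
        for (Serialization s : registry) {
            if (s.accept(c)) return s;
        }
        throw new IllegalArgumentException("no serialization for " + c.getName());
    }

    public static void main(String[] args) {
        registry.add(new Serialization() {
            public boolean accept(Class<?> c) {
                return Number.class.isAssignableFrom(c);
            }
            public String name() { return "number-serialization"; }
        });
        System.out.println(getSerialization(Long.class).name());
    }
}
```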
> moving the Class methods is a non-starter.
This is an odd statement. We're forever forbidden from making the MapReduce framework more abstract?
> Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.
That was established in HADOOP-6165. If you'd like to revisit that representation for metadata, that should be done as a separate issue. That API has not been released, and so would be easy to change if there is consensus to do so. The current issue builds on the serialization metadata mechanism already in trunk and should be evaluated independently of the metadata representation.
> we need to check on the map side whether the type the map is outputting is correct.
That check is now done by the serializer.
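In sketch form, moving the check means the serializer itself rejects values of the wrong class, rather than the map-output collector doing so. This is a simplified illustration (class names are hypothetical), not Hadoop's actual implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical serializer that enforces the declared output class.
class CheckingSerializer<T> {
    private final Class<T> declared;
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    CheckingSerializer(Class<T> declared) { this.declared = declared; }

    void serialize(Object value) throws IOException {
        // The type check lives here, inside the serializer, instead of in
        // the map-side output collector.
        if (!declared.isInstance(value)) {
            throw new IOException("wrong output class: got "
                + value.getClass().getName()
                + ", expected " + declared.getName());
        }
        out.write(value.toString().getBytes());  // stand-in for real encoding
    }
}

public class SerializerDemo {
    public static void main(String[] args) throws IOException {
        CheckingSerializer<Long> s = new CheckingSerializer<>(Long.class);
        s.serialize(42L);                // accepted: matches declared class
        try {
            s.serialize("not a long");   // rejected by the serializer itself
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```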
> you need to support union types in Writable too.
I don't understand this suggestion. A serialization system that implements unions should check unions, but a serialization system that does not support unions need not.
> i'm not wild about having Configuration.setMap. Having a function in StringUtils that takes a Map<String,String> map into and from a String seems more appropriate.
That was established in HADOOP-6420. That API has not been released, and would be easy to change if there is consensus. If you wish to revisit it and have a better implementation idea, please file a separate issue; that question should be considered independently of the current one.