|
Doug Cutting made changes - 13/May/08 04:36 PM
Would be useful for the RawComparator to be Configurable (or something similar) so that it can be configured during runtime.
This would be very useful for multi-part keys that need to arbitrarily sort on different positions. Under my proposal above, one would create a compator with:
RawComparator c = new SerializationFactory(conf).getSerialization(MyKey.class).getComparator(); So a configuration would be involved, and a serialization framework could in theory support configurable comparators. On the other hand, doing so efficiently might be hard. One could, e.g., implement JavaSerialization#getComparator() to read a configuration parameter that names a list of fields and use introspection to order things by those fields. Ideally it would generate comparator code and compile it on the fly, but that's a lot of work. Record IO provides a single generated comparator that's efficient but not parameterized. Thrift doesn't (yet) even generate comparators! Ideally IDL-generated serializers might generate a general-purpose parameterized comparator, e.g., compare(int[] fieldIds), where {1,-3} might mean to order by increasing values of the first field and decreasing values of the third. For text input (e.g., tab-separated), one could easily write a configurable comparator. We could use the serialization framework to associate a Serialization for String that does that. Would that suffice for now? > So a configuration would be involved [...]
That was a little glib. We need to make Serialization a Configurable, so that SerializationFactory can pass the configuration down to the Serialization implementation, which can then use it as it pleases. I think we ought to do this too, as a part of this issue, in order to support configurable comparators, etc. Ok, last comment makes good sense.
What's the relationship between this proposal and JobConf#getOutputValueGroupingComparator() and JobConf#getOutputKeyComparator()? Is it just a replacement of WritableComparator,get(getMapOutputKeyClass())? > What's the relationship between this proposal and JobConf#getOutputValueGroupingComparator() and JobConf#getOutputKeyComparator()?
Those are ways to override the "natural" (or default) comparator under MapReduce. This proposal is about defining the natural comparator. If we had a good configurable comparator, then we perhaps wouldn't need those methods, but I'm not sure... The framework might set io.comparator.context=grouping, and then the configurable comparator implementation could use this to decide to use the user-specified value of io.record.compare.grouping or somesuch. Yuck! BTW, those methods should both be altered to return RawComparator, not a WritableComparator, no? > those methods should both be altered to return RawComparator, not a WritableComparator, no?
Oops. They already have been. Nevermind! > BTW, those methods should both be altered to return RawComparator, not a WritableComparator, no?
I expect so. Consider a key of type Tuple (a ComparableWritable type) that holds an arbitrary list of ComparableWritable instances. If I want fine grained ability to compare/sort these keys based on a runtime configuration, I think I would be happy with providing a Configurable RawComparator class to the JobConf during job setup. Or are you suggesting best practice is to register a new TupleSerialization (that could subclass WritableSerialization and return my fancy TupleComparator). Or should I have a TupleSerialization decorator that delegates to a configurable 'base' Serialization (Text, Thrift, Writable, JSON, etc) but overrides Serialization#getComparator()? Sorry, just trying to wrap my head around the proposed changes and their implications... I still need to poke around and see the relationship with FileInput/OutputFormat classes... Ok, I think I get the gist of the new Serialization stuff.
If I'm correct, I only would need RawComparator to be Configurable. And I can continue to override the comparator via JobConf. I don't see that I would need to implement Serialization, since Writable is suitable... With the introduction of serialization framework, the need for RawComparator is somewhat broken.
In theory an object of some type (for example Double) can be serialized to its byte[] form in an arbitrary way by different serializers, so it is not possible to efficiently compare two byte arrays w/o actually deserializing the objects. Although some objects, especially writables, can precisely know how it is serialized and thus can benefit from raw byte comparison(in short we should keep RawComparator) Similarly the returned RawComparators returned by Serialization#getComparator() cannot do much except deserializing the objects and calling o1.compareTo(o2) (see DeserializerComparator and JavaSerializationComparator). I think we should
thoughts ? Enis,
I think the fact that the raw comparators depend on the serialization used is precisely why Doug wants to put it there. It seemed wrong to me at first, but it is growing on me. One very unfortunate part of using raw comparators is that Hadoop doesn't have a reasonable story if the user wants to do object-based compares. (Ignoring the utterly non-performant raw comparator that deserializes the two keys and calls compare on them.)
Yes, but my point is that, the default comparator returned by some serialization cannot do much except for deserializing the objects and calling compareTo on them. Is this assumption not correct? In either case, the developer has to write its own comparator for a specific class, under a known serialization. If we want to allow different raw comparators for different serializations (of the same class), then we may define the API like : RawComparator c = new SerializationFactory(conf).getSerialization(MyKey.class).getComparator(MyKey.class);
Note that getComparator() takes the class as an argument so that it can return a registered comparator for that class, if any, if not it can return the default(deserializing) comparator. If we do not want to allow different raw comparators, then wouldn't the attached (half-baked) patch be enough ?
Enis Soztutar made changes - 20/May/08 08:03 AM
Enis Soztutar made changes - 20/May/08 08:15 AM
> Note that getComparator() takes the class as an argument [ ... ]
That makes sense, but I had assumed that the Serialization could keep a pointer to the class and then use that in its implementation of getComparator() to look up a registered comparator. Applications would not need to pass a class to Serialization#getComparator(), since Serializeation is already parameterized by class. I thought that an object of Serialization class captures the semantics of a Serialization abstraction(such as writable). What I mean is that :
serializationFactory.getSerialization(IntWritable) equals serializationFactory.getSerialization(DoubleWritable) In that case, we need to pass the class object to getComparator(). Anyway, aside from this, what is the benefit of tying getComparator() to serialization instead of a stand alone DefaultComparator#get(ClassName) method (as in the attached patch)? > serializationFactory.getSerialization(IntWritable) equals serializationFactory.getSerialization(DoubleWritable)
No, that's not required and not the case with the current implementation. > Anyway, aside from this, what is the benefit of tying getComparator() to serialization [...] A RawComparator compares serialized data, not objects.
Hong Tang made changes - 26/Jan/09 10:05 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RawComparator Serialiation#getComparator();
Serialization is an interface, so this would be an incompatible change. We should make Serialzation an abstract class at the same time, so that we can modify it further in the future without breaking implementations.
We should then implement this method in WritableSerialization and JavaSerialization, the existing Serialization implementations.