Issue Details (XML | Word | Printable)

Key: HADOOP-3380
Type: New Feature New Feature
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Doug Cutting
Votes: 0
Watchers: 6
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

need comparators in serializer framework

Created: 13/May/08 04:31 PM   Updated: 26/Jan/09 10:05 PM
Component/s: io
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works comparator_wip1.patch 2008-05-20 08:15 AM Enis Soztutar 44 kB
Text File Licensed for inclusion in ASF works comparator_wip1.patch 2008-05-20 08:03 AM Enis Soztutar 41 kB

Labels:


 Description  « Hide
The new serialization framework permits Hadoop to incorporate different serialization systems, including Hadoop's Writable, Thrift, Java Serialization, etc. It provides a generic, extensible means (SerializationFactory) to create serializers and deserializers for arbitrary Java classes. However it does not include a generic means to create comparators for these classes. Comparators are required for MapReduce keys and many other computations. Thus we should enhance the serialization framwork to provide comparators too.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Doug Cutting added a comment - 13/May/08 04:34 PM
A simple way to add comparators would be to add the method:

RawComparator Serialiation#getComparator();

Serialization is an interface, so this would be an incompatible change. We should make Serialzation an abstract class at the same time, so that we can modify it further in the future without breaking implementations.

We should then implement this method in WritableSerialization and JavaSerialization, the existing Serialization implementations.


Doug Cutting made changes - 13/May/08 04:36 PM
Field Original Value New Value
Link This issue blocks HADOOP-3315 [ HADOOP-3315 ]
Chris K Wensel added a comment - 13/May/08 05:09 PM
Would be useful for the RawComparator to be Configurable (or something similar) so that it can be configured during runtime.

This would be very useful for multi-part keys that need to arbitrarily sort on different positions.


Doug Cutting added a comment - 13/May/08 05:38 PM
Under my proposal above, one would create a compator with:

RawComparator c = new SerializationFactory(conf).getSerialization(MyKey.class).getComparator();

So a configuration would be involved, and a serialization framework could in theory support configurable comparators. On the other hand, doing so efficiently might be hard. One could, e.g., implement JavaSerialization#getComparator() to read a configuration parameter that names a list of fields and use introspection to order things by those fields. Ideally it would generate comparator code and compile it on the fly, but that's a lot of work. Record IO provides a single generated comparator that's efficient but not parameterized. Thrift doesn't (yet) even generate comparators! Ideally IDL-generated serializers might generate a general-purpose parameterized comparator, e.g., compare(int[] fieldIds), where {1,-3} might mean to order by increasing values of the first field and decreasing values of the third.

For text input (e.g., tab-separated), one could easily write a configurable comparator. We could use the serialization framework to associate a Serialization for String that does that. Would that suffice for now?


Doug Cutting added a comment - 13/May/08 05:51 PM
> So a configuration would be involved [...]

That was a little glib. We need to make Serialization a Configurable, so that SerializationFactory can pass the configuration down to the Serialization implementation, which can then use it as it pleases. I think we ought to do this too, as a part of this issue, in order to support configurable comparators, etc.


Chris K Wensel added a comment - 13/May/08 06:02 PM
Ok, last comment makes good sense.

What's the relationship between this proposal and JobConf#getOutputValueGroupingComparator() and JobConf#getOutputKeyComparator()? Is it just a replacement of WritableComparator,get(getMapOutputKeyClass())?


Doug Cutting added a comment - 13/May/08 06:26 PM
> What's the relationship between this proposal and JobConf#getOutputValueGroupingComparator() and JobConf#getOutputKeyComparator()?

Those are ways to override the "natural" (or default) comparator under MapReduce. This proposal is about defining the natural comparator. If we had a good configurable comparator, then we perhaps wouldn't need those methods, but I'm not sure... The framework might set io.comparator.context=grouping, and then the configurable comparator implementation could use this to decide to use the user-specified value of io.record.compare.grouping or somesuch. Yuck!

BTW, those methods should both be altered to return RawComparator, not a WritableComparator, no?


Doug Cutting added a comment - 13/May/08 06:40 PM
> those methods should both be altered to return RawComparator, not a WritableComparator, no?

Oops. They already have been. Nevermind!


Chris K Wensel added a comment - 13/May/08 06:49 PM
> BTW, those methods should both be altered to return RawComparator, not a WritableComparator, no?

I expect so.

Consider a key of type Tuple (a ComparableWritable type) that holds an arbitrary list of ComparableWritable instances.

If I want fine grained ability to compare/sort these keys based on a runtime configuration, I think I would be happy with providing a Configurable RawComparator class to the JobConf during job setup.

Or are you suggesting best practice is to register a new TupleSerialization (that could subclass WritableSerialization and return my fancy TupleComparator).

Or should I have a TupleSerialization decorator that delegates to a configurable 'base' Serialization (Text, Thrift, Writable, JSON, etc) but overrides Serialization#getComparator()?

Sorry, just trying to wrap my head around the proposed changes and their implications... I still need to poke around and see the relationship with FileInput/OutputFormat classes...


Chris K Wensel added a comment - 13/May/08 07:17 PM
Ok, I think I get the gist of the new Serialization stuff.

If I'm correct, I only would need RawComparator to be Configurable. And I can continue to override the comparator via JobConf. I don't see that I would need to implement Serialization, since Writable is suitable...


Enis Soztutar added a comment - 14/May/08 11:34 AM
With the introduction of serialization framework, the need for RawComparator is somewhat broken.
In theory an object of some type (for example Double) can be serialized to its byte[] form in an arbitrary way by different serializers, so it is not possible to efficiently compare two byte arrays w/o actually deserializing the objects. Although some objects, especially writables, can precisely know how it is serialized and thus can benefit from raw byte comparison(in short we should keep RawComparator)
Similarly the returned RawComparators returned by Serialization#getComparator() cannot do much except deserializing the objects and calling o1.compareTo(o2) (see DeserializerComparator and JavaSerializationComparator).

I think we should

  1. not change Serialization interface
  2. introduce DefaultComparator extending DeserializerComparator, implementing Configurable, and with static register(Class, RawComparator) and get(Class) methods.
    DefaultComparator.get(Class keyClass) should check for registered Comparator instances for a given class, if unsuccessful, it should return itself, obtaining Deserializer by calling serializationFactory.getDeSerializer(c);
  3. replace usages of WritableComparator#define() with DefaultComparator#register(),
  4. WritableComparator extends DefaultComparator
  5. fix JobConf#getOutputValueGroupingComparator(), so that it uses DefaultComparator.
  6. depracate JavaSerializationComparator (since it is not needed once we have DefaultComparator extending DeserializerComparator)

thoughts ?


Owen O'Malley added a comment - 20/May/08 07:00 AM
Enis,
I think the fact that the raw comparators depend on the serialization used is precisely why Doug wants to put it there. It seemed wrong to me at first, but it is growing on me. One very unfortunate part of using raw comparators is that Hadoop doesn't have a reasonable story if the user wants to do object-based compares. (Ignoring the utterly non-performant raw comparator that deserializes the two keys and calls compare on them.)

Enis Soztutar added a comment - 20/May/08 08:03 AM

I think the fact that the raw comparators depend on the serialization used is precisely why Doug wants to put it there.

Yes, but my point is that, the default comparator returned by some serialization cannot do much except for deserializing the objects and calling compareTo on them. Is this assumption not correct? In either case, the developer has to write its own comparator for a specific class, under a known serialization.

If we want to allow different raw comparators for different serializations (of the same class), then we may define the API like :

RawComparator c = new SerializationFactory(conf).getSerialization(MyKey.class).getComparator(MyKey.class);

Note that getComparator() takes the class as an argument so that it can return a registered comparator for that class, if any, if not it can return the default(deserializing) comparator.

If we do not want to allow different raw comparators, then wouldn't the attached (half-baked) patch be enough ?


Enis Soztutar made changes - 20/May/08 08:03 AM
Attachment comparator_wip1.patch [ 12382359 ]
Enis Soztutar made changes - 20/May/08 08:15 AM
Attachment comparator_wip1.patch [ 12382360 ]
Doug Cutting added a comment - 20/May/08 04:21 PM
> Note that getComparator() takes the class as an argument [ ... ]

That makes sense, but I had assumed that the Serialization could keep a pointer to the class and then use that in its implementation of getComparator() to look up a registered comparator. Applications would not need to pass a class to Serialization#getComparator(), since Serializeation is already parameterized by class.


Enis Soztutar added a comment - 23/May/08 06:01 PM
I thought that an object of Serialization class captures the semantics of a Serialization abstraction(such as writable). What I mean is that :

serializationFactory.getSerialization(IntWritable) equals serializationFactory.getSerialization(DoubleWritable)

In that case, we need to pass the class object to getComparator().

Anyway, aside from this, what is the benefit of tying getComparator() to serialization instead of a stand alone DefaultComparator#get(ClassName) method (as in the attached patch)?


Doug Cutting added a comment - 23/May/08 06:25 PM
> serializationFactory.getSerialization(IntWritable) equals serializationFactory.getSerialization(DoubleWritable)

No, that's not required and not the case with the current implementation.

> Anyway, aside from this, what is the benefit of tying getComparator() to serialization [...]

A RawComparator compares serialized data, not objects.


Hong Tang made changes - 26/Jan/09 10:05 PM
Link This issue blocks HADOOP-3315 [ HADOOP-3315 ]