Issue Details (XML | Word | Printable)

Key: HADOOP-475
Type: New Feature New Feature
Status: Closed Closed
Resolution: Won't Fix
Priority: Major Major
Assignee: Vivek Ratan
Reporter: Runping Qi
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

The value iterator to reduce function should be clonable

Created: 24/Aug/06 09:17 PM   Updated: 08/Jul/09 04:51 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Blocker
 

Resolution Date: 11/Jul/07 12:15 AM


 Description  « Hide
In the current framework, when the user implements the reduce method of Reducer class,
the user can only iterate through the value iterator once.
This makes it hard for the user to perform join-like operations with in the reduce method.
To address problem, one approach is to make the input value iterator clonable. Then the user can iterate the values in different ways.
If the iterator can be reset, then the user can perform nested iterations over the data, thus
carry out join-likeoperations.

The user code in reduce method would be something like:

iterator1 = values.clone();
iterator2 = values.clone();
while (iterator1.hasNext()) {
val1 = iterator1.next();
iterator2.reset();
while (iterator2.hasNext()) { val2 = iterator.next(); do something vased on val1 and val2 ....................... }
}

One possible optimization is that if the values are sorted based on a secondary key,
the reset function can take a secondary key as an argument and reset the iterator to the begining
position of the secondary key. It will be very helpful if there is a utility that returns a list of iterators,
one per secondary key value, from the given iterator:

TreeMap getIteratorsBasedOnSecondaryKey(iterator);

Each entry in the returned map object is a pair of <secondary key, iterator for the values with the same secondary key>.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Vivek Ratan added a comment - 27/Jun/07 12:02 PM
Making the iterator cloneable seems clear enough.

For the optimization, are you suggesting that users be able to define any number of iterators over the values for a given key, each iterator based on a particular comparator function that compares two values? If you look at HADOOP-485, we allow users to provide two comparator functions. One works on composite keys and is used when sorting and merging in the Map phase, and when merging Map outputs in the Reduce phase. The other works on the basic key. These two together allow users to dictate the order of values within a given key, when their reduce function is called. It seems like what you're asking for is something more general than this: rather than allow just one ordering of values within a key, they can order values in many other ways. Is that right?

If so, this can probably be done just as well in user code: collect all the values for a given key, then sort them in multiple ways using different comparators.


Vivek Ratan added a comment - 27/Jun/07 12:53 PM
I wanted to expand on my previous comment. When I said "this can probably be done just as well in user code", I didn't necessarily imply that we let each user write his/her code to do this. What I was implying was that either we build a set of user-level classes (i.e., perhaps not part of the core platform conceptually, but still written by us) or we develop sample code and maybe let users copy from it. Seems to me like everytime you want someone to define a new iterator over a set of values, you need to clone the set of values and sort the copy using a different comparator, and then provide an iterator over it. This can be a bit tricky if the values don't all fit in memory - we'll need disk support for it. As Doug points out in one of his comments for HADOOP-485, we could maybe use SequenceFile for that.

But I think we first need to figure out how to present this to the user - what classes should they have, how will the functionality appear to them, etc. Runping, you should probably have some good insight into this.


Runping Qi added a comment - 27/Jun/07 02:51 PM

The data_join package in contrib provides user level implementation of this feature.
It is an in-memory solution for now. But it does allow to plugin disk-backed iterators.
It may not be as efficient as a framework backed solution. But it is simple and works well
in most common cases, and it does not introduce complexity in the framework.

In my opinion, it may be not worthy to pursure this Jira further. Rather, it is a low hang fruit to just provide a disk backed
iterator plugin to the data_join package.


Doug Cutting added a comment - 28/Jun/07 06:58 PM
I think the original intention was to just make the iterator cloneable, so that a reducer could iterate through the values more than once. It might first, e.g., count the total number of values, then, having this count, filter them or somesuch. This should work even when the values would not all fit in memory. Whether this feature is still interesting to folks I cannot say...

Vivek Ratan added a comment - 11/Jul/07 12:15 AM
As per Runping's comments, we don't need this functionality right away.