Issue Details (XML | Word | Printable)

Key: HADOOP-2302
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Devaraj Das
Reporter: Lohit Vijayarenu
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Streaming should provide an option for numerical sort of keys

Created: 28/Nov/07 11:04 PM   Updated: 08/Jul/09 05:05 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 2302-new.patch 2008-07-31 04:54 AM Devaraj Das 52 kB
Text File Licensed for inclusion in ASF works 2302-new.patch 2008-07-30 07:46 PM Devaraj Das 51 kB
Text File Licensed for inclusion in ASF works 2302-new1.patch 2008-07-31 01:02 PM Devaraj Das 52 kB
Text File Licensed for inclusion in ASF works 2302-new2.patch 2008-08-04 07:43 PM Devaraj Das 52 kB
Text File Licensed for inclusion in ASF works 2302.1.patch 2008-07-22 07:02 PM Devaraj Das 41 kB
Text File Licensed for inclusion in ASF works 2302.3.patch 2008-07-24 07:52 PM Devaraj Das 43 kB

Hadoop Flags: Reviewed
Release Note: Introduced numerical key comparison for streaming.
Resolution Date: 07/Aug/08 12:09 PM


 Description  « Hide
It would be good to have an option for numerical sort of keys for streaming.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
arkady borkovsky added a comment - 11/Dec/07 01:04 AM
This issue should probably be generalized to
"Streaming should provide more powerful specification primary keys comparator"
(meaning the comparator used for splits).

It should allow at least
– numeric comparison
– reverse order comparison
– multiple field comparison

One possible way to specify the comparison in the streaming command line is to use the familiar syntax of Unix sort command, like
"-k2,2rn -k1,1"
for "compare the second field, numerically, large value first; if equal, compare the first field, alphabetically"

Note that this specification implicitly defines the part of the string that is the key for shuffling purposes


Devaraj Das added a comment - 24/Jun/08 06:45 PM
Would supporting the basic unix/gnu sort options in the comparator work:
-f, (Ignore-case)
-n, (Sort numerically)
-r, (Reverse the result of comparison)
-k pos1[,pos2], where pos is of the form f[.c][opts], where f is the number of the field to use, and c is the number of the first character from the beginning of the field. Fields and character positions are numbered starting with 1; a character position of zero in pos2 indicates the field's last character. If '.c' is omitted from pos1, it defaults to 1 (the beginning of the field); if omitted from pos2, it defaults to 0 (the end of the field). opts are ordering options (any of fnr as described above).

We assume that the fields in the key are separated by map.output.key.field.separator (already exists).

Do we need anything else?

Also, this could be done in a Java comparator implementation that the user specifies to the framework via mapred.output.key.comparator.class. This comparator would be used by both sort and merge.


Arkady Borkovsky added a comment - 24/Jun/08 07:49 PM
I think this describes the needs very well.

Devaraj Das added a comment - 22/Jul/08 07:02 PM
A reasonably well tested patch. The following is done:
1) Options supported are -n (numeric comparison) and -r (reverse the result of comparison). So for e.g., one could say "-k1.1,1.2 -k2.1,2.3n -k2.4,3.1nr" as the value of mapred.text.key.comparator.job option (that the comparator understands).
2) Some refactoring is done - I needed access to the findBytes method defined in Streaming.UTF8ByteArrayUtils. But since this comparator implementation need not be dependent on the Streaming package, I made a new class org.apache.hadoop.util.UTF8ByteArrayUtils and filled that up with the "real" bytearray util methods. A few Streaming specific methods like findTab also exists in the Streaming.UTF8ByteArrayUtils. I moved them to a new class called Streaming.StreamKeyValUtil. All in all, i introduced two new classes and deprecated Streaming.UTF8ByteArrayUtils.
3) There is a partitioner function defined that would take a hash of just the portions of the keys that the user is interested in (using the same spec as the one defined for the comparator).

A note - the numCompare method in the comparator may be slightly verbose in terms of the code but that should help readability of the code.


Hadoop QA added a comment - 22/Jul/08 10:43 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386652/2302.1.patch
against trunk revision 678845.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 4 new Findbugs warnings.

-1 release audit. The applied patch generated 211 release audit warnings (more than the trunk's current 210 warnings).

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2923/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2923/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2923/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2923/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2923/console

This message is automatically generated.


Devaraj Das added a comment - 24/Jul/08 07:52 PM
This patch fixes the hudson warnings

Hadoop QA added a comment - 25/Jul/08 07:49 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386822/2302.3.patch
against trunk revision 679601.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2947/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2947/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2947/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2947/console

This message is automatically generated.


Arun C Murthy added a comment - 27/Jul/08 12:35 AM
Overall it looks fine, except I'd separate out the partitioner and the comparators i.e. split up UnixComparatorPartitioner. Also, it might be good to add a Hadoop Streaming test case too.

Devaraj Das added a comment - 30/Jul/08 07:46 PM
Fixes Arun's comment on splitting the comparator and partitioner. The streaming testcase seems redundant since i already have a testcase for the Java app, and the code path in the comparator would exactly be the same for a streaming app.

Hadoop QA added a comment - 31/Jul/08 12:04 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387221/2302-new.patch
against trunk revision 681223.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 5 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 release audit. The applied patch generated 213 release audit warnings (more than the trunk's current 212 warnings).

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2987/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2987/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2987/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2987/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2987/console

This message is automatically generated.


Arun C Murthy added a comment - 31/Jul/08 01:05 AM
+1, we just need to fix the javac and audit warnings...

Devaraj Das added a comment - 31/Jul/08 04:54 AM - edited
The release audit warning is fixed. The javadoc warning doesn't make sense to me.
trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/UTF8ByteArrayUtils.java:210: warning - Tag @link: can't find readLine(LineReader, Text) in org.apache.hadoop.streaming.StreamKeyValUtil

The method org.apache.hadoop.streaming.StreamKeyValUtil.readLine with the proper javadoc does exist.


Hadoop QA added a comment - 31/Jul/08 09:48 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387255/2302-new.patch
against trunk revision 681243.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 5 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2993/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2993/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2993/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2993/console

This message is automatically generated.


Devaraj Das added a comment - 31/Jul/08 01:02 PM
Sorry, had attached the incorrect patch earlier.

Devaraj Das added a comment - 31/Jul/08 01:06 PM
Submitting to hudson to get the tests through. javadoc is bound to fail for the reason mentioned earlier (pls let me know if anyone has an workaround for this or if i am missing something).

Hadoop QA added a comment - 31/Jul/08 06:39 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387272/2302-new1.patch
against trunk revision 681243.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 5 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2998/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2998/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2998/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2998/console

This message is automatically generated.


Devaraj Das added a comment - 04/Aug/08 07:43 PM
Thanks to Sharad for helping me debug the javadoc issue. The issue was that LineReader is an inner class of LineRecordReader and I had to qualify LineReader as LineRecordReader.LineReader in the javadoc link.

Hadoop QA added a comment - 06/Aug/08 10:41 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387506/2302-new2.patch
against trunk revision 682978.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 5 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3014/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3014/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3014/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3014/console

This message is automatically generated.


Devaraj Das added a comment - 07/Aug/08 12:09 PM
I just committed this.

Hudson added a comment - 22/Aug/08 12:34 PM

Robert Chansler added a comment - 21/Oct/08 10:35 PM
Introduced numerical key comparison for streaming. The key specification of the form -k pos1[,pos2], is supported, where,
  • pos is of the form f[.c][opts], where f is the number of the key field to use, and c is the number of the first character from the beginning of the field. Fields and character positions are numbered starting with 1; a character position of zero in pos2 indicates the field's last character. If '.c' is omitted from pos1, it defaults to 1 (the beginning of the field); if omitted from pos2, it defaults to 0 (the end of the field). opts are ordering options. The supported options are:
  • -n, (Sort numerically)
  • -r, (Reverse the result of comparison)
    Multiple key specifications can be provided to sort on multiple key fields (in order). The key fields are assumed to be separated using map.output.key.field.separator. Specifying only -n as an option will make the comparator treat the all key bytes as one field.

[Is this user guide in the streaming user documentation?]