|
Would supporting the basic unix/gnu sort options in the comparator work:
-f, (Ignore-case) -n, (Sort numerically) -r, (Reverse the result of comparison) -k pos1[,pos2], where pos is of the form f[.c][opts], where f is the number of the field to use, and c is the number of the first character from the beginning of the field. Fields and character positions are numbered starting with 1; a character position of zero in pos2 indicates the field's last character. If '.c' is omitted from pos1, it defaults to 1 (the beginning of the field); if omitted from pos2, it defaults to 0 (the end of the field). opts are ordering options (any of fnr as described above). We assume that the fields in the key are separated by map.output.key.field.separator (already exists). Do we need anything else? Also, this could be done in a Java comparator implementation that the user specifies to the framework via mapred.output.key.comparator.class. This comparator would be used by both sort and merge. I think this describes the needs very well.
A reasonably well tested patch. The following is done:
1) Options supported are -n (numeric comparison) and -r (reverse the result of comparison). So for e.g., one could say "-k1.1,1.2 -k2.1,2.3n -k2.4,3.1nr" as the value of mapred.text.key.comparator.job option (that the comparator understands). 2) Some refactoring is done - I needed access to the findBytes method defined in Streaming.UTF8ByteArrayUtils. But since this comparator implementation need not be dependent on the Streaming package, I made a new class org.apache.hadoop.util.UTF8ByteArrayUtils and filled that up with the "real" bytearray util methods. A few Streaming specific methods like findTab also exists in the Streaming.UTF8ByteArrayUtils. I moved them to a new class called Streaming.StreamKeyValUtil. All in all, i introduced two new classes and deprecated Streaming.UTF8ByteArrayUtils. 3) There is a partitioner function defined that would take a hash of just the portions of the keys that the user is interested in (using the same spec as the one defined for the comparator). A note - the numCompare method in the comparator may be slightly verbose in terms of the code but that should help readability of the code. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386652/2302.1.patch against trunk revision 678845. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 4 new Findbugs warnings. -1 release audit. The applied patch generated 211 release audit warnings (more than the trunk's current 210 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2923/testReport/ This message is automatically generated. This patch fixes the hudson warnings
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386822/2302.3.patch against trunk revision 679601. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2947/testReport/ This message is automatically generated. Overall it looks fine, except I'd separate out the partitioner and the comparators i.e. split up UnixComparatorPartitioner. Also, it might be good to add a Hadoop Streaming test case too.
Fixes Arun's comment on splitting the comparator and partitioner. The streaming testcase seems redundant since i already have a testcase for the Java app, and the code path in the comparator would exactly be the same for a streaming app.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387221/2302-new.patch against trunk revision 681223. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 release audit. The applied patch generated 213 release audit warnings (more than the trunk's current 212 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2987/testReport/ This message is automatically generated. +1, we just need to fix the javac and audit warnings...
The release audit warning is fixed. The javadoc warning doesn't make sense to me.
trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/UTF8ByteArrayUtils.java:210: warning - Tag @link: can't find readLine(LineReader, Text) in org.apache.hadoop.streaming.StreamKeyValUtil The method org.apache.hadoop.streaming.StreamKeyValUtil.readLine with the proper javadoc does exist. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387255/2302-new.patch against trunk revision 681243. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2993/testReport/ This message is automatically generated. Sorry, had attached the incorrect patch earlier.
Submitting to hudson to get the tests through. javadoc is bound to fail for the reason mentioned earlier (pls let me know if anyone has an workaround for this or if i am missing something).
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387272/2302-new1.patch against trunk revision 681243. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2998/testReport/ This message is automatically generated. Thanks to Sharad for helping me debug the javadoc issue. The issue was that LineReader is an inner class of LineRecordReader and I had to qualify LineReader as LineRecordReader.LineReader in the javadoc link.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387506/2302-new2.patch against trunk revision 682978. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3014/testReport/ This message is automatically generated. Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
Introduced numerical key comparison for streaming. The key specification of the form -k pos1[,pos2], is supported, where,
[Is this user guide in the streaming user documentation?] |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
"Streaming should provide more powerful specification primary keys comparator"
(meaning the comparator used for splits).
It should allow at least
– numeric comparison
– reverse order comparison
– multiple field comparison
One possible way to specify the comparison in the streaming command line is to use the familiar syntax of Unix sort command, like
"-k2,2rn -k1,1"
for "compare the second field, numerically, large value first; if equal, compare the first field, alphabetically"
Note that this specification implicitly defines the part of the string that is the key for shuffling purposes