|
The whole point is that I would like to understand how Reduce job can output a file without any key values in it. The NullWritable seemed to be an ideal candidate for this but unfortunately I ran into exceptions when trying it. So I made a quick and dirty fix which is not meant to be a production ready (obviously NullWritable should not be special-cased in any way!).
On the other hand there seemed to be some questions which need to be asked and possible addressed. One of them is that ReflectionUtils is able to call any constructor after setAccessible is set to true but is this what we really want for singleton keys? And do we really need singleton keys at all? (I believe the answer is positive). How about size (length) of key value? Is it allowed to be zero? And why WritableComparato calls to newInstance method while this causes issues with any class having non-public constructor? These were the questions I was trying to acknowledge while learning Hadoop and trying to experiment with some simple jobs having no Reduce output key...
I'm sorry, I hadn't understood this. If you only want to output null keys from your reduce, then the RecordWriter used by your OutputFormat can encode or ignore null keys (e.g. TextOutputFormat). SequenceFiles, as you discovered, explicitly disallow zero-length keys, so you'll need to pick a different binary file format to store output records. Glancing at the code, this constraint is inconsistently enforced, and not for any particular reason that I can discern. Adapting SequenceFile to handle zero-length keys might be as simple as allowing zero-length keys from the Writers, since the Reader looks like it could handle it.
There's already a fair amount of object reuse. We need an object to deserialize into per the Writable contract, so a registration system like the one in WritableComparator would be necessary in ReflectionUtils to make singletons work (i.e. a map of classes to instances checked before the map of classes to constructors). Other than NullWritable, all of the sane use cases I can think of are just badly designed, but there are likely good ones.
It depends on where in the framework you're looking. The OutputFormat defines how to encode/handle null/NullWritable keys from the reduce (or the map if you're running without reduces). In 0.17, intermediate data is stored in SequenceFiles, so zero-length keys can't be emitted from the map. In 0.18, zero-length keys are supported, but their semantics are kind of odd. In most cases, emitting NullWritable keys from the map is not a scalable design.
Most WritableComparable types use RawComparator, which provides much better performance while rendering this consideration irrelevant. Unfortunately, WritableComparator creates new instances of its internal keys whether it requires them or not! This is easily remedied. This patch does the following:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385009/3665-0.patch against trunk revision 672976. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2775/testReport/ This message is automatically generated. Corrected WritableComparator creation in SequenceFile::Sorter
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385023/3665-1.patch against trunk revision 672976. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2776/testReport/ This message is automatically generated. I'm ok with this patch, though making WritableComparator a Configurable might be a stretch... given that we don't have and Configurable Writables. But I'm happy to bow down to the wisdom of the crowd.
I think Arun is right; the approach in 3665-1 wouldn't have accomplished what it set out to do, anyway. Attached a patch removing the Configurable stuff.
I tested 3665-2 with my specific use case (NullWritable in reduced key) and it is working fine! Thanks guys.
BTW: Shouldn't this issue be renamed to something more descriptive?
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385142/3665-2.patch against trunk revision 673445. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2784/testReport/ This message is automatically generated. Looks good. I think we should just go ahead and mark this as an incompatible change for the fact that the key1/key2 fields aren't constructed for the protected WritableComparator constructor, a break from the past.
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Still there are few questions/notes opened I think:
1) There is no clear way how to learn whether particular WritableComparator can be instantiated or not. There should be some new interface introduced for singletons.
2) It is not clear if key value can be of zero length. For NullWritable it make sense but it has nothing to do with the fact that it is a singleton I think.
3) I found other two classes implementing WritableComparable having private or protected constructor. These are ID and TaskID. Since they are both WritableComparable they may be used as a key (I can not tell if there is any point in such use case but technically it should be possible). This patch DOES NOT fixes such use case in any way.
Giving these notes together I can see the following conclusion:
Going forward it should be possible to learn if WritableComparable:
a) can be used as a key
b) can have zero length
c) can be instantiated using constructor and if not then what is the way to get the instance.