-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12368110/2085.patch against trunk revision r588341. @author +1. The patch does not contain any @author tags. javadoc -1. The javadoc tool appears to have generated messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs -1. The patch appears to introduce 4 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1003/testReport/ This message is automatically generated.
Fixed findbugs warnings, addressed javadoc, changed Token type to accommodate Nicholas's feedback.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12368513/2085-2.patch against trunk revision r588778. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs -1. The patch appears to introduce 3 new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1012/testReport/ This message is automatically generated.
More findbugs, added an example
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12368550/2085-3.patch against trunk revision r588778. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1019/testReport/ This message is automatically generated.
The JavaDoc for TupleWritable's class description isn't right. (It wasn't updated.)
The IOExceptions that wrap other errors should carry descriptive string messages.
You don't need to define a new ReflectionUtils.newInstance without the config, because if you pass in null, it won't use it.
All of the instances that use cls.newInstance should use ReflectionUtils.newInstance instead, since it does the constructor cache and handles the non-public class/constructor problems.
The field names m and n need descriptive names.
I really don't like protected fields, especially when they are set/used multiple levels below where they are defined.
Chris - can you help me understand how the splits work? This might be useful for some of our apps, and I'm trying to understand what assumptions etc. are being made.
We have sorted data files containing the same sets of keys, but corresponding HDFS chunks of each file may not have the same set of keys. It wasn't clear to me from going through the patch how the merge-join is being parallelized. From node.getsplits, it seemed as if the ith split of the join record reader is composed of the ith split of each of the component files. But in this case the join keys wouldn't line up. Also, given that the map task works on multiple HDFS files, where does it get scheduled?
Joydeep-
The assumption it makes is precisely as you describe it: the ith split from each source must contain the same keys. It does only the most rudimentary verification of this, IIRC verifying that it received an equal number of splits from each source. Generally, getting splits should be cheap, so it doesn't verify key ranges for any of the splits (and probably ought not to). I've asked around, and "the way" out of this onerous constraint involves using MapFiles. At a high level, you need an index for your input data so your splits can be informed. I'm not familiar with the details here, I'm afraid. CompositeInputSplit::getLocations() returns an unweighted union of hosts from its child splits. It would be preferable to weight a host that contains multiple splits for a given composite split, but for now it provides a flat list.
Understood. I was thinking this might be using MapFiles or some kind of binary search to line up splits.
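To illustrate the getLocations() point above, here is a minimal sketch of the difference between a flat union of hosts and a union ordered by how many child splits each host holds. It is not the patch's code: the class name CompositeLocations, both method names, and the host names are hypothetical, and plain lists of host names stand in for the child splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in; each inner list holds the hosts of one child split.
class CompositeLocations {

    // Unweighted union: every host appears once, in first-seen order.
    static List<String> flatUnion(List<List<String>> childHosts) {
        Set<String> hosts = new LinkedHashSet<>();
        for (List<String> child : childHosts) {
            hosts.addAll(child);
        }
        return new ArrayList<>(hosts);
    }

    // Assumed alternative: hosts holding more child splits come first.
    static List<String> weighted(List<List<String>> childHosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> child : childHosts) {
            // Count each host at most once per child split.
            for (String host : new LinkedHashSet<>(child)) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        List<String> hosts = new ArrayList<>(counts.keySet());
        // List.sort is stable, so ties keep first-seen order.
        hosts.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return hosts;
    }

    public static void main(String[] args) {
        List<List<String>> children = Arrays.asList(
                Arrays.asList("h1", "h2"),
                Arrays.asList("h2", "h3"));
        System.out.println(flatUnion(children));
        System.out.println(weighted(children));
    }
}
```

In the weighted variant a host that stores several of the child splits is listed ahead of hosts that store only one, which is the preference described above.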
Dumb question - our data is laid out as files (representing partitions) within a single directory, with a directory representing a pseudo-table. Is this compatible with where you are going? I.e., can I join one such directory against another, with (say) an InputFormat that emits each file as a split (and making sure the order is the same)? The other case is that sometimes one dataset is partitioned (say) 16 ways, but another is partitioned 32 ways. This can happen when datasets are of unequal size (otherwise we end up creating too many files). In the above case, 2 files from the latter dataset have to be joined against each file from the former (assuming simple modulo arithmetic partitioning). Would this be possible?
I don't know if this is helpful, but: as it exists now, the framework is incapable of finer granularity than an InputFormat, but neither will it object to whatever you can fit into that framework.
What you describe- directories as pseudo-tables with files as partitions- sounds like exactly what this is geared toward. As an example of a workaround/partial fit, consider your 16/32 way case. Whether it would be worthwhile/possible to express in the existing code will depend on a few factors: if the two files you're joining in the 32-way set are pairwise disjoint, then you can simply use an OverrideRecordReader with two custom InputFormats (each taking one "half" of the pair) to "join" them. However, if they're not disjoint, then you'll lose values. Feeding the output of that into a join with your 16 way dataset might work, but it's a bit of a hack. You'd need to be certain of the partitions of both datasets to be confident in your results.
Updated javadocs, made better use of ReflectionUtils, improved some variable names, made protected fields private or final (excluding the parser, which is temporary).
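For the 16/32 way case discussed above, the pairing of files follows from the arithmetic if both datasets really are partitioned by a non-negative hash modulo the partition count: because 16 divides 32, (h mod 32) mod 16 equals h mod 16, so files p and p + 16 of the 32-way set hold exactly the keys of file p in the 16-way set, and a given key lands in only one of the two. A minimal sketch under that assumption (the class and method names are hypothetical, not part of the patch):

```java
import java.util.Arrays;

// Hypothetical helper; assumes partition = nonNegativeHash % partitionCount.
class PartitionPairs {

    // Lists the partitions of the finer dataset whose keys all fall into
    // the given partition of the coarser dataset.
    static int[] finePartitionsFor(int coarse, int coarseCount, int fineCount) {
        if (coarseCount <= 0 || fineCount % coarseCount != 0) {
            throw new IllegalArgumentException(
                    "fineCount must be a positive multiple of coarseCount");
        }
        if (coarse < 0 || coarse >= coarseCount) {
            throw new IllegalArgumentException("coarse partition out of range");
        }
        int factor = fineCount / coarseCount;
        int[] result = new int[factor];
        for (int i = 0; i < factor; i++) {
            result[i] = coarse + i * coarseCount;
        }
        return result;
    }

    public static void main(String[] args) {
        // Partition 3 of the 16-way set pairs with partitions 3 and 19.
        System.out.println(Arrays.toString(finePartitionsFor(3, 16, 32)));
    }
}
```

If either dataset was partitioned some other way, this pairing does not hold, which is why the partitioning of both datasets needs to be known for certain.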
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12369206/2085-4.patch against trunk revision r592860. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1077/testReport/ This message is automatically generated.
A few comments on the patch:
If mapred.join.expr is not specified, CompositeInputFormat should throw a better exception, rather than an NPE.
A simple benchmark that we can use to compare performance with the reducer-side joins is desirable. But it can be a separate JIRA.
The motivation and design that is in this JIRA should be in package.html for o.a.h.mapred.join.
Addressed NPE and included interface info from this JIRA in o/a/h/mapred/join/package.html
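The NPE comment above amounts to validating the property up front and failing with a message that names it. A minimal sketch of that check, using a plain Map as a stand-in for the job configuration; the class name, method name, and message text are assumptions for illustration and are not taken from the patch:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; a Map plays the role of the job configuration.
class JoinExprCheck {
    static final String JOIN_EXPR = "mapred.join.expr";

    // Returns the join expression, or fails with a descriptive message
    // instead of letting a null propagate into a NullPointerException.
    static String requireJoinExpr(Map<String, String> conf) throws IOException {
        String expr = conf.get(JOIN_EXPR);
        if (expr == null || expr.trim().isEmpty()) {
            throw new IOException(
                    "No join expression configured: set " + JOIN_EXPR);
        }
        return expr;
    }

    public static void main(String[] args) {
        try {
            requireJoinExpr(new HashMap<>());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```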
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12369694/2085-5.patch against trunk revision r595563. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1110/testReport/ This message is automatically generated.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12369694/2085-5.patch against trunk revision r595563. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1116/testReport/ This message is automatically generated.
I just committed this. Thanks, Chris.
Integrated in Hadoop-Nightly #326 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/326/)
Hey folks - I was thinking of an easier way to do merge joins that removes the need for any assumptions about equipartitioning. It can also work with sorted and non-sorted data sets. It's also extremely simple to implement:
If A and B are both sorted by the join column, then we are doing a pure merge-join in the reducer (maps will not sort). A and B don't need to be equipartitioned. It's of course not as efficient as the merge-join implemented here, but it's also way more flexible (metadata about table sorting columns could be maintained outside and the map jobs configured based on whether sort and join columns match).
Joydeep- excluding the optimization for not re-sorting A, it sounds like you're describing the join framework in contrib. The idea of metadata storing sorting columns, etc. is compelling, but a reasonable use of it would best be done by something like Pig, no? The most likely next step for more complex, reduce-side joins would be different map tasks for different datasets (e.g. emit result of operation w/ col 1,3 in B; identity for A, possibly in different formats sorted on whatever) followed by a join in the reduce. A sufficiently general execution engine- one that could make decisions about whether or not the data is already sorted on some column, whether the join can happen on the map or reduce side, etc- belongs in framework code, I agree, but I'm less convinced it should live in this framework code.
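For reference, the pure merge-join over two inputs that are both sorted by the join column can be sketched as below. This is an in-memory illustration only, not the code in this patch or in the contrib join framework; the class name, the {key, value} array layout, and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; each record is a {key, value} String pair.
class SortedMergeJoin {

    // Inner-joins two lists that are each sorted by key (element 0).
    static List<String[]> join(List<String[]> a, List<String[]> b) {
        List<String[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i)[0].compareTo(b.get(j)[0]);
            if (cmp < 0) {
                i++; // key present only in a so far
            } else if (cmp > 0) {
                j++; // key present only in b so far
            } else {
                String key = a.get(i)[0];
                // Find the end of the run of equal keys in b.
                int jEnd = j;
                while (jEnd < b.size() && b.get(jEnd)[0].equals(key)) {
                    jEnd++;
                }
                // Pair every a-record for this key with every b-record.
                while (i < a.size() && a.get(i)[0].equals(key)) {
                    for (int k = j; k < jEnd; k++) {
                        out.add(new String[] {key, a.get(i)[1], b.get(k)[1]});
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(
                new String[] {"k1", "a1"}, new String[] {"k2", "a2"});
        List<String[]> b = Arrays.asList(
                new String[] {"k2", "b1"}, new String[] {"k3", "b2"});
        for (String[] row : join(a, b)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each input is read once, front to back, which is what makes skipping the sort worthwhile when the data already arrives in join-key order.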
We could change the idea of a job to include map tasks across multiple datasets- similar, yet very much unlike the join work in this patch- followed by a reduce step. To take your example, that would mean starting two different maps over A and B such that the partitions are congruent (i.e. K1 in A and K1 in B go to the same partition), essentially providing different map classes for different input paths. Of course, all good ideas have JIRAs:
I tried to read your patch but still cannot fully understand it. It would be great if you could give an example (like WordCount) to show how to use the new code.