Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      MapFileOutputFormat produces output data that is sorted locally in each part-NNNNN file - however, there is no easy way to iterate over keys from all parts in a globally ascending order.

      1. map-file-v4.patch
        6 kB
        Andrzej Bialecki
      2. map-file-v3.patch
        7 kB
        Andrzej Bialecki
      3. map-file-v2.patch
        7 kB
        Andrzej Bialecki

        Activity

        Hide
        Andrzej Bialecki added a comment -

        This patch adds a method:

        public Iterator<Entry<WritableComparable, Writable>> getIterator(Path path, Configuration conf)

        which you can use to iterate over key/value pairs in a globally ascending key order.

        Show
        Andrzej Bialecki added a comment - This patch adds a method: public Iterator<Entry<WritableComparable, Writable>> getIterator(Path path, Configuration conf) which you can use to iterate over key/value pairs in a globally ascending key order.
        Hide
        Doug Cutting added a comment -

        Instead of creating a new ByteArrayInputStream and DataInputStream per key and value, why not re-use a single DataInputBuffer, cached as a field in the iterator?

        Show
        Doug Cutting added a comment - Instead of creating a new ByteArrayInputStream and DataInputStream per key and value, why not re-use a single DataInputBuffer, cached as a field in the iterator?
        Hide
        Andrzej Bialecki added a comment -

        Avoid creating streams.

        Show
        Andrzej Bialecki added a comment - Avoid creating streams.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12375624/map-file.patch
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included -1. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12375624/map-file.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included -1. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 1 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1801/console This message is automatically generated.
        Hide
        Doug Cutting added a comment -

        One more nit: the isFile() call in MapFilePathFilter is expensive. Is it really required? Isn't checking that the name is "part-" sufficient?

        Show
        Doug Cutting added a comment - One more nit: the isFile() call in MapFilePathFilter is expensive. Is it really required? Isn't checking that the name is "part-" sufficient?
        Hide
        Andrzej Bialecki added a comment -

        We want to be sure that we are dealing with a bunch of MapFile-s, which means the parts should be directories. A common mistake will be to attempt to process a SequenceFileOutputFormat data. Is there any less expensive method to check this?

        Show
        Andrzej Bialecki added a comment - We want to be sure that we are dealing with a bunch of MapFile-s, which means the parts should be directories. A common mistake will be to attempt to process a SequenceFileOutputFormat data. Is there any less expensive method to check this?
        Hide
        Doug Cutting added a comment -

        > Is there any less expensive method to check this?

        Yes. When looping over the results of listStatus(), check whether each is a directory or not.

        We should probably add a StatusFilter interface, since that's more general...

        Show
        Doug Cutting added a comment - > Is there any less expensive method to check this? Yes. When looping over the results of listStatus(), check whether each is a directory or not. We should probably add a StatusFilter interface, since that's more general...
        Hide
        Andrzej Bialecki added a comment -

        Updated patch - I removed the PathFilter, it wasn't worth the hassle.

        Show
        Andrzej Bialecki added a comment - Updated patch - I removed the PathFilter, it wasn't worth the hassle.
        Hide
        Owen O'Malley added a comment -

        Wouldn't it be easier to just use SequenceFile.Sorter.merge between the part-* files? It gives you a RawKeyValueIterator, but it would be easy to wrap it and make an object iterator.

        Show
        Owen O'Malley added a comment - Wouldn't it be easier to just use SequenceFile.Sorter.merge between the part-* files? It gives you a RawKeyValueIterator, but it would be easy to wrap it and make an object iterator.
        Hide
        Andrzej Bialecki added a comment -

        I believe that's exactly what the patch does The "easy" part that you refer to amounts to the bulk of the patch.

        Show
        Andrzej Bialecki added a comment - I believe that's exactly what the patch does The "easy" part that you refer to amounts to the bulk of the patch.
        Hide
        Andrzej Bialecki added a comment -

        Check this with Hudson.

        Show
        Andrzej Bialecki added a comment - Check this with Hudson.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12375644/map-file-v3.patch
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included -1. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests -1. The patch failed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12375644/map-file-v3.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included -1. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1853/console This message is automatically generated.
        Hide
        Enis Soztutar added a comment -

        We do not offer an Iterator for MapFiles, but use MapFile.Reader#next(). Wouldn't it be better if we (a) add Iterator to MapFile.Reader and apply this patch or (b) change this patch to define MapFileOutputFormat.Reader instead of Iterators, so that reading from MapFile and MapFileOutputFormat is consistent.

        one more minor issue : I think we should change the generics to :

        private static final class IteratorEntry<K extends WritableComparable, V extends Writable> implements Entry<K, V> {
        ...
        }
        
        private static final class MapFileOutputFormatIterator<K extends WritableComparable, V extends Writable> implements Iterator<Entry<K, V>> {
        ...
        }
        
        public static<K extends WritableComparable, V extends Writable> Iterator<Entry<K, V>>
              getIterator(Path dir, Configuration conf) throws IOException {
        ...
        }
        

        so that we can use :

        Iterator<Entry<Text, Text>> x = MapFileOutputFormat.getIterator(path, conf);
        
        Show
        Enis Soztutar added a comment - We do not offer an Iterator for MapFiles, but use MapFile.Reader#next(). Wouldn't it be better if we (a) add Iterator to MapFile.Reader and apply this patch or (b) change this patch to define MapFileOutputFormat.Reader instead of Iterators, so that reading from MapFile and MapFileOutputFormat is consistent. one more minor issue : I think we should change the generics to : private static final class IteratorEntry<K extends WritableComparable, V extends Writable> implements Entry<K, V> { ... } private static final class MapFileOutputFormatIterator<K extends WritableComparable, V extends Writable> implements Iterator<Entry<K, V>> { ... } public static <K extends WritableComparable, V extends Writable> Iterator<Entry<K, V>> getIterator(Path dir, Configuration conf) throws IOException { ... } so that we can use : Iterator<Entry<Text, Text>> x = MapFileOutputFormat.getIterator(path, conf);
        Hide
        Doug Cutting added a comment -

        Would this return a new Entry per key/value pair? Would these each contain new key/value instances? The existing API encourages object reuse, which can provide substantial performance improvements. Do we want to encourage developers to develop suboptimal code?

        Show
        Doug Cutting added a comment - Would this return a new Entry per key/value pair? Would these each contain new key/value instances? The existing API encourages object reuse, which can provide substantial performance improvements. Do we want to encourage developers to develop suboptimal code?
        Hide
        Andrzej Bialecki added a comment -

        The implementation in the latest patch doesn't create new Entries, it reuses a single instance. Though I agree that considering your doubt, and the already existing MapFile.Reader API, it would be more intuitive to create a MapFileOutputFormat.Reader, as Enis suggested.

        I'll prepare an updated patch.

        Show
        Andrzej Bialecki added a comment - The implementation in the latest patch doesn't create new Entries, it reuses a single instance. Though I agree that considering your doubt, and the already existing MapFile.Reader API, it would be more intuitive to create a MapFileOutputFormat.Reader, as Enis suggested. I'll prepare an updated patch.
        Hide
        Devaraj Das added a comment -

        Cancelling the patch since the updated patch hasn't been submitted yet.

        Show
        Devaraj Das added a comment - Cancelling the patch since the updated patch hasn't been submitted yet.
        Hide
        Andrzej Bialecki added a comment -

        I'm working on an updated patch, and I just realized that we can only iterate over entries (using RawKeyValueIterator), and we can't support some other methods in the Reader abstraction, methods that are available in other readers such as seek ... We can only implement next(), get(), reset() and close(). If people feel that this still falls under the Reader rather than Iterator I'll complete the patch that implements it.

        Show
        Andrzej Bialecki added a comment - I'm working on an updated patch, and I just realized that we can only iterate over entries (using RawKeyValueIterator), and we can't support some other methods in the Reader abstraction, methods that are available in other readers such as seek ... We can only implement next(), get(), reset() and close(). If people feel that this still falls under the Reader rather than Iterator I'll complete the patch that implements it.
        Hide
        Enis Soztutar added a comment -

        I think we can keep the iterator functionality but, name the class Reader. The Reader might not have seek, etc, but iterating the elements reusing Writables will be more intuitive in :

          MapFileOutputFormat.Reader  reader;
          K key; 
          V value;
          ...
          while(reader.next(key, value)) { 
          //use key, value
          }
        

        than :

        Iterator<Entry<K, V>> it = MapFileOutputFormat.getIterator(path, conf);
        while(it.hasNext()) {
          Entry<K, V> entry = it.next();
          // entry.getKey(); entry.getValue():
        }
        
        Show
        Enis Soztutar added a comment - I think we can keep the iterator functionality but, name the class Reader. The Reader might not have seek, etc, but iterating the elements reusing Writables will be more intuitive in : MapFileOutputFormat.Reader reader; K key; V value; ... while (reader.next(key, value)) { //use key, value } than : Iterator<Entry<K, V>> it = MapFileOutputFormat.getIterator(path, conf); while (it.hasNext()) { Entry<K, V> entry = it.next(); // entry.getKey(); entry.getValue(): }
        Hide
        Andrzej Bialecki added a comment -

        Patch changed to implement a Reader. JUnit test case added.

        Show
        Andrzej Bialecki added a comment - Patch changed to implement a Reader. JUnit test case added.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379878/map-file-v4.patch
        against trunk revision 645773.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 4 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 500 javac compiler warnings (more than the trunk's current 498 warnings).

        release audit -1. The applied patch generated 203 release audit warnings (more than the trunk's current 202 warnings).

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/artifact/trunk/build/test/checkstyle-errors.html
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/artifact/trunk/current/releaseAuditDiffWarnings.txt
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12379878/map-file-v4.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 4 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 500 javac compiler warnings (more than the trunk's current 498 warnings). release audit -1. The applied patch generated 203 release audit warnings (more than the trunk's current 202 warnings). findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/artifact/trunk/build/test/checkstyle-errors.html Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/artifact/trunk/current/releaseAuditDiffWarnings.txt Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2200/console This message is automatically generated.
        Hide
        Owen O'Malley added a comment -

        I'm sorry, but you've got a few accidental white space diffs.

        I also think it would be a very good idea if your Reader.next validated that the types of the key and value that were passed in matched the types in the sequence file. Otherwise, the user can accidentally cross the streams and get random data corruption and run time errors.

        Show
        Owen O'Malley added a comment - I'm sorry, but you've got a few accidental white space diffs. I also think it would be a very good idea if your Reader.next validated that the types of the key and value that were passed in matched the types in the sequence file. Otherwise, the user can accidentally cross the streams and get random data corruption and run time errors.

          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Andrzej Bialecki
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development