Hadoop Common / HADOOP-5539

o.a.h.mapred.Merger not maintaining map output compression on intermediate files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.19.1
    • Fix Version/s: 0.20.1
    • Component/s: None
    • Labels: None
    • Environment:

      0.19.2-dev, r753365

    • Hadoop Flags: Reviewed

      Description

      hadoop-site.xml :
      mapred.compress.map.output = true
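
      For reference, a minimal sketch of the corresponding hadoop-site.xml
      entry, written out in the standard Hadoop property form (the property
      name is the one quoted above):

        <property>
          <name>mapred.compress.map.output</name>
          <value>true</value>
        </property>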

      Map output files are compressed, but when the in-memory merger closes
      on the reduce side, the on-disk merger runs to bring the number of
      input files down to <= io.sort.factor if needed.

      When this happens it outputs files named intermediate.x, and these do
      not maintain the compression setting. The writer
      (o.a.h.mapred.Merger.class, line 432) is passed the codec, but I added
      some logging and it is always null, whether map output compression is
      set true or false.

      This causes tasks to fail if they cannot hold the uncompressed size of
      the data the reduce is holding. I think this is just an oversight of
      the codec not getting set correctly for the on-disk merges.

      2009-03-20 01:30:30,005 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 3000
      2009-03-20 01:30:30,005 INFO org.apache.hadoop.mapred.Merger: intermediate.1 used codec: null
      

      I added the following:

        // added by me
        if (codec != null) {
          LOG.info("intermediate." + passNo + " used codec: " + codec.toString());
        } else {
          LOG.info("intermediate." + passNo + " used codec: null");
        }
        // end added by me
      

      This goes just before the creation of the writer
      (o.a.h.mapred.Merger.class, line 432), and it outputs the second log
      line above.

      I have confirmed this with the logging, and I have looked at the files on the disk of the tasktracker. I can read the data in
      the intermediate files in the clear, telling me they are not compressed, but I cannot read the map.out files direct from the map output,
      telling me the compression is working on the map end but not on the on-disk merge that produces the intermediate files.

      I can see no benefit to these files not maintaining the compression setting, and it looks as if they were intended to maintain it.
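
      A sketch of what the fix looks like at the writer-creation site in
      Merger. The variable names and the Writer constructor here follow the
      0.19-era IFile API only approximately; treat this as illustrative, not
      as the committed change:

        // Sketch: construct the intermediate-merge writer with the codec so
        // that intermediate.x files keep the map-output compression setting.
        // Today the codec reaching this point is always null; the fix is to
        // thread the real codec in from ReduceTask and MapTask.
        Writer<K, V> writer =
            new Writer<K, V>(conf, fs, outputFile, keyClass, valueClass, codec);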

      Attachments

      1. 5539.patch
        5 kB
        Billy Pearson
      2. hadoop-5539.patch
        6 kB
        Jothi Padmanabhan
      3. hadoop-5539-v1.patch
        5 kB
        Jothi Padmanabhan
      4. hadoop-5539-branch20.patch
        4 kB
        Jothi Padmanabhan

        Activity

        Stefan Will added a comment -

        I'd like to see this fixed as well since one reason I've enabled map output compression is to reduce disk space usage by the mapreduce framework. It appears that currently the map outputs are simply decompressed as soon as they have been downloaded by the reducer.

        Billy Pearson added a comment -

        I verified that it is doing the same on the map task: no intermediate.x files from o.a.h.mapred.Merger are getting compressed.

        Billy Pearson added a comment - edited

        This should fix the problem. I had to make a few new constructors; I left the old constructors these files were using because I am not sure whether any other tasks use them. This patch will apply to the 0.19 branch. I have not worked at all on trunk, so you might need to try a dry run before applying it to trunk. Tested on my end, and it is working correctly now with this patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12403378/5539.patch
        against trunk revision 756858.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/117/console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12403378/5539.patch
        against trunk revision 756858.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/118/console

        This message is automatically generated.

        Billy Pearson added a comment -

        added back to road map

        Billy Pearson added a comment - edited

        Someone can use my patch as a starting point.

        The ReduceTask.java call that is the problem is at line 2145.
        The MapTask.java call that is the problem is at line 1269.

        I use streaming without a combiner, so that should be looked at also to see whether it uses o.a.h.mapred.Merger.

        The basic problem is that the codec is not passed from these functions to the merger, so it is always null. The call to
        o.a.h.mapred.Merger should include the codec somehow; if compression is not used, then the codec is null,
        in both ReduceTask and MapTask.

        I think this is a major bug that affects the disk bandwidth of all MR jobs that use compression.
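
        A sketch of the shape of the fix at those call sites, assuming the
        surrounding task's fields (job, fs, key/value classes, codec, and so
        on) are in scope. The merge() parameter list here is approximate; the
        0.19-era Merger has several overloads and the committed patch may
        differ:

          // Before (sketch): the codec never reaches the merger, so the
          // intermediate merge outputs are written uncompressed.
          RawKeyValueIterator iter =
              Merger.merge(job, fs, keyClass, valueClass,
                           segments, ioSortFactor, tmpDir,
                           comparator, reporter);

          // After (sketch): thread the map-output codec through so the
          // intermediate.x files keep the compression setting. The codec is
          // null when compression is off, which preserves the old behavior.
          RawKeyValueIterator iter =
              Merger.merge(job, fs, keyClass, valueClass, codec,
                           segments, ioSortFactor, tmpDir,
                           comparator, reporter);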

        Chris Douglas added a comment -

        added back to road map

        It's a blocker; it will be resolved and backported to 0.20 at least. The road map isn't what determines that; the PA (Patch Available) queue defines the set of patches that can be committed. The fix version is usually set when the issue is actually resolved, so that where it was committed is documented.

        Chris Douglas added a comment -

        Oh, I see; the patch is for 0.19. My mistake.

        Jothi Padmanabhan added a comment -

        The patch looks good. A few minor points:

        1. The new MergeQueue constructor could call the existing constructor and then set the codec later:

             public MergeQueue(Configuration conf, FileSystem fs,
                               List<Segment<K, V>> segments, RawComparator<K> comparator,
                               Progressable reporter, boolean sortSegments,
                               CompressionCodec codec) {
               this(conf, fs, segments, comparator, reporter, sortSegments);
               this.codec = codec;
             }

        2. For the new merge methods, should we place the codec argument after the valueClass argument (instead of its being the last argument), to maintain consistency with the other merge method that does take a codec argument?

        Would you be able to provide patches for trunk and 20-branch as well?

        Billy Pearson added a comment -

        I have got too many things going on right now to make a new patch. Feel free to mod my patch to work the way you want and use it to build a patch for trunk. I would like to see this fixed in 0.20.1 if at all possible; this will be the one thing holding me up from upgrading to HBase 0.20 when it becomes ready.

        Jothi Padmanabhan added a comment -

        Updated the patch to trunk

        Jothi Padmanabhan added a comment -

        Could somebody review this patch? Thanks.

        Ravi Gummadi added a comment -

        Patch looks good.

        This patch clashes with HADOOP-5572 and will need to be updated once HADOOP-5572 gets committed.

        HADOOP-5572 changes mergeParts() to call merge() with a boolean sortSegments; this avoids one new signature of merge() from your patch.
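
        A hypothetical sketch of the consolidated overload this implies, with
        both the sortSegments flag from HADOOP-5572 and the codec from this
        patch on one signature; parameter names and order are guesses, not
        the committed API:

          public static <K, V> RawKeyValueIterator merge(
              Configuration conf, FileSystem fs,
              Class<K> keyClass, Class<V> valueClass, CompressionCodec codec,
              List<Segment<K, V>> segments, int mergeFactor, Path tmpDir,
              RawComparator<K> comparator, Progressable reporter,
              boolean sortSegments) throws IOException {
            // merge implementation elided
          }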

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12408232/hadoop-5539.patch
        against trunk revision 776352.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -

        Patch updated to trunk

        Ravi Gummadi added a comment -

        Patch looks good.
        +1

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12408652/hadoop-5539-v1.patch
        against trunk revision 777761.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -

        Patch for the 20 branch

        Devaraj Das added a comment -

        I just committed this. Thanks Jothi and Billy!

        Nigel Daley added a comment -

        Why no unit test? Why no javadoc for new methods?

        If you tested this manually, what steps did you perform?

        Billy Pearson added a comment -

        no commit for 0.19 branch?

        Jothi Padmanabhan added a comment -

        Why no unit test? If you tested this manually, what steps did you perform?

        It is pretty difficult to write a unit test for this patch, as it just enables compression during intermediate merges. The files that are created during the intermediate merges are consumed soon after they are created, and the final merged file was compressed even without this patch. I did the same test as Billy had done: added print statements in the framework code (Merger.java) to verify that compression was turned on during intermediate merges.

        Why no javadoc for new methods?

        The newly added methods are in Merger, which is a mapred package-private class.

        no commit for 0.19 branch?

        Billy, from this comment https://issues.apache.org/jira/browse/HADOOP-5539?focusedCommentId=12708570&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12708570, we thought you needed this only for 0.20. If you need it for 0.19 branch as well, I can generate a patch for that too.

        Billy Pearson added a comment -

        No, I do not need it; my version is patched with my original patch for 0.19. But others might, since there are still a lot of older versions in production that will update to the 0.19 branch now that it has a few minor releases on it.

        Hudson added a comment -

        Integrated in Hadoop-trunk #863 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/863/ )

          People

          • Assignee:
            Jothi Padmanabhan
          • Reporter:
            Billy Pearson
          • Votes:
            1
          • Watchers:
            7
