Hadoop Common
HADOOP-1851

Map output compression codec cannot be set independently of job output compression codec

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.1
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

      Description

      The property "mapred.output.compression.codec" is used when setting and getting the map output compression codec in JobConf, thus making it impossible to use a different codec for map outputs and overall job outputs.
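      The conflict can be seen concretely in a minimal, self-contained sketch. A plain map stands in for JobConf and codecs are plain strings; the class and method names mirror JobConf's for readability, but this is illustrative code, not Hadoop's:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the bug: both setters write the SAME property
// key, as in Hadoop 0.14.1's JobConf, so whichever codec is set last
// silently wins for both map outputs and job outputs.
class SharedConfModel {
    private final Map<String, String> props = new HashMap<>();

    void setMapOutputCompressorClass(String codec) {
        props.put("mapred.output.compression.codec", codec); // shared key!
    }

    void setOutputCompressorClass(String codec) {
        props.put("mapred.output.compression.codec", codec); // same shared key
    }

    String getMapOutputCompressorClass() {
        return props.get("mapred.output.compression.codec");
    }

    String getOutputCompressorClass() {
        return props.get("mapred.output.compression.codec");
    }
}

public class SharedKeyDemo {
    public static void main(String[] args) {
        SharedConfModel conf = new SharedConfModel();
        conf.setMapOutputCompressorClass("LzoCodec");  // meant for map outputs only
        conf.setOutputCompressorClass("GzipCodec");    // meant for job outputs only
        // The map-output setting has been clobbered by the job-output one:
        System.out.println(conf.getMapOutputCompressorClass()); // GzipCodec
        System.out.println(conf.getOutputCompressorClass());    // GzipCodec
    }
}
```

      Because both accessors read and write one key, there is no order of calls that yields different codecs for the two kinds of output.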

      1. HADOOP-1851_3_20071002.patch
        17 kB
        Arun C Murthy
      2. HADOOP-1851_2_20070929.patch
        18 kB
        Arun C Murthy
      3. HADOOP-1851_1_20070929.patch
        19 kB
        Arun C Murthy

          Activity

          Riccardo Boscolo added a comment - edited

          The methods that affect this behavior are at lines 341-371 of JobConf in Hadoop 0.14.1:

          /**
           * Set the given class as the compression codec for the map outputs.
           * @param codecClass the CompressionCodec class that will compress the
           *                   map outputs
           */
          public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass) {
            setCompressMapOutput(true);
            setClass("mapred.output.compression.codec", codecClass, CompressionCodec.class);
          }

          /**
           * Get the codec for compressing the map outputs
           * @param defaultValue the value to return if it is not set
           * @return the CompressionCodec class that should be used to compress the
           *         map outputs
           * @throws IllegalArgumentException if the class was specified, but not found
           */
          public Class<? extends CompressionCodec> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue) {
            String name = get("mapred.output.compression.codec");
            if (name == null) {
              return defaultValue;
            } else {
              try {
                return getClassByName(name).asSubclass(CompressionCodec.class);
              } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Compression codec " + name + " was not found.", e);
              }
            }
          }

          This could be easily fixed by using a different property, for example, "map.output.compression.codec".
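          That suggestion can be sketched against a simplified model of the configuration (a plain map standing in for JobConf, with the property name "map.output.compression.codec" taken from the suggestion above; this is illustrative code, not the committed patch):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the proposed fix: give map outputs their own
// property key, distinct from the job-output key, so the two codecs can
// be configured independently.
class FixedConfModel {
    private final Map<String, String> props = new HashMap<>();

    void setMapOutputCompressorClass(String codec) {
        props.put("map.output.compression.codec", codec); // map-specific key
    }

    void setOutputCompressorClass(String codec) {
        props.put("mapred.output.compression.codec", codec); // job-output key
    }

    String getMapOutputCompressorClass(String defaultValue) {
        String name = props.get("map.output.compression.codec");
        return (name == null) ? defaultValue : name; // mirrors the default-value getter
    }

    String getOutputCompressorClass() {
        return props.get("mapred.output.compression.codec");
    }
}

public class SplitKeyDemo {
    public static void main(String[] args) {
        FixedConfModel conf = new FixedConfModel();
        conf.setMapOutputCompressorClass("DefaultCodec"); // map outputs
        conf.setOutputCompressorClass("LzoCodec");        // job outputs
        System.out.println(conf.getMapOutputCompressorClass(null)); // DefaultCodec
        System.out.println(conf.getOutputCompressorClass());        // LzoCodec
    }
}
```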

          Arun C Murthy added a comment -

          As Riccardo says, we clearly need another configuration property... how about:

          Map outputs:
          mapred.map.output.compression.{type|codec}
          JobConf.{get|set}MapOutputCompression{Type|Codec}

          Job outputs:
          mapred.output.compression.{type|codec}
          JobConf.{get|set}OutputCompression{Type|Codec}

          That's the easy bit; however, I think this will snowball into a large-ish patch, since it entails hunting down all OutputFormat implementations and ensuring that they use the mapred.output.compression.{type|codec} properties... clearly something we have to fix. Of course, I'll ensure this is clearly documented for folks who write their own OutputFormats.

          Arun C Murthy added a comment - edited

          Umm... clearly (as per the new way) {set|get}CompressorClass belongs in OutputFormatBase; does that mean I should put {set|get}CompressionType in SequenceFileOutputFormat? Yep, it makes sense only for SequenceFiles... equivalently, I could move both sets of APIs, {set|get}CompressorClass and {set|get}CompressionType, to JobConf.

          Thoughts? Doug?

          Arun C Murthy added a comment -

          I also propose we deprecate the OutputFormatBase.{get|set}CompressOutput and JobConf.{get|set}CompressMapOutput APIs; with reasonable defaults for compression type (NONE) and compression codec (null), things should work just fine.

          Doug Cutting added a comment -

          Yes, OutputFormat-specific parameters should not be accessed through JobConf, but rather through static methods on the appropriate OutputFormat class. So, yes, we should deprecate existing OutputFormat-specific methods of JobConf.

          Arun C Murthy added a comment -

          Here is an early patch for review while I continue testing...

          I'd really appreciate feedback, since this introduces a fairly large number of changes: it adds new config knobs, deprecates methods, adds new ones, etc.

          Arun C Murthy added a comment -

          Another go at this one...

          With this, one can do:

          $ hadoop jar build/hadoop-0.15.0-dev-examples.jar randomwriter -Dmapred.output.compress=true -Dmapred.output.compression.type=BLOCK -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec /rw/in
          
          hadoop jar build/hadoop-0.15.0-dev-examples.jar sort -Dmapred.output.compress=true -Dmapred.output.compression.type=BLOCK -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec -Dmapred.map.output.compress=true -Dmapred.map.output.compression.type=RECORD -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec /rw/in /rw/out
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12366819/HADOOP-1851_2_20070929.patch
          against trunk revision r580487.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/853/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/853/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/853/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/853/console

          This message is automatically generated.

          Arun C Murthy added a comment -
          org.apache.hadoop.mapred.TestMiniMRWithDFS.testWithDFS
          Failing for the past 1 build (since Failed#853)
          
          java.net.SocketTimeoutException: Read timed out
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readShort(DataInputStream.java:284)
          	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1641)
          	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1714)
          	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
          	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
          	at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:774)
          	at org.apache.hadoop.mapred.PiEstimator.launch(PiEstimator.java:170)
          	at org.apache.hadoop.mapred.TestMiniMRWithDFS.testWithDFS(TestMiniMRWithDFS.java:170)
          

          Seems unrelated to this patch, re-submitting.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12366819/HADOOP-1851_2_20070929.patch
          against trunk revision r580811.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/856/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/856/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/856/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/856/console

          This message is automatically generated.

          Doug Cutting added a comment -

          Renaming mapred.compress.map.output is not back-compatible. Perhaps we should continue to use this if it is set for back-compatibility? Or else don't rename.

          It also appears that all of the previous public constructors for MapFile are no longer present. If there are public constructors that are no longer used, these should be deprecated, not simply removed. Later we can remove the deprecated methods.

          Arun C Murthy added a comment -

          I've reverted the config parameter, and, as before, only added constructors to MapFile, no deletions.

          Also, I've run TestMiniMRWithDFS multiple times (on multiple machines) - no issues.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12366938/HADOOP-1851_3_20071002.patch
          against trunk revision r581319.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/867/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/867/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/867/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/867/console

          This message is automatically generated.

          Doug Cutting added a comment -

          I just committed this. Thanks, Arun!

          Hudson added a comment -

          Integrated in Hadoop-Nightly #259 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/259/ )
          Hudson added a comment -

          Integrated in Hadoop-Nightly #312 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/312/ )

            People

            • Assignee: Arun C Murthy
            • Reporter: Riccardo Boscolo
            • Votes: 0
            • Watchers: 1

