Hadoop Common
HADOOP-474

support compressed text files as input and output

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels:
      None

      Description

      I'd like TextInputFormat and TextOutputFormat to automatically compress and uncompress text files when they are read and written. Furthermore, I'd like to be able to use custom compressors as defined in HADOOP-441. Therefore, I propose:

      Adding a map of compression codecs in the server config files:

      io.compression.codecs = "<suffix>=<codec class>,..."

      so the default would be something like:

      <property>
      <name>io.compression.codecs</name>
      <value>.gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec</value>
      <description>A list of file suffixes and the codecs for them.</description>
      </property>

      Note that the suffix can include multiple "."s, so you could support suffixes like ".tar.gz"; they are just treated as literals matched against the end of the filename.
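The suffix-to-codec mapping described above can be sketched as plain string handling. This is a minimal illustration of the proposed behavior, not Hadoop's actual API; the class name SuffixCodecMap and its methods are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SuffixCodecMap {
    // Parse an "io.compression.codecs"-style value of the form
    // "<suffix>=<codec class>,..." into an insertion-ordered map.
    static Map<String, String> parse(String conf) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String entry : conf.split(",")) {
            String[] kv = entry.split("=", 2);
            map.put(kv[0].trim(), kv[1].trim());
        }
        return map;
    }

    // Match a filename against the suffixes as plain literals, in the
    // order they were listed. (A real implementation might prefer the
    // longest matching suffix, so ".tar.gz" wins over ".gz".)
    static String codecFor(String filename, Map<String, String> map) {
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no suffix matched: treat the file as uncompressed
    }
}
```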

      If the TextInputFormat is dealing with such a file, it:
      1. makes a single split
      2. decompresses automatically

      On the output side, if mapred.output.compress is true, then TextOutputFormat would use a new property mapred.output.compression.codec that would define the codec to use to compress the outputs, defaulting to gzip.
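The output-side decision described above can be sketched with plain java.util.Properties standing in for the Hadoop Configuration. This is a hedged illustration of the proposed logic only; OutputCodecChooser is a hypothetical name, and the default codec class is the one proposed in this issue:

```java
import java.util.Properties;

public class OutputCodecChooser {
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.GZipCodec";

    // Returns the codec class name the output format should use,
    // or null when mapred.output.compress is unset or false.
    static String outputCodec(Properties conf) {
        boolean compress = Boolean.parseBoolean(
                conf.getProperty("mapred.output.compress", "false"));
        if (!compress) {
            return null; // compression disabled: write plain text
        }
        // mapred.output.compression.codec selects the codec, gzip by default.
        return conf.getProperty("mapred.output.compression.codec", DEFAULT_CODEC);
    }
}
```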

      1. text-gz.patch
        41 kB
        Owen O'Malley
      2. text-gz-2.patch
        54 kB
        Owen O'Malley
      3. text-gz-3.patch
        59 kB
        Owen O'Malley

        Issue Links

          Activity

          Owen O'Malley created issue -
          Doug Cutting added a comment -

          +1, with a few questions:

          Do we need io.compression.codecs if the codecs provide extensions? If so, then what's the point of having the codecs provide extensions?

          Should the output property be instead named mapred.text.output.compression.codec? In other words, will we use this same property to name the codecs for other output formats, or is this property for text files alone (in which case it should have 'text' in its name)?

          Owen O'Malley added a comment -

          > Do we need io.compression.codecs if the codecs provide extensions? If so, then what's the
          > point of having the codecs provide extensions?

          Good point. I shouldn't encode the information twice. We do want the io.compression.codecs so that it is easy to extend the list of potential codecs. I could either make io.compression.codecs a straight list of codec classes or remove the getDefaultExtension method. Thoughts?

          > Should the output property be instead named mapred.text.output.compression.codec?
          > In other words, will we use this same property to name the codecs for other output formats, or is
          > this property for text files alone (in which case it should have 'text' in its name)?

          I would think that the SequenceFileOutputFormat (under HADOOP-441) should use the same property. At least that was my thought.

          Owen O'Malley added a comment -

          the two issues share a common section of code dealing with the compression codecs.

          Owen O'Malley made changes -
          Field Original Value New Value
          Link This issue relates to HADOOP-441 [ HADOOP-441 ]
          Doug Cutting added a comment -

          > We do want the io.compression.codecs so that it is easy to extend the list of potential codecs.

          You mean, so that it's easy to extend the mapping from file extension to codec, right? Is there any other reason to enumerate codecs?

          > I would think that the SequenceFileOutputFormat should use the same property.

          But the codec there will be noted inside the file, not in the extension, right? And compression there can be either value-only or of blocks of keys and values. So, even if you had a file extension, it wouldn't tell the whole story. So I don't think the property has anything to do with SequenceFile. So the property should probably have 'text' in its name.

          Owen O'Malley made changes -
          Link This issue is duplicated by HADOOP-374 [ HADOOP-374 ]
          Owen O'Malley added a comment -

          This patch does:
          1. Fixes TextInputFormat to work with non-ascii UTF-8
          2. Adds a gzip codec for reading and writing .gz files
          3. Exposes Text.set(byte[], int, int) so that you can set the Text to a non-zero offset and length.
          4. Renames Text.validateUTF to validateUTF8
          5. Adds a CompressionCodecFactory that finds a codec based on a filename extension. The factory includes static methods to set/get the list of codecs in a Configuration.
          6. Adds test cases for reading gzipped files for TextInputFormat
          7. Adds test cases for reading UTF8 via TextInputFormat
          8. Adds test cases for the CompressionCodecFactory
          9. InputFormatBase gets a new virtual method to determine whether a file is splittable.
          10. TextInputFormat exposes a readLine method that reads bytes until a newline.
          11. TextOutputFormat will write compressed text files with a configurable compression codec.
          12. Removes an extra loop through the splits to count the number of bytes.
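Items 1 and 9 above interact: a gzip stream cannot be split mid-file, so a compressed input must become a single split, while uncompressed text can be chopped at the configured split size. A minimal sketch of that decision, under the assumption stated in the description (SplitDecision is a hypothetical name, not a class from the patch):

```java
public class SplitDecision {
    // A gzip-compressed file cannot be read from an arbitrary offset,
    // so the whole file becomes one split; otherwise split by size.
    static int numSplits(long fileLength, long splitSize, boolean compressed) {
        if (compressed) {
            return 1;
        }
        // Ceiling division: a partial trailing chunk still needs a split.
        return (int) ((fileLength + splitSize - 1) / splitSize);
    }
}
```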

          Owen O'Malley made changes -
          Attachment text-gz.patch [ 12340420 ]
          Owen O'Malley made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting added a comment -

          Should we specify defaults for mapred.output.compression.codec and io.compression.codecs in hadoop-default.xml?

          Is the TODO in SequenceFileOutputFormat.java now obsolete?

          Should we add compression-related configuration methods to JobConf, or instead as static methods to a (new) CompressConfig class. One would then do something like:

          CompressConfig.setOutputCodecClass(job, GzipCodec.class);

          Owen O'Malley added a comment -

          This patch extends the previous patch:
          1. the properties are added to the hadoop-default.xml
          2. the SequenceFileOutputFormat now uses the same JobConf methods for determining whether to compress the output and which codec to use
          3. cleaned up the out of date TODOs in the SequenceFile stuff
          4. created public static methods in CompressionCodecFactory to set/get the set of codecs into a Configuration
          5. created public static methods in SequenceFileOutputFormat to set/get the CompressionType. Arguably, these should be in SequenceFile itself

          I think having the [gs]etCompressOutput methods in JobConf is reasonable since they are intended to be used by all of the OutputFormats.

          Owen O'Malley made changes -
          Attachment text-gz-2.patch [ 12340440 ]
          Doug Cutting added a comment -

          > I think having the [gs]etCompressOutput methods in JobConf is reasonable since they are intended to be used by all of the OutputFormats.

          Yes, they're used by the output formats we supply, but probably not by all output formats that folks might define. To be consistent, I think we should adopt the following rule for JobConf: it should only set properties which are used by mapreduce kernel code, not by code in pluggable user classes. This has obviously not been the policy prior to this. There are exceptions in the current code. But I think this will lead to better code organization. Long-term, access to things like mapred.input.dir should move to InputFormatBase. Glancing through JobConf, there are only a few such things that would need to move: the vast majority of parameters set in JobConf are already kernel stuff. So we're close to implementing this rule and, if we elect to observe it, shouldn't stray further.

          Do you think this is a reasonable rule?

          Also, I think compressor should be spelled with an 'o'.

          Owen O'Malley added a comment -

          This patch extends the previous one by:
          1. moving the set/get methods for output compression to OutputFormatBase
          2. moving the set/get methods for sequence file compression type to SequenceFile
          3. adding new methods to JobConf for set/get of the map output compression codec
          4. change the attribute names from mapred.seqfile.* to io.seqfile.*

          Owen O'Malley made changes -
          Attachment text-gz-3.patch [ 12340480 ]
          Doug Cutting added a comment -

          I just committed this. Thanks, Owen!

          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]

            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Owen O'Malley
            • Votes: 0
            • Watchers: 0
