Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode, hdfs-client
    • Labels: None

      Description

      In practice, we find that a lot of users store text data in HDFS without using any compression codec. Improving usability of compressible formats like Avro/RCFile helps with this, but we could also help many users by providing an option to transparently compress data as it is stored.

        Issue Links

          Activity

          Todd Lipcon added a comment -

          I'm thinking something like the following:

          • DFSClient can optionally specify a compression codec when writing a file. If specified, each "packet" in the write pipeline will be compressed with that codec.
          • DataNode uses a special header in the block meta file to indicate that the block is compressed with the given codec.
          • To facilitate random access, an index file is kept (either separately or as part of the block meta file) which contains pairs of (uncompressed offset, compressed offset). This allows a binary search to locate the compression block containing any given offset (see the sketch after this comment).
          • DFSClient reader is modified to support decompression on the client side.
          • Some handshaking will be necessary in case the set of codecs available on the client and server differ.

          Any thoughts on this? Not sure when I'd have time to work on it, but worth starting some brainstorming.
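
          As a minimal sketch of the random-access idea above (not HDFS code; the class and field names here are hypothetical), a reader could binary-search the proposed (uncompressed offset, compressed offset) index to find the compression block covering a requested position:

{code:java}
import java.util.Arrays;

/**
 * Illustrative sketch only, not part of HDFS: given an index of
 * (uncompressed offset, compressed offset) pairs, find the compression
 * block that contains a requested uncompressed position.
 */
public class CompressionBlockIndex {

  /** One index entry: where a compression block starts in both coordinate spaces. */
  static class Entry {
    final long uncompressedOffset;
    final long compressedOffset;
    Entry(long u, long c) { uncompressedOffset = u; compressedOffset = c; }
  }

  private final Entry[] entries; // sorted by uncompressedOffset, first entry at 0

  CompressionBlockIndex(Entry[] entries) {
    this.entries = entries;
  }

  /**
   * Binary search for the last entry whose uncompressed offset is <= pos.
   * The reader would seek to that entry's compressed offset and decompress
   * forward until it reaches pos.
   */
  Entry findBlockFor(long pos) {
    int lo = 0, hi = entries.length - 1, ans = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (entries[mid].uncompressedOffset <= pos) {
        ans = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return entries[ans];
  }

  public static void main(String[] args) {
    // Example: 64 KB compression blocks that compressed to varying sizes.
    CompressionBlockIndex idx = new CompressionBlockIndex(new Entry[] {
        new Entry(0,      0),
        new Entry(65536,  21000),
        new Entry(131072, 45500),
        new Entry(196608, 66200),
    });
    Entry e = idx.findBlockFor(150000);
    System.out.println("seek compressed stream to " + e.compressedOffset
        + ", then skip " + (150000 - e.uncompressedOffset) + " uncompressed bytes");
  }
}
{code}

          The reader would then seek the compressed stream to the returned compressed offset, decompress forward, and discard bytes until it reaches the requested uncompressed position.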

          Uma Maheswara Rao G added a comment -

          Hi Todd,

          In our cluster we implemented compression support for HDFS (HDFS-1640).
          However, we do not store the compressed data in DFS; we decompress the data before storing it. The main goal of our compression is to save network bandwidth, and we were able to achieve ~50-70% improvements in read and write operations.

          "Not sure when I'd have time to work on it"

          We will be happy to coordinate our efforts in implementing this feature.
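
          For context, transport-only compression of this kind (compress a packet before it goes on the wire, decompress it before it is stored) can be illustrated with plain java.util.zip streams. This is only a sketch to show the idea, not the HDFS-1640 implementation:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/**
 * Illustrative sketch only: compress a payload on the sending side and
 * decompress it on the receiving side before it is stored, so only the
 * network transfer benefits from compression.
 */
public class WireCompressionSketch {

  static byte[] compress(byte[] payload) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
      out.write(payload);
    }
    return buf.toByteArray();
  }

  static byte[] decompress(byte[] wire) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (InflaterInputStream in =
             new InflaterInputStream(new ByteArrayInputStream(wire))) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 50; i++) {
      sb.append("text data that repeats, ");
    }
    byte[] packet = sb.toString().getBytes(StandardCharsets.UTF_8);
    byte[] onTheWire = compress(packet);      // what crosses the network
    byte[] stored = decompress(onTheWire);    // what would be written to disk
    System.out.println(packet.length + " bytes sent as " + onTheWire.length
        + " bytes on the wire; " + stored.length + " bytes stored uncompressed");
  }
}
{code}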

          Michael Schmitz added a comment -

          An easier feature might be to automatically select the proper codec based on the file extension when the file is used as input to a job. Also, when using streaming with compression you get the offset as the key, but not when you use an uncompressed TSV. It would be nice if this behavior were uniform.
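
          For reference, Hadoop's CompressionCodecFactory already maps a file extension to a codec, which is roughly how the text input formats decide whether to decompress. A minimal sketch of selecting a codec by extension (the path used here is just an example):

{code:java}
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

/**
 * Illustrative sketch: pick a decompression codec from the file extension
 * and read the file, falling back to plain bytes when no codec matches.
 */
public class CodecByExtension {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]); // e.g. /data/logs/part-00000.gz

    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(file); // null if extension is unknown

    InputStream in = (codec == null)
        ? fs.open(file)                              // read uncompressed data as-is
        : codec.createInputStream(fs.open(file));    // decompress transparently
    try {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    } finally {
      in.close();
    }
  }
}
{code}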

          Suresh Srinivas added a comment -

          Todd, depending on how this functionality shapes up, it could require a lot of changes to HDFS. Please post a design document when the mechanism is in reasonable shape.

          Roy Roye added a comment -

          This technical report (http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-36.pdf) says:

          We analyzed how compression can improve performance and energy efficiency for MapReduce workloads. Our results show that compression provides 35-60% energy savings for read heavy jobs as well as jobs with highly compressible data. Based on our measurements, we construct an algorithm that examines per-job data characteristics and IO patterns, and decides when and where to use compression.


            People

            • Assignee: Unassigned
            • Reporter: Todd Lipcon
            • Votes: 4
            • Watchers: 29

              Dates

              • Created:
              • Updated:

                Development