  1. Hadoop Common
  2. HADOOP-11644

Contribute CMX compression


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • Component/s: io
    • None

    Description

      Hadoop natively supports four main compression algorithms: BZIP2, LZ4, Snappy and ZLIB.

      Each one of these algorithms fills a gap:

      BZIP2 : Very high compression ratio, splittable
      LZ4 : Very fast, not splittable
      Snappy : Very fast, not splittable
      ZLIB : Good balance of compression ratio and speed

      We think there is a gap for a compression algorithm that compresses and decompresses quickly while also being splittable. Such an algorithm can help significantly on jobs where the input files are 1 GB or larger.
      To fill this gap, IBM has developed CMX. CMX is a dictionary-based, block-oriented, splittable, concatenable compression algorithm developed specifically for Hadoop workloads. Many of our customers use CMX, and we would love to be able to contribute it to hadoop-common.

      CMX is block oriented : We typically use 64k blocks. Blocks are independently decompressible.

      CMX is splittable : We implement the SplittableCompressionCodec interface. The size of every CMX file is a multiple of 64k, so splittability is achieved in a simple way, with no need for external indexes.
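
      To illustrate why fixed 64k alignment makes splitting simple, here is a minimal sketch of how a reader could snap an arbitrary byte range to block boundaries. It is not taken from the attached patches: the class name CmxSplitSketch, the methods alignUp and alignSplit, and the round-up rule are all hypothetical, and the actual CMX codec may adjust split boundaries differently.

```java
// Hypothetical sketch, not the CMX implementation: snaps a split's byte range
// to 64k block boundaries so that each block is read by exactly one split.
final class CmxSplitSketch {
    // Block size taken from the description ("we typically use 64k blocks").
    static final long BLOCK_SIZE = 64 * 1024;

    /** Returns the smallest multiple of BLOCK_SIZE that is >= offset. */
    static long alignUp(long offset) {
        return ((offset + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    }

    /**
     * Assumed rule: a split owns every block whose first byte lies in
     * [start, end). Both ends are rounded up to the next block boundary
     * and clamped to the file length.
     * Returns {alignedStart, alignedEnd}.
     */
    static long[] alignSplit(long start, long end, long fileLength) {
        long alignedStart = Math.min(alignUp(start), fileLength);
        long alignedEnd = Math.min(alignUp(end), fileLength);
        return new long[] { alignedStart, alignedEnd };
    }

    public static void main(String[] args) {
        long fileLength = 4 * BLOCK_SIZE; // a file made of four 64k blocks

        // Two adjacent splits whose raw boundary (100000) falls inside block 1.
        long[] first = alignSplit(0, 100_000, fileLength);
        long[] second = alignSplit(100_000, fileLength, fileLength);

        System.out.println("first split:  " + first[0] + ".." + first[1]);
        System.out.println("second split: " + second[0] + ".." + second[1]);
    }
}
```

      Because both splits round the shared boundary the same way, the aligned ranges neither overlap nor leave a gap, and no index lookup is needed to find where a block starts.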

      CMX is concatenable : Two independent CMX files can be concatenated together. We have seen that some projects like Apache Flume require this feature.

      Attachments

        1. HADOOP-11644.001.patch
          154 kB
          Xabriel J Collazo Mojica
        2. HADOOP-11644.002.patch
          152 kB
          Xabriel J Collazo Mojica


            People

              Assignee: Xabriel J Collazo Mojica
              Reporter: Xabriel J Collazo Mojica
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: 336h
                  Remaining: 336h
                  Logged: Not Specified