  1. Hadoop Common
  2. HADOOP-11644

Contribute CMX compression

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None
    • Target Version/s:

      Description

      Hadoop natively supports four main compression algorithms: BZIP2, LZ4, Snappy and ZLIB.

      Each one of these algorithms fills a gap:

      bzip2 : Very high compression ratio, splittable
      LZ4 : Very fast, non-splittable
      Snappy : Very fast, non-splittable
      zlib : Good balance of compression ratio and speed

      We think there is a gap for a compression algorithm that offers fast compression and decompression while also being splittable. This can help significantly on jobs whose input files are 1 GB or larger.
      For this, IBM has developed CMX. CMX is a dictionary-based, block-oriented, splittable, concatenable compression algorithm developed specifically for Hadoop workloads. Many of our customers use CMX, and we would love to be able to contribute it to hadoop-common.

      CMX is block-oriented : We typically use 64k blocks. Blocks are independently decompressible.

      CMX is splittable : We implement the SplittableCompressionCodec interface. The size of every CMX file is a multiple of 64k, so splittability is achieved in a simple way, with no need for external indexes.
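
The split handling described above can be sketched as follows. This is only an illustration, not code from the attached patches: it assumes that blocks occupy exactly 64 KiB on disk, so block boundaries fall at multiples of 65536, and the class and method names (BlockAlignedSplit, alignUp, adjustSplit) are hypothetical. Each split is moved forward to the next block boundary, so a block is read by the split that contains its first byte.

```java
public class BlockAlignedSplit {
    // Assumed fixed on-disk block size (64 KiB), based on the description above.
    static final long BLOCK_SIZE = 64L * 1024;

    // First block boundary at or after the given byte offset.
    static long alignUp(long offset) {
        return ((offset + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    }

    // Returns {adjustedStart, adjustedEnd} for a requested split [start, end).
    // Both ends move forward to a block boundary and are capped at the file length,
    // so adjacent splits neither overlap nor skip a block.
    static long[] adjustSplit(long start, long end, long fileLength) {
        long adjustedStart = Math.min(alignUp(start), fileLength);
        long adjustedEnd = Math.min(alignUp(end), fileLength);
        return new long[] { adjustedStart, adjustedEnd };
    }

    public static void main(String[] args) {
        long fileLength = 10 * BLOCK_SIZE;
        // Two adjacent splits whose shared boundary falls in the middle of a block.
        long[] first = adjustSplit(0, 300_000, fileLength);
        long[] second = adjustSplit(300_000, fileLength, fileLength);
        System.out.println(first[0] + ".." + first[1]);
        System.out.println(second[0] + ".." + second[1]);
    }
}
```

Because the boundaries can be computed from the offset alone, no index file or sync-marker scan is needed in this sketch; a real codec would report the adjusted offsets through the stream returned by SplittableCompressionCodec.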

      CMX is concatenable : Two independent CMX files can be concatenated together. We have seen that some projects like Apache Flume require this feature.
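
Why concatenation works can be illustrated with a small sketch. It assumes, as above, that every file consists of whole 64 KiB blocks and that no file-level header or trailer needs rewriting; the actual CMX file layout is not described in this issue, and the names used here (ConcatBlocks, concat, blockOffsets) are hypothetical. Under those assumptions, appending one file to another keeps every block on a 64 KiB boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatBlocks {
    // Assumed fixed on-disk block size (64 KiB).
    static final int BLOCK_SIZE = 64 * 1024;

    // Plain byte-level concatenation, as a tool like `cat` would do.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Start offsets of all blocks in a file made of whole blocks.
    static List<Integer> blockOffsets(byte[] file) {
        if (file.length % BLOCK_SIZE != 0) {
            throw new IllegalArgumentException("length is not a multiple of the block size");
        }
        List<Integer> offsets = new ArrayList<>();
        for (int off = 0; off < file.length; off += BLOCK_SIZE) {
            offsets.add(off);
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] fileA = new byte[2 * BLOCK_SIZE]; // stand-in for a 2-block file
        byte[] fileB = new byte[3 * BLOCK_SIZE]; // stand-in for a 3-block file
        byte[] joined = concat(fileA, fileB);
        System.out.println(blockOffsets(joined));
    }
}
```

Since each block is independently decompressible, a reader can treat the joined file as a single five-block file without knowing where the original files ended.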

        Attachments

        1. HADOOP-11644.001.patch
          154 kB
          Xabriel J Collazo Mojica
        2. HADOOP-11644.002.patch
          152 kB
          Xabriel J Collazo Mojica

              People

              • Assignee:
                xabriel Xabriel J Collazo Mojica
                Reporter:
                xabriel Xabriel J Collazo Mojica
              • Votes:
                0
                Watchers:
                16

                  Time Tracking

                  Estimated:
                  336h
                  Remaining:
                  336h
                  Logged:
                  Not Specified