[HADOOP-11644] Contribute CMX compression - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: io
Labels: None
Target Version/s: 2.8.3

Description

Hadoop natively supports four main compression algorithms: bzip2, LZ4, Snappy, and zlib.

Each one of these algorithms fills a gap:

bzip2: very high compression ratio, splittable
LZ4: very fast, not splittable
Snappy: very fast, not splittable
zlib: good balance of compression and speed

We think there is a gap for a compression algorithm that is fast at both compression and decompression while also being splittable. This can help significantly on jobs where the input files are 1 GB or larger.
To fill this gap, IBM has developed CMX. CMX is a dictionary-based, block-oriented, splittable, concatenable compression algorithm developed specifically for Hadoop workloads. Many of our customers use CMX, and we would like to contribute it to hadoop-common.

CMX is block-oriented: we typically use 64k blocks. Blocks are independently decompressible.

CMX is splittable: we implement the SplittableCompressionCodec interface. All CMX files are a multiple of 64k, so splittability is achieved simply, with no need for external indexes.
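
To illustrate why fixed-size blocks make splitting simple, here is a minimal sketch of how a reader could snap an arbitrary split range to block boundaries. It is not taken from the CMX patch. It assumes that every compressed block starts at a 64 KiB-aligned file offset, and the class and method names (BlockAlignedSplit, alignUp, adjustSplit) are hypothetical.

```java
class BlockAlignedSplit {
    // Assumed block size: the description says 64k blocks are typical.
    static final long BLOCK_SIZE = 64L * 1024;

    // Rounds an offset up to the next block boundary; aligned offsets are unchanged.
    static long alignUp(long offset) {
        return ((offset + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    }

    // Returns {adjustedStart, adjustedEnd} for a requested split [start, end).
    // Both ends are rounded up, so adjacent splits neither overlap nor leave gaps.
    static long[] adjustSplit(long start, long end, long fileLength) {
        long adjustedStart = Math.min(alignUp(start), fileLength);
        long adjustedEnd = Math.min(alignUp(end), fileLength);
        return new long[] {adjustedStart, adjustedEnd};
    }

    public static void main(String[] args) {
        long fileLength = 10 * BLOCK_SIZE; // hypothetical file of ten blocks
        long[] first = adjustSplit(0, 100_000, fileLength);
        long[] second = adjustSplit(100_000, 300_000, fileLength);
        System.out.println(first[0] + " " + first[1]);   // 0 131072
        System.out.println(second[0] + " " + second[1]); // 131072 327680
    }
}
```

Because both ends of a split are rounded the same way, the adjusted end of one split equals the adjusted start of the next, so each block is read by exactly one task without consulting an index.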

CMX is concatenable: two independent CMX files can be concatenated together. We have seen that some projects, such as Apache Flume, require this feature.

Attachments

HADOOP-11644.001.patch (154 kB), 25/Jun/15 01:41, Xabriel J Collazo Mojica
HADOOP-11644.002.patch (152 kB), 02/Jul/15 00:26, Xabriel J Collazo Mojica

Issue Links

is related to: PIG-4341 Add CMX support to pig.tmpfilecompression.codec (Open)

People

Assignee: Xabriel J Collazo Mojica

Reporter: Xabriel J Collazo Mojica

Votes: 0

Watchers: 15

Dates

Created: 28/Feb/15 00:11

Updated: 11/Sep/17 05:21

Time Tracking

Estimated: 336h

Remaining: 336h

Logged: Not Specified