Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.0.1-alpha
- Component/s: None
- Labels: compress, dynamic
Description
Introduce
This is a patch to provide a feature that lets a big file be transferred with the compression algorithm and level changed dynamically, so the transfer performance is better.
Why
Compression is important when transferring big files. I found that different cases need different compression algorithms. I have tested some cases:
| Algorithm | Compression ratio | Compression throughput (MB/s) | Throughput in 100MB/s network (MB/s) | Throughput in 10MB/s network (MB/s) |
| --- | --- | --- | --- | --- |
| ZLIB | 35.80% | 9.6 | 9.6 | 9.6 |
| LZO | 54.40% | 101.7 | 101.7 | 18.38 |
| LIBLZF | 54.60% | 134.3 | 134.3 | 18.32 |
| QUICKLZ | 54.90% | 183.4 | 182.15 | 18.21 |
| FASTLZ | 56.20% | 134.4 | 134.4 | 17.79 |
| SNAPPY | 59.80% | 189 | 167.22 | 16.72 |
| NONE | 100% | 300 | 100 | 10 |
So there is no "perfect compression algorithm" that suits all cases. I want to build a dynamic "CompressOutputStream" that can change the compression algorithm at runtime.
In a busy cluster, the CPU and network load are changing all the time. Some example cases:
- The cluster is very large and the network between nodes varies: some links are 10MB/s, some are 100MB/s. We cannot choose one perfect compression algorithm for a whole MapReduce job.
- The CPU can be used up when there is a "big job" while the network is free; in that case the job should not compress at all.
- Some nodes in a cluster have a high load while the others do not. We should use different compression algorithms for them.
What
In my idea, the file is transferred in blocks. The first block may use a compression algorithm such as LZO. After it is transferred, we get some measurements and decide which compression algorithm to use for the next block. Like TCP, it starts up slowly and then tries to run faster and faster. It makes the I/O faster by choosing a more suitable compression algorithm.
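As a rough sketch of this idea (not the actual patch), the class below buffers data into fixed-size blocks, compresses each block with the currently selected codec, measures compression time and size, and asks a pluggable selector which codec to use for the next block. All names (DynamicCompressOutputStream, CodecSelector, BlockStats) are hypothetical, and java.util.zip.Deflater plus a pass-through NONE codec stand in for the real LZO/Snappy libraries:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

/**
 * Hypothetical sketch: buffer data into fixed-size blocks, compress each block
 * with the currently selected codec, record what happened, and let a pluggable
 * selector decide which codec to use for the next block.
 */
public class DynamicCompressOutputStream extends OutputStream {

    public enum Codec { NONE, ZLIB }  // a real patch would add LZO, SNAPPY, ...

    /** Measurements taken for one transferred block. */
    public static class BlockStats {
        public final Codec codec;
        public final long rawBytes, compressedBytes, compressMillis;
        BlockStats(Codec codec, long raw, long comp, long millis) {
            this.codec = codec; rawBytes = raw; compressedBytes = comp; compressMillis = millis;
        }
    }

    /** Strategy that picks the codec for the next block from the last block's stats. */
    public interface CodecSelector {
        Codec next(BlockStats lastBlock);
    }

    private final OutputStream out;
    private final CodecSelector selector;
    private final byte[] block;
    private int filled = 0;
    private Codec codec = Codec.ZLIB;  // start conservatively, like TCP slow start

    public DynamicCompressOutputStream(OutputStream out, int blockSize, CodecSelector selector) {
        this.out = out;
        this.block = new byte[blockSize];
        this.selector = selector;
    }

    @Override
    public void write(int b) throws IOException {
        block[filled++] = (byte) b;
        if (filled == block.length) {
            flushBlock();
        }
    }

    private void flushBlock() throws IOException {
        long start = System.currentTimeMillis();
        byte[] payload;
        if (codec == Codec.NONE) {
            payload = Arrays.copyOf(block, filled);
        } else {
            Deflater deflater = new Deflater();
            deflater.setInput(block, 0, filled);
            deflater.finish();
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            byte[] tmp = new byte[8192];
            while (!deflater.finished()) {
                compressed.write(tmp, 0, deflater.deflate(tmp));
            }
            deflater.end();
            payload = compressed.toByteArray();
        }
        long millis = System.currentTimeMillis() - start;
        // A real implementation would also write a small block header
        // (codec id and lengths) so the reader knows how to decompress.
        out.write(payload);
        codec = selector.next(new BlockStats(codec, filled, payload.length, millis));
        filled = 0;
    }

    @Override
    public void close() throws IOException {
        if (filled > 0) {
            flushBlock();
        }
        out.close();
    }
}
```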
Design
In a big file transfer, both compression and the network take time. Consider transferring a fixed-size file:
T = C/P + R/S
Define:
- T: the total time used
- C: the CPU cycles needed to compress the file
- P: the available CPU speed (GHz) for compression
- R: the compression ratio (compressed size / original size), i.e. how much data is actually sent
- S: the available network speed (throughput), including decompression overhead
These variables are not fixed; each of them changes, mainly for the following reasons:
- C: decided by the content, the algorithm, and the algorithm level
- P: decided by the CPU and the other processes on the machine
- R: decided by the content, the algorithm, and the algorithm level
- S: decided by the network, the link between the pair of machines, and the other processes on the machine
The file is transferred block by block. After a block is transferred, we can obtain some information:
- C/P: the time taken to compress the block
- R: the compression ratio of the block
- R/S: the time taken by the network
With this information and some reasonable assumptions, we can forecast each compression algorithm's performance. The assumptions are:
- Within one transfer, the content of successive blocks is similar.
- P and S change continuously, so we can assume the next P and S are the same as the current ones.
- C and R are roughly proportional across algorithms (their relative cost and ratio stay stable); for example, LZO is always faster than ZLIB.
With this information and these assumptions, we can forecast the cost of the next block for each candidate algorithm, where subscript 1 refers to the algorithm used for the last block, subscript 2 to a candidate algorithm, and the "avg" factors are the learned ratios between the two algorithms:
- C2/P2 = (last C1/P1) * (avg C2/C1)
- R2/S2 = F(R1) / S1 (S1 is known)
- F(R1) = R1 * (avg R2/R1)
Then we know the estimated time each compression algorithm would need and can choose the best one to compress the next block.
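A selector following this forecast could look like the sketch below; the names are hypothetical, and the per-codec cost/size indexes (seeded here from the benchmark table above) are just one simple way to represent the learned avg C2/C1 and avg R2/R1 ratios, not necessarily how the final patch would store them:

```java
import java.util.EnumMap;
import java.util.Map;

/**
 * Hypothetical per-block codec chooser implementing the forecast
 * T(candidate) = C2/P2 + R2/S2 described above.  Relative indexes stand in
 * for the learned ratios: avg(C2/C1) = costIndex(candidate)/costIndex(last),
 * avg(R2/R1) = sizeIndex(candidate)/sizeIndex(last).
 */
public class ForecastingCodecSelector {

    public enum Codec { NONE, ZLIB, LZO, SNAPPY }

    // Seeded from the benchmark table: cost index = 1 / compression throughput
    // (CPU seconds per MB), size index = compression ratio.  Updated at runtime.
    final Map<Codec, Double> costIndex = new EnumMap<>(Codec.class);
    final Map<Codec, Double> sizeIndex = new EnumMap<>(Codec.class);

    public ForecastingCodecSelector() {
        costIndex.put(Codec.NONE,   1.0 / 300.0);
        costIndex.put(Codec.ZLIB,   1.0 / 9.6);
        costIndex.put(Codec.LZO,    1.0 / 101.7);
        costIndex.put(Codec.SNAPPY, 1.0 / 189.0);
        sizeIndex.put(Codec.NONE,   1.000);
        sizeIndex.put(Codec.ZLIB,   0.358);
        sizeIndex.put(Codec.LZO,    0.544);
        sizeIndex.put(Codec.SNAPPY, 0.598);
    }

    /**
     * Pick the codec for the next block from the last block's measurements.
     *
     * @param last          codec used for the last block
     * @param compressSecs  measured C1/P1: time spent compressing the last block
     * @param compressedMB  compressed size of the last block in MB (R1 for a fixed block size)
     * @param networkSecs   measured R1/S1: time spent sending the last block
     */
    public Codec next(Codec last, double compressSecs, double compressedMB, double networkSecs) {
        double s1 = compressedMB / networkSecs;  // observed network speed S1 (MB/s)
        Codec best = last;
        double bestTime = Double.MAX_VALUE;
        for (Codec candidate : Codec.values()) {
            double ratioC = costIndex.get(candidate) / costIndex.get(last);  // avg C2/C1
            double ratioR = sizeIndex.get(candidate) / sizeIndex.get(last);  // avg R2/R1
            double cpuTime = compressSecs * ratioC;       // C2/P2 = (C1/P1) * avg(C2/C1)
            double netTime = compressedMB * ratioR / s1;  // R2/S2 = R1 * avg(R2/R1) / S1
            if (cpuTime + netTime < bestTime) {
                bestTime = cpuTime + netTime;
                best = candidate;
            }
        }
        return best;
    }
}
```

A stream like the DynamicCompressOutputStream sketched earlier would call next() after every block and switch codecs accordingly.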
To keep the avg values up to date, we must log some statistics. We can pick an N = 3, 5, ..., and update the avg value every time a new measurement arrives:
Avg V = (N-1)/N * Avg V + V/N
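As a small sketch, this update rule could live in a helper like the one below (hypothetical name); the selector's cost and size indexes would then be stored as RunningAverage values and refreshed after every block:

```java
/** Running average with window N: Avg V = (N-1)/N * Avg V + V/N. */
public class RunningAverage {
    private final int n;   // window size, e.g. N = 3 or 5
    private double avg;

    public RunningAverage(int n, double initial) {
        this.n = n;
        this.avg = initial;
    }

    /** Fold a new observation v into the average and return the new value. */
    public double update(double v) {
        avg = (n - 1.0) / n * avg + v / n;
        return avg;
    }

    public double get() {
        return avg;
    }
}
```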
Next Work
I will try to submit a patch later. Is anyone else interested in this?