Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-8800

Dynamic Compress Stream

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.0.1-alpha
    • None
    • io

    Description

      Introduce

      This is a patch to provide a feature that a big file can be transferred in different compress algorithm and level dynamically. So the performance would be better.

      Why

      Compression is important in transfer big file. I found that we should need different compress algorithm in different cases. I have tested some cases.

        compression ratio compression throughput(MB/s) Throughput in 100MB/s network Throughput in 10MB/s network
      ZLIB 35.80% 9.6 9.6 9.6
      LZO 54.40% 101.7 101.7 18.38235294
      LIBLZF 54.60% 134.3 134.3 18.3150183
      QUICKLZ 54.90% 183.4 182.1493625 18.21493625
      FASTLZ 56.20% 134.4 134.4 17.79359431
      SNAPPY 59.80% 189 167.2240803 16.72240803
      NONE 100% 300 100 10

      So there are no “perfect compress algorithm” can suit all cases. I want to build a dynamic “CompressOutputStream” to change “compress algorithm” in runtime.

      In a busy cluster, The CPU and the network is changing all the time. There are some cases:

      • The cluster is very huge; the network between each node is not the same. Some are 10MB/s, some are 100MB/s. We cannot choose a perfect compress algorithm in a MapReduce Job.
      • The CPU can be use up when there is a “big job”, but the network is free. The job should not compress any more.
      • Some node in a cluster have high load, but the others not. We should use different compress algorithm for them.

      What

      In my idea, I can transfer a file with blocks. First block may use a compress algorithm such as LZO. After transfer it, we can get some information, and decide what the compress algorithm in the next block. Like TCP, it starts up slowly, and try run faster and faster. It can make the io faster by choose a more suitable compress algorithm.

      Design

      In a big file transfer, compress and network all takes time. Consider to transfer a fixed size file:

       
      T = c/p + r/s
      

      Define:

      • T: the total time used
      • C:the CPU cycle to compress the file
      • P:the available CPU GHz for compressing
      • R:compress radio
      • S:the available speed(throughput) of network (including decompress)

      The variable are not fixed, the all can be change mainly by some reason.

      • C:decide by different content , algorithm and algorithm level
      • P: decide by CPU , processes at the machine
      • R: decide by different content , algorithm and algorithm level
      • S: decide by network, the pairs’ network and processes at the machine.

      The file is transferred by block. After a block transferred, there is some information we can get:

      • C/P : the time takes for compress
      • R : the compress radio for the block
      • R/S : the time takes for network

      With the information and some reasonable assumptions, we can forecast each compress algorithm’s performance. The reasonable assumptions are:

      • In one transfer, the content is similar
      • P, S is continuous, we can assume that the next P, S is the same as the current one.
      • C, R is proportional totally. For example, LZO is always faster than ZIP.

      With the information and reasonable assumptions, we can forecast the next one by:

      • C2/P2 = (last C1/P1 ) * (avg C2/C1)
      • R2/S2 = F(R1) / S1 (S1 is known)
      • F(R1)=R1*(avg R2/R1)

      Then we can know the time each compress algorithm needs. And choose the best one to compress next block.
      To optimize the avg values, we must log some statistics information. To statistic the avg, we can set an N = 3, 5…, and change the avg value after each information comes.

      Avg V = (n-1) / n * Avg V+ V / n
      

      Next Work

      I would try to submit a patch later. And is there anyone interested about that?

      Attachments

        Activity

          People

            Unassigned Unassigned
            yankay Kay Yan
            Votes:
            2 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - 168h
                168h
                Remaining:
                Remaining Estimate - 168h
                168h
                Logged:
                Time Spent - Not Specified
                Not Specified