  1. Hadoop Common
  2. HADOOP-9996

Improve TFile format to support any compression codecs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: io
    • Labels:

      Description

      TFile is a container of key-value pairs. It supports block-level compression through a compression codec. However, the current implementation supports only a fixed set of compression options: LZO, GZ, or no compression. Newer codecs such as Snappy cannot be used because of this limitation.

      We propose extending the existing TFile compression feature to support any compression codec. TFile already identifies the codec by name and stores that name in the file metadata (for example, “lzo” is stored when LZO compression is used), and this cannot change without breaking backward compatibility. To support arbitrary codecs, we add a special name, “codec”, followed by the real codec class name. For example, “codec: org.apache.hadoop.io.compress.SnappyCodec” is stored in the metadata when SnappyCodec is the compression codec. The existing fixed names “lzo”, “gz”, and “none” can still be used to specify the TFile compression codec.
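
      The naming scheme described above can be sketched as follows. This is a minimal illustration only and is not taken from the attached patch: the class TFileCompressionName and its methods are hypothetical, and the exact separator handling (a colon, optionally followed by whitespace) is an assumption based on the “codec: …” example in the description. The sketch works on strings only and does not load or instantiate any Hadoop codec class.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper (not part of the attached patch): encodes and decodes
 * the compression name stored in TFile metadata.
 */
final class TFileCompressionName {
    // Fixed names listed in the issue description; kept for backward compatibility.
    static final List<String> FIXED_NAMES = Arrays.asList("lzo", "gz", "none");

    // Special name proposed in the issue. The ':' separator is assumed from
    // the "codec: org.apache.hadoop.io.compress.SnappyCodec" example.
    static final String CODEC_PREFIX = "codec:";

    private TFileCompressionName() {}

    /** Returns true if the stored name is one of the existing fixed names. */
    static boolean isFixed(String stored) {
        return stored != null && FIXED_NAMES.contains(stored);
    }

    /** Builds the metadata string for an arbitrary codec class name. */
    static String encode(String codecClassName) {
        return CODEC_PREFIX + " " + codecClassName;
    }

    /**
     * Returns the codec class name for a "codec: <class>" entry, or null when
     * the stored name is one of the fixed names. Throws for anything else.
     */
    static String codecClassName(String stored) {
        if (isFixed(stored)) {
            return null; // existing files keep working unchanged
        }
        if (stored != null && stored.startsWith(CODEC_PREFIX)) {
            String className = stored.substring(CODEC_PREFIX.length()).trim();
            if (!className.isEmpty()) {
                return className;
            }
        }
        throw new IllegalArgumentException("Unsupported compression name: " + stored);
    }

    public static void main(String[] args) {
        String stored = encode("org.apache.hadoop.io.compress.SnappyCodec");
        System.out.println(stored);
        System.out.println(codecClassName(stored));
        System.out.println(codecClassName("gz"));
    }
}
```

      In a real reader, the returned class name would presumably be resolved to a codec instance through Hadoop's usual reflection facilities; that step is left out here because the patch's actual approach is not described in this issue.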

        Attachments

        1. HADOOP-9996.patch
          15 kB
          Haifeng Chen

              People

              • Assignee:
                Unassigned
                Reporter:
                jerrychenhf Haifeng Chen
              • Votes:
                0
                Watchers:
                4

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  72h
                  Remaining:
                  72h
                  Logged:
                  Not Specified