Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io
    • Labels: None

      Description

      (note: related to HADOOP-4874)

      As per Doug's earlier comments, LZF does indeed look like a good compressor candidate: fast compression/decompression with a good enough compression rate.
      In my testing it is at least twice as fast as gzip at compression, and somewhat faster at decompression.
      The code from http://h2database.googlecode.com/svn/trunk/h2/src/main/org/h2/compress/ is applicable, and I have tested it with JSON data.

      I hope to have more time to spend on this in the near future, but if someone else gets to it first, that would be good too.

          Activity

          Todd Lipcon added a comment -

          Also similar to HADOOP-6349 for the FastLZ library.

          Tatu Saloranta added a comment -

          Ok: I am now working with the Voldemort team to get a good LZF codec adaptation (we need byte[]->byte[], with no need for streams in this case; we also prefer using the standard LZF framing so that the C version stays compatible), and the code is available at http://github.com/ijuma/h2-lzf.

          I can now have a look at what interface Hadoop uses for codecs, to see the best way to get the same or modified code hooked up.

          Also: one interesting thing about LZF is that its framing is not only very simple, but probably nice for splitting and merging larger files. There is no separate per-file header; a file is just a sequence of chunks with minimalistic headers. This means you can append chunks by simple concatenation, split a file apart the same way, or even shuffle chunks if need be. And skipping through chunks can be done using the headers alone, without decompressing the actual contents. That sounds quite nice for Hadoop's use case in general... but I don't know how much support is needed from the codec to let the framework make good use of this.
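          For illustration, here is a minimal sketch of that header-only chunk walking, assuming the standard LZF framing (each chunk begins with the bytes 'Z','V', a type byte where 0 = uncompressed and 1 = compressed, a 2-byte big-endian stored length, and, for compressed chunks, an additional 2-byte uncompressed length); the class and method names are invented for the example:

            import java.io.DataInputStream;
            import java.io.EOFException;
            import java.io.IOException;
            import java.io.InputStream;

            // Hypothetical helper: walks LZF chunks using only their headers,
            // skipping over payloads without decompressing anything.
            public class LzfChunkWalker {
                public static void listChunks(InputStream raw) throws IOException {
                    DataInputStream in = new DataInputStream(raw);
                    long offset = 0;
                    while (true) {
                        int b1 = in.read();
                        if (b1 < 0) {
                            break; // clean end of stream
                        }
                        int b2 = in.read();
                        int type = in.read();
                        if (b1 != 'Z' || b2 != 'V' || type < 0) {
                            throw new IOException("malformed LZF chunk at offset " + offset);
                        }
                        int storedLen = in.readUnsignedShort(); // big-endian payload length
                        int headerLen = 5;
                        int uncompLen = storedLen;
                        if (type == 1) { // compressed chunks also record the uncompressed size
                            uncompLen = in.readUnsignedShort();
                            headerLen = 7;
                        }
                        System.out.println("chunk @" + offset + ": " + storedLen
                                + " bytes stored, " + uncompLen + " uncompressed");
                        long toSkip = storedLen;
                        while (toSkip > 0) { // InputStream.skip may skip less than requested
                            long n = in.skip(toSkip);
                            if (n <= 0) {
                                throw new EOFException("truncated chunk at offset " + offset);
                            }
                            toSkip -= n;
                        }
                        offset += headerLen + storedLen;
                    }
                }
            }

          Because each chunk is self-contained, the same header arithmetic is what would let a splitter find the next chunk boundary after an arbitrary byte offset.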

          Tatu Saloranta added a comment -

          Hmmh. Looking at the hadoop-common compress package, I realize that Hadoop compressors are rather complicated beasts... it's a bit like reading the blueprint of a lunar module, at least compared to the relative simplicity of the LZF codec to be wrapped within the framework.
          So I could use some help in figuring out the best way to properly embed LZF in there, including the ability to support splitting.
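          For orientation, here is a minimal sketch (not the eventual integration code) of what such an embedding could look like, wrapping LZF streams such as those in the com.ning compress-lzf library inside Hadoop's CompressionCodec interface; pooled Compressor/Decompressor support, which a production codec should provide for CodecPool reuse, is stubbed out here:

            import java.io.IOException;
            import java.io.InputStream;
            import java.io.OutputStream;
            import org.apache.hadoop.io.compress.*;
            import com.ning.compress.lzf.LZFInputStream;
            import com.ning.compress.lzf.LZFOutputStream;

            // Sketch only: adapts LZF streams to Hadoop's codec contract.
            public class LzfCodec implements CompressionCodec {

                public CompressionOutputStream createOutputStream(final OutputStream out) throws IOException {
                    final LZFOutputStream lzf = new LZFOutputStream(out);
                    return new CompressionOutputStream(out) {
                        public void write(int b) throws IOException { lzf.write(b); }
                        public void write(byte[] b, int off, int len) throws IOException { lzf.write(b, off, len); }
                        public void finish() throws IOException { lzf.flush(); } // emits any pending chunk
                        public void resetState() { } // chunks are self-contained; nothing to reset
                    };
                }

                public CompressionOutputStream createOutputStream(OutputStream out, Compressor c) throws IOException {
                    return createOutputStream(out); // pooled Compressors not used in this sketch
                }

                public CompressionInputStream createInputStream(final InputStream in) throws IOException {
                    final LZFInputStream lzf = new LZFInputStream(in);
                    return new CompressionInputStream(in) {
                        public int read() throws IOException { return lzf.read(); }
                        public int read(byte[] b, int off, int len) throws IOException { return lzf.read(b, off, len); }
                        public void resetState() { } // each chunk decodes independently
                    };
                }

                public CompressionInputStream createInputStream(InputStream in, Decompressor d) throws IOException {
                    return createInputStream(in); // pooled Decompressors not used in this sketch
                }

                public Class<? extends Compressor> getCompressorType() { return null; }     // no pooling here
                public Compressor createCompressor() { return null; }
                public Class<? extends Decompressor> getDecompressorType() { return null; } // no pooling here
                public Decompressor createDecompressor() { return null; }

                public String getDefaultExtension() { return ".lzf"; }
            }

          Split support would be layered on top of this, along the lines of the SplittableCompressionCodec interface that was introduced for splittable bzip2.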

          Tatu Saloranta added a comment -

          Although I have not worked on the integration itself, I have published a simple reusable LZF block codec, available from github (http://github.com/ning/compress) and the main Maven repo (group com.ning, artifact compress-lzf). So at least the simple part (the codec itself) is ready for anyone with enough familiarity to handle the full integration, ideally supporting access at least at block level (reads can start from block boundaries; blocks are byte-aligned and contain both compressed and uncompressed block lengths to support reasonably efficient skipping of blocks).
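          A minimal byte[]-to-byte[] round-trip with the published artifact might look like the following, assuming the LZFEncoder/LZFDecoder entry points of com.ning:compress-lzf:

            import com.ning.compress.lzf.LZFDecoder;
            import com.ning.compress.lzf.LZFEncoder;

            public class LzfRoundTrip {
                public static void main(String[] args) throws Exception {
                    byte[] input = "some json or log payload".getBytes("UTF-8");
                    byte[] compressed = LZFEncoder.encode(input);    // byte[] -> byte[], standard LZF framing
                    byte[] restored = LZFDecoder.decode(compressed); // inverse operation
                    System.out.println(new String(restored, "UTF-8"));
                }
            }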

          Tatu Saloranta added a comment -

          The lzf4hadoop project at github – https://github.com/ning/lzf4hadoop – now provides the necessary wrappers.
          I hope to get more testing done to ensure the interaction with Hadoop abstractions works as intended; assuming things go well, this could serve as the implementation to use. Or, if a separate project and Maven-accessible artifacts are enough, maybe just add a link from the documentation.

          As to performance, see https://github.com/ning/jvm-compressor-benchmark.
          LZF is the fastest pure-Java compressor tested; of all included codecs only Snappy (which uses JNI to call the C implementation of the Snappy codec) is faster for decompression, and it is about as fast for compression.

          Compression rates among the basic Lempel-Ziv implementations (QuickLZ, LZO, Snappy, LZF) are comparable, and all are significantly faster than basic deflate (though with lower compression rates).
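          For completeness: a codec packaged this way is typically made visible to Hadoop through the io.compression.codecs configuration property. The LZF codec class name below is an assumed placeholder, not taken from lzf4hadoop's actual sources:

            import org.apache.hadoop.conf.Configuration;

            public class RegisterLzfCodec {
                public static void main(String[] args) {
                    Configuration conf = new Configuration();
                    // io.compression.codecs lists the codec classes Hadoop should load;
                    // the last entry is a hypothetical LZF codec class name.
                    conf.set("io.compression.codecs",
                            "org.apache.hadoop.io.compress.DefaultCodec,"
                            + "org.apache.hadoop.io.compress.GzipCodec,"
                            + "com.ning.lzf4hadoop.LZFCodec");
                }
            }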


            People

            • Assignee: Unassigned
            • Reporter: Tatu Saloranta
            • Votes: 8
            • Watchers: 25

