Hadoop Common
HADOOP-7909

Implement a generic splittable signature-based compression format

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None

      Description

      I propose to take the suggestion of PIG-42 and extend it to:

      • add a more robust header such that false matches are vanishingly unlikely
      • repeat initial bytes of the header for very fast split searching
      • break down the stream into modest size chunks (~64k?) for rapid parallel encode and decode
      • provide length information on the blocks in advance to make block decode possible in hardware

      An optional extra header would be added to the gzip header, adding 36 bytes.

      <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
      <version> := 1 byte version field allowing us to later adjust the header definition
      <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where each letter represents a randomly generated byte
      <uncompressedDataLength> := 32-bit length of the data compressed into this record
      <compressedRecordLength> := 32-bit length of this record as compressed, including all headers, trailers
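
      A minimal sketch of how this header might be serialized, assuming the field order and widths above; the little-endian integer encoding, the placeholder signature constant, and the class name are assumptions made for illustration only:

      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;

      // Hypothetical encoder for the 32-byte split header (<sh>) described above.
      public final class SplitHeader {
          public static final byte VERSION = 1;

          // Placeholder for the fixed 23-byte signature "aaaaaaabcdefghijklmnopr":
          // one randomly chosen byte repeated 7 times followed by 16 more randomly
          // chosen bytes. The real values would be constants baked into the format.
          public static final byte[] SIGNATURE = new byte[23];

          // version(1) + signature(23) + uncompressedDataLength(4) + compressedRecordLength(4) = 32 bytes;
          // the enclosing gzip extra-field subheader adds another 4 bytes (36 total, as above).
          public static byte[] encode(int uncompressedDataLength, int compressedRecordLength) {
              ByteBuffer buf = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN); // endianness assumed
              buf.put(VERSION);
              buf.put(SIGNATURE);
              buf.putInt(uncompressedDataLength);
              buf.putInt(compressedRecordLength);
              return buf.array();
          }
      }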

      If multiple extra headers are present and the split header is not the first header, the initial implementation will not recognize the split.

      Input streams would be broken down into blocks which are appended, much as BlockCompressorStream does. Non-split-aware decoders will ignore this header and decode the appended blocks without ever noticing the difference.

      The signature has >= 132 bits of entropy which is sufficient for 80+ years of Moore's law before collisions become a significant concern.

      The first 7 bytes are repeated for speed. When splitting, the signature search will look for the 32-bit value aaaa every 4 bytes until a hit is found; the next 4 bytes then determine the alignment of the header mod 4, identifying a potential header match, and the whole header is validated at that offset. So, there is a load, compare, branch, and increment per 4 bytes searched.
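
      A rough sketch of that search, operating over an in-memory buffer for simplicity. The class and method names are hypothetical; rather than decoding the alignment from the word following the hit, it walks back over the run of repeated bytes and validates each candidate offset, which is equivalent for illustration:

      final class SignatureScanner {
          // Scan for the split-header signature: check one 4-byte word per 4-byte step
          // for the repeated first byte, then validate the full signature at the
          // candidate offsets. Returns the offset where the signature starts, or -1.
          static int findSignature(byte[] buf, byte[] signature) {
              byte a = signature[0]; // the byte repeated 7 times at the start of the signature
              for (int pos = 0; pos + 4 <= buf.length; pos += 4) {
                  // Fast check of this 4-byte word (a native implementation would compare one 32-bit load).
                  if (buf[pos] != a || buf[pos + 1] != a || buf[pos + 2] != a || buf[pos + 3] != a) {
                      continue;
                  }
                  // Potential hit: walk back to the start of the run of repeated bytes and
                  // validate the whole signature at each candidate offset up to pos.
                  int start = pos;
                  while (start > 0 && buf[start - 1] == a) {
                      start--;
                  }
                  for (int cand = start; cand <= pos; cand++) {
                      if (matchesAt(buf, cand, signature)) {
                          return cand;
                      }
                  }
              }
              return -1;
          }

          private static boolean matchesAt(byte[] buf, int off, byte[] signature) {
              if (off + signature.length > buf.length) {
                  return false;
              }
              for (int i = 0; i < signature.length; i++) {
                  if (buf[off + i] != signature[i]) {
                      return false;
                  }
              }
              return true;
          }
      }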

      The existing gzip implementations do not provide access to the optional header fields (nor comment nor filename), so the entire gzip header will have to be reimplemented, and compression will need to be done using the raw deflate options of the native library / built-in Deflater.

      There will be some degradation when using splittable gzip:

      • The gzip headers will each be 36 bytes larger (a 4-byte header for the extra field plus the 32-byte extra header itself).
      • There will be one gzip header per block.
      • History will have to be reset with each block to allow decoding to start from scratch at that offset, resulting in some bytes being stored as literals that would otherwise have been encoded as string matches.

      Issues to consider:

      • Is the searching fast enough without the repeating 7 bytes in the signature?
      • Should this be a patch to the existing gzip classes to add a switch, or should this be a whole new class?
      • Which level does this belong at? CompressionStream? Compressor?
      • Is it more advantageous to encode the signature into the less dense comment field?
      • Optimum block size? Smaller blocks split faster and may conserve memory; larger blocks provide a slightly better compression ratio.


          Activity

          Tim Broberg added a comment -

          Renaming to reflect abandonment of gzip compatibility as a goal.

          Tim Broberg added a comment -

          Well, as long as we're calling it a different format, no reason remains to stick with the gzip header wrapped around this.

          We can more or less pull the extension header out and use this directly.

          Tim Broberg added a comment -

          Still doing mental experiments here, I'm not advocating a particular approach yet...

          1 - Is there a fallback if we "fail" to split? What happens if a split "fails?" Consider the second and subsequent splits. They search from the start offset to the end and find no signatures, so they output 0 bytes. The first split starts output from offset 0 and extends its end until it finds a signature or hits EOF, so it outputs the whole thing. In this case, by using the splittable decode for a non-splittable stream, we waste time searching for split headers that don't exist, but the output is correct. Not ideal, but not fatal either. If it were a rare phenomenon to mix splittable and non-splittable, this might make sense, but probably not.

          2 - Can we failover to HADOOP-7076? What if the first split "succeeds" and the second "fails"? For HADOOP-7076, I imagine handling this by attempting to decode each split with the signature approach, and searching up to some reasonable max block size (1MB? 2MB?) for a max search time of ~1us. If we are unable to find a splittable gzip header at this point, or at any subsequent point when we look at the next gzip header, we fail over from the block-mode regime to the stream-mode regime of the HADOOP-7076 decoder for the remainder of that split.
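
          For illustration, that failover could be structured roughly as below; every type and method name here is hypothetical, and the search bound reflects the 1-2MB figure mentioned above:

          import java.io.IOException;

          // All names are hypothetical; this only sketches the failover idea in point 2.
          interface SplitContext {
              long getStart();
              long getEnd();
              // Search for the next split signature from the given offset, scanning at
              // most maxSearchBytes; returns the header offset, or -1 if none is found.
              long findSignature(long fromOffset, int maxSearchBytes) throws IOException;
              // Decode one block starting at the given header; returns the offset just past it.
              long decodeBlockAt(long headerOffset) throws IOException;
              // HADOOP-7076-style stream-mode decode of the remainder of the split.
              void streamModeDecodeFrom(long offset) throws IOException;
          }

          final class FailoverSketch {
              static void decodeSplit(SplitContext split) throws IOException {
                  final int maxSearch = 2 * 1024 * 1024; // bounded signature search (1-2MB as discussed)
                  long offset = split.getStart();
                  while (offset < split.getEnd()) {
                      long headerOffset = split.findSignature(offset, maxSearch);
                      if (headerOffset < 0) {
                          // No split header found within the bound: fail over to the
                          // stream-mode regime for the rest of this split.
                          split.streamModeDecodeFrom(offset);
                          return;
                      }
                      offset = split.decodeBlockAt(headerOffset);
                  }
              }
          }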

          I guess my current problem with approach #2 here is a mismatch in buffering / caching needs between the two approaches. The block approach wants nice big buffers, and the signature search will walk over a significant amount of data before failing. We don't want to read this data twice. For each fail, it's wasteful to load up O(1MB) of data just to feed it into a stream decompressor that is perfectly happy to process the data in 4k hunks. Multiply that 1MB by the number of open compression streams, and this could add up to some pretty significant overhead for no real gain.

          I'm trying to avoid adding to the complexity of the interface with more codecs and formats, but the costs aren't seeming like they are worth the benefits to me at this point.

          I'm inclined to call it a different format as Niels originally suggested, despite the fact that it is read-compatible with .gz's.

          Niels Basjes added a comment -

          Tim,

          I think that both of those questions are hard because of the fact that Hadoop MapReduce does roughly these steps in this order:
          1. Look at the input filename and filesize
          2. Determine codec and if it is splittable
          3. Create the split definitions
          4. Distribute these split definitions over many tasks / nodes on your cluster
          5. Each task opens a file and tries to process the indicated split.

          If you find around 5. that you cannot split correctly, then you cannot go back to 3. to redefine the splits.
          Also I expect that trying the HADOOP-7076 way afterwards is hard: What if the first split succeeds and the second one doesn't (which run on completely separate nodes)?

          Niels

          Tim Broberg added a comment -

          Good points, Niels.

          Given that, your approach of a separate codec / file type makes a lot of sense, and I'm inclined to go that way.

          Still, a few questions to make sure this is the right approach:

          • Is it easy to switch over to the deflate and discard method of HADOOP-7076 if we find the split
            headers aren't supported?
          • If we advertise a codec as splittable and then we fail to split it, is there any way to fallback and just return one great big split that gives the same performance as if we just didn't say we do splitting?
          Niels Basjes added a comment -

          Let me elaborate my thoughts to aid you in constructing a sound design.

          Assume you call it .gz --> then your setup must be capable of handling non-splittable gzip files, splittable gzip files, and files where the chunk size is way off (say 100MB instead of 64K).

          A big part of the problem lies in the way the FileInputSplits are created: without reading the actual input file.
          Have a look at TextInputFormat and its superclass FileInputFormat (both in hadoop-mapreduce-client-core). The only check that is done is whether the codec (selected by filename extension) implements SplittableCompressionCodec. The splits are then created by looking at some config settings ... not the input file.
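
          For reference, the check is roughly the following (paraphrased from memory; see the actual TextInputFormat source in hadoop-mapreduce-client-core for the authoritative version):

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.compress.CompressionCodec;
          import org.apache.hadoop.io.compress.CompressionCodecFactory;
          import org.apache.hadoop.io.compress.SplittableCompressionCodec;
          import org.apache.hadoop.mapreduce.JobContext;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

          // Paraphrase of TextInputFormat's splittability test: the codec is picked by
          // file extension, and the only question asked is whether it implements
          // SplittableCompressionCodec -- the input file itself is never read.
          public class SplittabilityCheck extends TextInputFormat {
              @Override
              protected boolean isSplitable(JobContext context, Path file) {
                  CompressionCodec codec =
                      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
                  if (codec == null) {
                      return true; // uncompressed text is always splittable
                  }
                  return codec instanceof SplittableCompressionCodec;
              }
          }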

          Assume you call it something custom (like the .sgz I mentioned) then you still have to be able to handle the "huge chunks" situation. So if I create a 1000MB file with chunks of 100MB and instruct the config to create 10MB splits ... there will be too many map tasks (running independently on different nodes).

          Or ... you actually define both the extension and the chunk size as requirements.
          To me that would be a new file format based on gzip.

          Just my 2 cents ...

          Tim Broberg added a comment -

          > A "normal" gzipped file won't become splittable with this codec. The way I interpret what I've read so far is that this comes very close to defining a new fileformat: a block compressed deflate file that due to the clever way it is stored can also be read by any existing gzip decompression tool.

          I'd rather say a new option than a new format, but I would agree with everything else here. Splittable streams / files decode normally, normal streams / files would not be rendered splittable under this scheme.

          > 1.how does the Hadoop job system know that it should try to create splits for the input file?

          This is one area where I would very much like feedback. I would propose either A - by selecting a dedicated splittable-gzip codec, or B - by selecting a splittable option in a modified version of the existing codec, much like the compression level option is selected today.

          > 2.should the files use the .gz file extension? Or perhaps something like .sgz (splittable gz) instead?

          I'm torn here as well. Given that we are conforming to the existing gzip spec, and given that a splitting codec can decode non-splittable streams and the non-splitting codec can decode splitting streams, is there any harm in encoding splittable gzip into .gz's?

          Certainly if we take option B above, such that the same codec encodes both formats, we would make the same codec decode both, and I can see no reason at all to have a separate file extension.

          > 3.what will be the advantages over the existing splittable compression options we have now (LZO/Snappy/Bzip2/...)? Why would I as a Hadoop developer/administrator want to choose this codec?

          Snappy

          • is faster to decode and much faster to encode
          • does not compress as thoroughly
          • is not splittable in and of itself (but can be used in Sequence Files and Avro, which are splittable)

          LZO

          • inferior to Snappy in speed
          • comparable to Snappy in compression ratio
          • splittable by means of an indexing scheme which creates a separate file
          • encumbered with some licensing issues

          Bzip2

          • inferior to gzip/deflate in speed
          • superior to gzip/deflate in compression
          • splittable by a bit-by-bit signature search
          • the signature is only 48 bits in length
          • the value for the signature is not random and may appear in natural data

          So, you would choose this codec when

          • you value compression efficiency over speed (compared to LZO and Snappy)
          • you want splittability, but don't want to deal with a Sequence File / Avro
          • you want more speed and/or splitting robustness than Bzip2 can provide

          Note also that this scheme closely follows the Bzip2 splittability model, such that we don't have to implement yet another set of basic splittability classes. The LZO scheme is a radically different beast.

          Having said all this, gzip explicitly does not require the compression algorithm to be Deflate. This scheme is adaptable to allow splitting of any compression format by using a different CM value in the gzip header, although zlib libraries would not decode this format. Supporting other formats would require the gzip decoder to evaluate the CM byte and pass control to the appropriate decompressor rather than just running the whole record, header and all into zlib as we do now.
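
          As a purely hypothetical illustration of that last point (the CM offset and the deflate value 8 come from RFC 1952; everything else here is invented for the sketch):

          final class CompressionMethodDispatch {
              // Hypothetical dispatch on the gzip CM (compression method) byte. Per RFC 1952,
              // CM is the third byte of the member header (ID1, ID2, CM, FLG, ...) and the
              // value 8 means deflate; any other values used for alternative algorithms
              // would have to be defined by this format.
              enum Algorithm { DEFLATE /* , other hypothetical algorithms */ }

              static Algorithm algorithmFor(byte[] gzipHeader) {
                  int cm = gzipHeader[2] & 0xff;
                  if (cm == 8) {
                      return Algorithm.DEFLATE; // hand the whole record to zlib, as today
                  }
                  // A different CM value would route the record to the matching decompressor.
                  throw new IllegalArgumentException("Unsupported CM value: " + cm);
              }
          }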

          Thanks so much for your review, thoughts, and questions!

          • Tim.
          Niels Basjes added a comment -

          If I understand PIG-42 and your description correctly this type of making gzip splittable is based upon working with an input file that has been specially crafted. A "normal" gzipped file won't become splittable with this codec.
          The way I interpret what I've read so far is that this comes very close to defining a new fileformat: a block compressed deflate file that due to the clever way it is stored can also be read by any existing gzip decompression tool.

          So I dare to ask:

          1. how does the Hadoop job system know that it should try to create splits for the input file?
          2. should the files use the .gz file extension? Or perhaps something like .sgz (splittable gz) instead?
          3. what will be the advantages over the existing splittable compression options we have now (LZO/Snappy/Bzip2/...)? Why would I as a Hadoop developer/administrator want to choose this codec?

            People

            • Assignee: Unassigned
            • Reporter: Tim Broberg
            • Votes: 2
            • Watchers: 13


                Time Tracking

                Estimated: 672h
                Remaining: 672h
                Logged: Not Specified
