Pig / PIG-42

Pig should be able to split Gzip files like it can split Bzip files

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: impl
    • Labels:
      None

      Description

      It would be nice to be able to split gzip files the way we can split bzip files. Unfortunately, the gzip format does not give us a sync point for the split.

      The gzip file format supports concatenating gzipped files: when gzipped files are concatenated together, they are treated as a single file. So, to make a gzipped file splittable, we can use an empty compressed file with some salt in the headers as a sync signature, and then make the gzip file splittable by inserting this sync signature between the compressed segments of the file.
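      The concatenation property the description relies on can be checked with a minimal Python sketch (illustrative only, not the patch itself): two segments compressed as separate gzip members, with an empty member standing in for the sync signature, still decompress as one stream.

```python
import gzip

# Two data segments compressed as separate gzip members
# (mtime pinned so the output is deterministic).
part1 = gzip.compress(b"hello ", mtime=0)
part2 = gzip.compress(b"world", mtime=0)

# An empty member stands in for the sync signature; standard
# gzip tools treat the whole concatenation as a single file.
marker = gzip.compress(b"", mtime=0)
combined = part1 + marker + part2

# Multi-member streams decompress transparently.
assert gzip.decompress(combined) == b"hello world"
```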

      1. gzip.patch
        15 kB
        Benjamin Reed

        Issue Links

          Activity

          Benjamin Reed created issue -
          Benjamin Reed added a comment -

          The attached patch implements the method of splitting gzipped files outlined in the issue description. It uses the same hooks as BZip. We need to review it to make sure it terminates properly.

          If the gzipped file is not set up for splits, we fall back to not splitting the file.

          An unsplittable gzipped dataset can be converted to a splittable one with the following Pig Latin:

          a = load 'orig.gz';
          store a into 'splittable.gz';

          Benjamin Reed made changes -
          Field Original Value New Value
          Attachment gzip.patch [ 12370765 ]
          Sam Pullara added a comment -

          Is there any reason you decided not to use the gzip ID instead of empty files? It seems like it would be better if people could generate these files themselves easily, without using Pig at all. Each gzip file will start with "1F 8B 08 08" [1] if you use this mechanism to create them:

          gzip -c test1 test2 > test.gz [2]

          In the few cases where the boundary is wrong, you will get an exception from your gzip stream and can try again at the next boundary.

          [1] http://www.gzip.org/zlib/rfc-gzip.html
          [2] man gzip

          Benjamin Reed added a comment -

          There are two reasons I use an empty file with a comment:

          1) It allows me to test that a gzip file is in fact splittable. We need to know up front that we can split the gzip file; if the gzip isn't split at regular intervals, trying is going to waste a lot of time! The signature is more than a marker: it is metadata indicating that the file can be split. You will also notice that if you run 'head' on the file, you can see that it is splittable.

          2) It gives you a much more reliable signature (20 bytes instead of 4).

          You can still use standard tools without using Pig:

          cat signature.gz > test.gz; gzip -c test1 >> test.gz; cat signature.gz >> test.gz; gzip -c test2 >> test.gz

          You use standard gunzip to decompress. You can also easily find the split boundaries outside of Pig by looking for the signature.gz byte sequence.

          This also allows you to better control the grouping. If your gzip file is bigger than 4G, it will be a concatenation, so there may be times when you want to process concatenated gzip files together without splitting. Using the empty signature file allows you to do that.

          Now that I think about it more, it might also be good to reserve some bytes in signature.gz to hold a block size. That way we can do intelligent splits when the fs blocksize doesn't correspond to the gzip blocksize or the number of requested splits is very high.
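          A rough Python model of the comment-as-signature idea follows. The marker text and exact layout here are illustrative assumptions, not the patch's actual 20-byte signature; the member is built by hand per RFC 1952 because Python's gzip module cannot set the FCOMMENT field.

```python
import gzip
import struct
import zlib

# An empty gzip member carrying a fixed comment acts as a long,
# distinctive split signature (the comment text is a made-up example).
comment = b"pig-split-marker"
header = (b"\x1f\x8b\x08"              # magic bytes + deflate method
          + b"\x10"                    # FLG: FCOMMENT set
          + b"\x00\x00\x00\x00"        # MTIME = 0
          + b"\x00\xff"                # XFL = 0, OS = unknown
          + comment + b"\x00")         # zero-terminated comment field
empty_body = zlib.compress(b"")[2:-4]  # raw deflate of empty input
trailer = struct.pack("<II", zlib.crc32(b""), 0)  # CRC32, ISIZE
signature = header + empty_body + trailer

# The signature is itself a valid, empty gzip member...
assert gzip.decompress(signature) == b""

# ...so concatenations that embed it still decompress with standard tools.
stream = (gzip.compress(b"seg1", mtime=0) + signature
          + gzip.compress(b"seg2", mtime=0))
assert gzip.decompress(stream) == b"seg1seg2"
```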

          Owen O'Malley added a comment -

          It seems a lot more friendly to define the format like:

          % touch empty
          % gzip -nc part0 empty part1 empty part2 empty part3 > big.sgz
          

          That would let the user do:

          % gzcat big.sgz
          

          to get their file back. I'd also use filenames rather than a header to reflect whether a file is in this format, but that is mostly just a personal preference.

          Benjamin Reed added a comment -

          There are two problems with just using an empty file.

          1) The signature is just too small to reliably detect the split. Misdetecting the split isn't as easy to recover from as retrying, because it usually means you get an OutOfMemoryError, or you may have already returned bad data.

          2) You have to revert to relying on an extension to detect splittability. This ends up being pretty hokey, because most gzip utilities are looking for a .gz extension. The splittable gzip format is completely compatible with existing gzip utilities. Also, if a user puts the wrong extension, splits may not happen when they could, or we may try to split files that we cannot.

          Plus, it's really nice to be able to run head on file.gz and see right away whether the file is splittable or not.

          Sam Pullara added a comment -

          Ok, I'm convinced. Ship it!

          Olga Natkovich added a comment -

          Ben, how much testing did this code go through?

          Benjamin Reed added a comment -

          The patch is not ready to commit yet; it's a work-in-progress patch. I talked to Utkarash about this, and it's missing a termination of the split: currently each split will not terminate correctly. There is a termination hook that bzip uses that I need to latch into.

          Basically here are the things I need to add to finish:

          1) Terminate split processing correctly
          2) Add test cases
          3) Encode block size as part of the header so that we can get almost "perfect" splits. (For example a file that is compressed as 128M blocks should not be split on 64M boundaries even if the block size of the filesystem is 128M.)

          I'll try to get a committable patch this weekend.

          Olga Natkovich added a comment -

          Looks like the Hadoop folks might do it soon: https://issues.apache.org/jira/browse/HADOOP-1824

          Olga Natkovich made changes -
          Assignee Benjamin Reed [ breed ]
          Olga Natkovich added a comment -

          Cleared the Patch Available flag since this patch is not yet ready for review.

          Olga Natkovich made changes -
          Patch Info [Patch Available]
          Owen O'Malley made changes -
          Workflow jira [ 12418372 ] no-reopen-closed, patch-avail [ 12425423 ]
          Tom White added a comment -

          It would be nice if the format could be generated using standard tools. By modifying the gzip header flag so that the signature lives in the file name field (which the gzip tool can set) rather than in a comment (which it cannot), we can generate compatible files using the following:

          touch -mt 197007130719.25 Split
          gzip -c Split file1 Split file2 > file.gz
          

          Then the first split file has the following hexdump:

          hexdump -n 26 -C file.gz
          00000000  1f 8b 08 08 6d ca fe 00  00 03 53 70 6c 69 74 00  |....m.....Split.|
          00000010  03 00 00 00 00 00 00 00  00 00                    |..........|
          0000001a
          

          Note that the OS flag is 03 (Unix) rather than FF (unknown), but that should be OK as the code doesn't use it when searching for the signature.
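          This construction can also be reproduced from Python as a sketch. The mtime value below is decoded from the hexdump above (bytes 6d ca fe 00, little-endian); the resulting member carries the "Split" name in its FNAME field, and the OS byte differs (FF rather than 03) exactly as noted, without affecting the signature search.

```python
import gzip
import io

# An empty gzip member whose FNAME field is "Split" and whose mtime is
# fixed, so every marker produced this way has an identical prefix.
buf = io.BytesIO()
with gzip.GzipFile(filename="Split", mode="wb",
                   fileobj=buf, mtime=0x00FECA6D) as f:
    f.write(b"")
marker = buf.getvalue()

# Magic bytes, deflate method, FNAME flag: the "1f 8b 08 08" prefix.
assert marker[:4] == b"\x1f\x8b\x08\x08"
# Fixed mtime, little-endian, matching the hexdump above.
assert marker[4:8] == b"\x6d\xca\xfe\x00"
# Zero-terminated file name embedded in the header.
assert b"Split\x00" in marker
# Still a valid empty gzip member.
assert gzip.decompress(marker) == b""
```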

          Tom White made changes -
          Link This issue relates to HADOOP-5014 [ HADOOP-5014 ]
          David Ciemiewicz added a comment -

          Hadoop Archives are not really the solution here. I want my code to work with exactly the same file name references whether I have 100 gzip-compressed (or bzip2-compressed) part files or a single concatenation of the individually compressed part files.

          I have to change all my filename references to use a har.

          What we really want are simple concatenations of gzip files and bzip2 files that work with MapReduce.

          Olga Natkovich added a comment -

          Pig no longer deals with compression. It is now up to individual loaders/storers to handle it.

          Olga Natkovich made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]
          Greg Roelofs made changes -
          Link This issue relates to MAPREDUCE-1795 [ MAPREDUCE-1795 ]

            People

            • Assignee:
              Benjamin Reed
              Reporter:
              Benjamin Reed
            • Votes:
              1
            • Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:
