There are two reasons I use an empty file with a comment:
1) It allows me to test that a gzip file is in fact splittable. We need to know up front that we can split the gzip file. If the gzip isn't split at regular intervals, it's going to waste a lot of time! The signature is more than a marker: it is metadata indicating that the file can be split. You will also notice that if you run 'head' on the file, you can see that it is splittable.
2) It gives you a much more reliable signature (20 bytes instead of 4).
You can still use standard tools without using Pig:
cat signature.gz > test.gz; gzip -c test1 >> test.gz; cat signature.gz >> test.gz; gzip -c test2 >> test.gz
You can use standard gunzip to decompress the result. You can also easily find the split boundaries outside of Pig by searching for the signature.gz byte sequence.
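To illustrate the boundary search outside of Pig, here is a minimal Python sketch. It assumes the signature is simply an empty gzip member written with a fixed mtime (so its bytes are reproducible); the real signature.gz with its comment would have different bytes. The names SIGNATURE, build_archive and split_offsets are made up for this example.

```python
import gzip

# Hypothetical stand-in for signature.gz: an empty gzip member with mtime=0
# so that the byte sequence is identical every time it is generated.
SIGNATURE = gzip.compress(b"", mtime=0)


def build_archive(chunks):
    """Concatenate SIGNATURE + gzip(chunk) for each chunk, like the shell line above."""
    out = bytearray()
    for chunk in chunks:
        out += SIGNATURE
        out += gzip.compress(chunk, mtime=0)
    return bytes(out)


def split_offsets(data):
    """Return every byte offset at which SIGNATURE occurs in data."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + len(SIGNATURE))
    return offsets


if __name__ == "__main__":
    parts = [b"first\n" * 1000, b"second\n" * 1000]
    archive = build_archive(parts)
    offsets = split_offsets(archive)
    print("split boundaries at:", offsets)
    # Each slice starting at a boundary is itself a valid gzip stream.
    print(len(gzip.decompress(archive[offsets[1]:])), "bytes in the second split")
```

A plain byte search like this could in principle hit a chance match inside compressed data, which is one reason a longer signature is more reliable than a short one.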
This also allows you to better control the grouping. If your gzip file is bigger than 4 GB, it will be a concatenation, so there may be times when you want to process concatenated gzip files together without splitting. Using the empty signature file allows you to do that.
Now that I think about it more, it might also be good to reserve some bytes in the signature.gz to hold a block size. That way we can do intelligent splits when the fs blocksize doesn't correspond to the gzip blocksize or the number of requested splits is very high.
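One possible way to reserve those bytes, sketched in Python: store the block size in the gzip header's optional "extra" field (FEXTRA in RFC 1952), which standard decompressors skip over. This is only an assumption about how it could be done, not a worked-out format; the subfield ID "BS" and the function names are invented for the example.

```python
import gzip
import struct
import zlib

SUBFIELD_ID = b"BS"  # hypothetical subfield ID, not a registered one
FEXTRA = 0x04        # gzip header flag bit: an extra field is present


def make_signature(block_size):
    """Build an empty gzip member whose extra field carries a 32-bit block size."""
    extra = SUBFIELD_ID + struct.pack("<HI", 4, block_size)
    # magic (2 bytes), CM=8 (deflate), flags, mtime=0, XFL=0, OS=255 (unknown), XLEN
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255, len(extra))
    deflater = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate stream
    body = deflater.compress(b"") + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(b""), 0)  # CRC32 and size of empty input
    return header + extra + body + trailer


def read_block_size(signature):
    """Read the block size back out of a signature built by make_signature."""
    if not signature[3] & FEXTRA:
        raise ValueError("signature has no extra field")
    (xlen,) = struct.unpack_from("<H", signature, 10)
    extra = signature[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        (length,) = struct.unpack_from("<H", extra, pos + 2)
        if extra[pos:pos + 2] == SUBFIELD_ID and length == 4:
            return struct.unpack_from("<I", extra, pos + 4)[0]
        pos += 4 + length
    raise ValueError("block size subfield not found")


if __name__ == "__main__":
    sig = make_signature(64 * 1024 * 1024)
    print(read_block_size(sig))   # the block size stored above
    print(gzip.decompress(sig))   # still decompresses to nothing
```

The signature stays a valid empty gzip member, so concatenated files would still decompress with standard tools, but the signature bytes would then differ per block size, so a boundary search would have to match on the fixed part of the header rather than the whole member.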