Pig / PIG-4533

Document error: Pig does support concatenated gz file

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: documentation, parser
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      The documentation (since at least 0.11.1) says:
      http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
      "Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED FILES generated in this manner: ..."

      This is not true for gz:

      • I ran a test: I concatenated and compressed some files and processed them, then did the same with the raw (uncompressed) files. The results were identical.
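
      The comparison in the test above can be sketched outside Pig. The following is a minimal illustration in Python, not the Pig or Hadoop code path: it assumes the concatenated file is built from separately compressed parts (as `cat a.gz b.gz > all.gz` would produce), and the sample records are hypothetical. It only shows that a reader that understands multi-member gzip streams returns the same bytes as the uncompressed input.

```python
import gzip

# Hypothetical sample records; any line-oriented text would do.
part_a = b"1\talice\n2\tbob\n"
part_b = b"3\tcarol\n"

# Compress each part on its own, then join the compressed bytes.
# This mirrors what `cat a.gz b.gz > all.gz` writes to disk.
concatenated_gz = gzip.compress(part_a) + gzip.compress(part_b)

# Python's gzip module reads every member of a multi-member stream.
decoded = gzip.decompress(concatenated_gz)
raw = part_a + part_b

# Compare the decompressed result with the uncompressed input.
print(decoded == raw)
print(decoded.decode().splitlines())
```

      A reader that stopped after the first gzip member would return only the first two records here, which is the failure mode the documentation note warns about.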

      The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
      https://issues.apache.org/jira/i#browse/HADOOP-6835 say that the concatenation problems were fixed in Hadoop 0.22 and Hadoop 0.20, respectively, for both bz2 and gz. In other words, Hadoop (1 and 2) already supports concatenated bz2 and gz archives.

      Pig handles bz2 on its own (for historical reasons), which duplicates functionality in hadoop-common. This work should therefore be left to hadoop-common; Pig no longer needs to handle it.

      The documentation needs to be updated accordingly: concatenated gz and bz2 files are processed correctly by hadoop-common. A remark that tar.gz and tar.bz2 are not supported would also be helpful, since many users reach for tar.gz or tar.bz2 by default.
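
      To illustrate why tar.gz is a different case from plain gz, here is a small Python sketch. It is an assumption-laden illustration of the file formats only, not of how Pig or Hadoop read input. The member name `data.txt` and the records are hypothetical. Removing only the gzip layer, as a compression codec would, leaves a tar stream whose headers and padding surround the actual records.

```python
import gzip
import io
import tarfile

# Hypothetical records that a line-oriented loader would expect to see.
payload = b"1\talice\n2\tbob\n"

# Build a small tar.gz in memory with one hypothetical member, data.txt.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="data.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Undo only the gzip layer; what remains is a raw tar stream.
tar_stream = gzip.decompress(buffer.getvalue())

# The records are present, but wrapped in tar headers and block padding.
print(payload in tar_stream)
print(tar_stream == payload)
print(len(tar_stream) % 512 == 0)  # tar streams consist of 512-byte blocks
```

      A text loader reading this stream line by line would see tar header bytes and NUL padding mixed in with the records, which is why a plain .gz of the data is the safer input format.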

      Attachments

        1. PIG-4533-1.patch
          1 kB
          Daniel Dai

            People

              Assignee: Daniel Dai (daijy)
              Reporter: Tomas Hudik (xhudik)
              Votes: 0
              Watchers: 3