[PIG-4533] Document error: Pig does support concatenated gz file - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.16.0
Component/s: documentation, parser
Labels:
None

Hadoop Flags:

Reviewed

Description

The documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
"Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED FILES generated in this manner: ..."

This is not true for gz.

I ran a test: I concatenated and compressed some files and processed them, then did the same with the raw (uncompressed) files. The results were identical.
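The property being tested can be illustrated outside Pig. The following is a minimal sketch, assuming only Python's standard `gzip` module; it does not exercise Pig or Hadoop code, and the helper name `concat_gzip` and the sample rows are hypothetical. It shows that a file built by joining independently compressed gzip members decodes to the concatenation of the original inputs, which is what a reader that supports concatenated gz files is expected to return.

```python
import gzip


def concat_gzip(parts):
    """Compress each chunk on its own and join the members,
    which mimics `cat a.gz b.gz > all.gz` (hypothetical helper)."""
    return b"".join(gzip.compress(p) for p in parts)


# Two small tab-separated inputs (made-up sample data).
parts = [b"a\t1\nb\t2\n", b"c\t3\n"]

concatenated = concat_gzip(parts)

# gzip.decompress reads every member of a multi-member stream.
decoded = gzip.decompress(concatenated)
lines = decoded.splitlines()
```

Comparing `decoded` with the joined raw inputs is the same check as the test described above, only at the level of the file format rather than a Pig job.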

The JIRA issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
https://issues.apache.org/jira/i#browse/HADOOP-6835 say that the concatenation problems were fixed in Hadoop 0.22 and Hadoop 0.20, respectively, for both bz2 and gz. In other words, Hadoop 1 and 2 already support concatenated bz2 and gz archives.

Pig handles bz2 on its own (for historical reasons), which duplicates functionality in hadoop-common. This work should therefore be left to hadoop-common; Pig no longer needs to handle it.

The documentation needs to be updated accordingly (concatenated gz and bz2 files are processed correctly by hadoop-common). A remark that tar.gz and tar.bz2 are not supported would also be helpful, since many users reach for tar.gz or tar.bz2 by default.
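One likely reason tar.gz differs from plain gz is that decompressing it yields a tar archive, not the text itself. The sketch below is an illustration under stated assumptions, using only Python's standard `tarfile`, `gzip`, and `io` modules; the file name `data.txt` and its contents are made up, and no Pig or Hadoop code is involved. It shows that the decompressed stream begins with a tar header block and is padded, so a line-oriented reader would see header bytes mixed in with the records.

```python
import gzip
import io
import tarfile

# Made-up record data and file name, for illustration only.
payload = b"a\t1\nb\t2\n"

# Build a small tar.gz archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="data.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Removing only the gzip layer leaves the tar container in place.
raw = gzip.decompress(buf.getvalue())

starts_with_header = raw.startswith(b"data.txt")  # tar header, not a record
is_plain_text = raw == payload                    # False: extra tar framing
```

The records are still present inside `raw`, but they are surrounded by tar metadata and padding, which a plain text loader has no way to skip.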

Attachments

PIG-4533-1.patch
12/Jun/15 18:06
1 kB
Daniel Dai

Issue Links

relates to

PIG-3251 Bzip2TextInputFormat requires double the memory of maximum record size

Closed

People

Assignee:: Daniel Dai

Reporter:: Tomas Hudik

Votes:: 0

Watchers:: 3

Dates

Created:: 06/May/15 10:16

Updated:: 08/Jun/16 20:48

Resolved:: 16/Jun/15 21:29