Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Won't Fix
-
0.11
-
None
-
None
-
None
-
Linux CentOS 6.4; CDH-4.x and CDH-5.0 (beta)
Description
When pig processes a large gzip file with text or mixed text and binary content, it throws a NullPointerException if the property texinputformat.record.delimiter is set to '\n'. This is because pig interprets the specified delimiter as a two character string "\" followed by "n" and not as a new line character.
If this property is not set, same file unzips without problems, but the diff output of file unzipped using pig and unzipped using the gunzip command differs.
Steps to recreate:
1. create a text file that is ~ 4GB - I concatanated some pig/hadoop stdout and syslog files to create this file about 4GB in size.
2. compress it on unix command line - Ex. gzip abc
3. upload to hdfs (optional)
4. run the pig script included below to read/write the file.
pig --param job_name="gunzip abc" --param inputfile="abc.gz" --param outputdir=./test --param outputfile=abc gunzip.pig
Here are the contents of gunzip.pig:
set job.name '$job_name'
set textinputformat.record.delimiter "\n";
gzdata = LOAD '$inputfile' USING PigStorage();
STORE gzdata INTO '$outputdir/$outputfile' USING PigStorage();
This will cause the NullPointerException.
If the second line (set textinputformat.record.delimiter field) is commented out, the Exception won't occur but the output is not the same as the one produced by gunzip.