Pig / PIG-3681

NullPointerException while processing files in gzip format


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Linux CentOS 6.4; CDH-4.x and CDH-5.0 (beta)

    Description

      When Pig processes a large gzip file with text or mixed text and binary content, it throws a NullPointerException if the property textinputformat.record.delimiter is set to '\n'. This is because Pig interprets the specified delimiter as the two-character string "\" followed by "n", not as a newline character.

      If this property is not set, the same file decompresses without error, but a diff shows that the output produced by Pig differs from the output of the gunzip command.
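
      The distinction between the two interpretations can be seen outside Pig. The following is a minimal shell sketch that only compares byte counts; it does not reproduce Pig's own parsing, and the variable names are illustrative.

```shell
# A literal backslash followed by 'n' is two bytes (0x5c 0x6e).
literal_len=$(printf '%s' '\n' | wc -c | tr -d '[:space:]')

# An actual newline character is a single byte (0x0a).
newline_len=$(printf '\n' | wc -c | tr -d '[:space:]')

echo "literal: $literal_len byte(s), newline: $newline_len byte(s)"
# prints: literal: 2 byte(s), newline: 1 byte(s)
```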

      Steps to recreate:

      1. Create a text file of about 4 GB. I concatenated some Pig/Hadoop stdout and syslog files to produce it.
      2. Compress it on the Unix command line, e.g. gzip abc
      3. Upload it to HDFS (optional).
      4. Run the Pig script included below to read and write the file.
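
      Steps 1 and 2 can be approximated at small scale as follows. This is only a sketch: the line count, the log-like content, and the temporary directory are assumptions, and a file this small may well not trigger the problem, since the report used a file of about 4 GB.

```shell
# Work in a throwaway directory (hypothetical location).
workdir=$(mktemp -d)

# Step 1: build a text file from repeated log-like lines.
# 1000 lines is an arbitrary illustrative size, far below the ~4 GB in the report.
i=0
while [ "$i" -lt 1000 ]; do
  echo "2014-01-01 00:00:00 INFO sample log line $i"
  i=$((i + 1))
done > "$workdir/abc"

# Step 2: compress it, keeping the original for later comparison.
gzip -c "$workdir/abc" > "$workdir/abc.gz"

ls -l "$workdir"
```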

      pig --param job_name="gunzip abc" --param inputfile="abc.gz" --param outputdir=./test --param outputfile=abc gunzip.pig

      Here are the contents of gunzip.pig:
      set job.name '$job_name'

      set textinputformat.record.delimiter "\n";

      gzdata = LOAD '$inputfile' USING PigStorage();

      STORE gzdata INTO '$outputdir/$outputfile' USING PigStorage();

      This will cause the NullPointerException.

      If the second statement (the set textinputformat.record.delimiter line) is commented out, the exception does not occur, but the output is still not identical to the output produced by gunzip.
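
      One way to check whether a decompressed copy matches what gunzip would produce is a byte-for-byte comparison. The sketch below uses a hypothetical helper, matches_gunzip, and a small temporary file in place of the real data.

```shell
# Hypothetical helper: succeeds only if the candidate file is byte-identical
# to what gzip itself decompresses from the archive.
matches_gunzip() {
  gz_file=$1
  candidate=$2
  gzip -dc "$gz_file" | cmp -s - "$candidate"
}

# Self-contained demonstration with a small temporary file.
tmp=$(mktemp -d)
printf 'line one\nline two\n' > "$tmp/abc"
gzip -c "$tmp/abc" > "$tmp/abc.gz"

if matches_gunzip "$tmp/abc.gz" "$tmp/abc"; then
  echo "identical"
else
  echo "differs"
fi
```

      To apply this to the Pig output, the part files under the output directory would first need to be merged into one local file (for example with hadoop fs -getmerge) and passed as the second argument.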


          People

            Assignee: Unassigned
            Reporter: Kumar Ravi (kumarr)