Pig / PIG-570

Large BZip files seem to lose data in Pig

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.0.0, 0.1.0, 0.2.0, site
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None
    • Environment: Pig 0.1.1 / Linux / 8 nodes, Hadoop 0.18.2

    Description

      I don't believe bzip2 input to Pig is working, at least not with large files. It seems as though map files are getting cut off: the maps complete far too quickly, and the row of data that Pig tries to process is often cut at a random point and left incomplete. Here are my symptoms:

      • Maps seem to be completing at an unbelievably fast rate

      With uncompressed data
      Status: Succeeded
      Started at: Wed Dec 17 21:31:10 EST 2008
      Finished at: Wed Dec 17 22:42:09 EST 2008
      Finished in: 1hrs, 10mins, 59sec
      Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
      map     100.00%     4670       0        0        4670      0       0 / 21
      reduce  57.72%      13         0        0        13        0       0 / 4

      With bzip compressed data

      Started at: Wed Dec 17 21:17:28 EST 2008
      Failed at: Wed Dec 17 21:17:52 EST 2008
      Failed in: 24sec
      Black-listed TaskTrackers: 2
      Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
      map     100.00%     183        0        0        15        168     54 / 22
      reduce  100.00%     13         0        0        0         13      0 / 0
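
      One way to check, independently of Pig, whether the compressed file itself still holds every record is to compare its decompressed line count with that of the uncompressed original. The following is only a minimal sketch in Python, assuming newline-delimited records; the function names and the sample data are hypothetical and are not part of Pig or of the attached patch.

```python
import bz2
import io


def count_records(stream):
    """Count newline-delimited records in a binary file-like object."""
    return sum(1 for _ in stream)


def compare_counts(plain_stream, bzip_stream):
    """Return (plain_count, bzip_count) for the same data in both forms."""
    # bz2.open accepts either a path or an already-open binary file object.
    with bz2.open(bzip_stream, "rb") as unpacked:
        return count_records(plain_stream), count_records(unpacked)


if __name__ == "__main__":
    # Hypothetical sample rows standing in for the real input file.
    sample = b"".join(b"rec A\tCSGN\tVTX\t%d\n" % i for i in range(1000))
    plain_count, bzip_count = compare_counts(
        io.BytesIO(sample), io.BytesIO(bz2.compress(sample))
    )
    print(plain_count, bzip_count)
```

      For real data, the two arguments would be the uncompressed file and the .bz2 file opened in binary mode. If both counts agree, the file is probably intact and the loss is more likely to happen when the job reads it.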

      The errors we get:
      java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, 0HAW, CHIX, )
      at org.apache.pig.data.Tuple.getField(Tuple.java:176)
      at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
      at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
      at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
      at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
      at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
      at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
      attempt_200812161759_0045_m_000007_0 task_200812161759_0045_m_000007 tsdhb06.factset.com FAILED
      java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
      at org.apache.pig.data.Tuple.getField(Tuple.java:176)
      at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
      at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
      at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
      at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
      at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
      at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
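
      Both traces request index 11 from a tuple that holds 10 or fewer fields, which is what a row cut off partway through would look like. A rough way to see whether such short rows exist in the decompressed data itself is sketched below in Python. The tab delimiter, the zero-based reading of "index 11" (so 12 fields expected), and the helper name short_rows are assumptions made for illustration, not details taken from the job.

```python
import bz2
import io

# Assumption: index 11 is zero-based, so a complete row has at least 12 fields.
EXPECTED_FIELDS = 12


def short_rows(stream, delimiter=b"\t", expected=EXPECTED_FIELDS):
    """Yield (line_number, field_count) for rows with fewer than `expected` fields."""
    for number, line in enumerate(stream, start=1):
        fields = line.rstrip(b"\r\n").split(delimiter)
        if len(fields) < expected:
            yield number, len(fields)


if __name__ == "__main__":
    # Hypothetical data: one complete row followed by one truncated row.
    complete = b"\t".join([b"x"] * 12) + b"\n"
    truncated = b"\t".join([b"x"] * 10) + b"\n"
    packed = bz2.compress(complete + truncated)
    with bz2.open(io.BytesIO(packed), "rb") as rows:
        for number, count in short_rows(rows):
            print(f"line {number}: only {count} fields")
```

      If the decompressed file shows no short rows but the job still fails, the rows are presumably being cut while the job reads the input rather than in the file.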

      Attachments

        1. bzipTest.bz2
          539 kB
          Benjamin Reed
        2. PIG-570.patch
          6 kB
          Benjamin Reed

          People

            Assignee: Benjamin Reed (breed)
            Reporter: Alex Newman (posix4e)