Pig / PIG-3480

TFile-based tmpfile compression crashes in some cases

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.1
    • Component/s: None
    • Labels: None

      Description

      When Pig tmpfile compression is on, some jobs fail inside core Hadoop internals.
      TFile is the suspected culprit, because an experiment replacing TFile with SequenceFile succeeded.
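      For context, the behavior in question is controlled by the pig.tmpfilecompression and pig.tmpfilecompression.codec properties discussed in the comments below. A minimal sketch of enabling it programmatically, assuming a standard PigServer setup (the script name is a placeholder):

      import java.util.Properties;
      import org.apache.pig.ExecType;
      import org.apache.pig.PigServer;

      // Sketch only: enables the tmpfile compression path described above.
      // The two pig.tmpfilecompression* properties are the ones from this
      // issue; the script name is illustrative.
      public class TmpfileCompressionRepro {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              props.setProperty("pig.tmpfilecompression", "true");
              props.setProperty("pig.tmpfilecompression.codec", "lzo"); // or "gz"
              PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
              pig.registerScript("script_with_complex_types.pig");      // placeholder
          }
      }

      Setting pig.tmpfilecompression to false is the workaround noted later in the thread.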

      Attachments

      1. PIG-3480-7.patch
        41 kB
        Aniket Mokashi
      2. PIG-3480-6.patch
        41 kB
        Aniket Mokashi
      3. PIG-3480-5.patch
        39 kB
        Aniket Mokashi
      4. PIG-3480-4.patch
        34 kB
        Aniket Mokashi
      5. PIG-3480-3.patch
        34 kB
        Aniket Mokashi
      6. PIG-3480-2.patch
        33 kB
        Aniket Mokashi
      7. PIG-3480.patch
        10 kB
        Dmitriy V. Ryaboy

          Activity

          Dmitriy V. Ryaboy created issue -
          Dmitriy V. Ryaboy added a comment - edited

          For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with "nonzero status 134").

          I did catch one task with a stack trace:

          java.io.IOException: Error while reading compressed data
              at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205)
              at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342)
              at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373)
              at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357)
              at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389)
              at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
              at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420)
              at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
              at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
              at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548)
              at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180)
              at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
              at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
              at org.apache.hadoop.mapred.MapTask.run(Map

          No idea if this is relevant.

          This problem does happen consistently – 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet.

          Dmitriy V. Ryaboy added a comment -

          Attaching a rough patch which replaces the use of TFile with SequenceFile (see the sketch after this comment for the general idea).

          Next steps:

          • evaluate effect on size of compressed data for TFile vs SeqFile when TFile does work
          • add tests, make TFile tests pass (in this file they fail, because of course TFile is not being used)
          • make SeqFile the default method, since it doesn't break
          • allow TFile use by a switch, since current users may want to keep it. I would prefer to not do that, but might if the first step shows significant differences.

          Thoughts?
          Especially from folks using TFile-based compression in production (Rohini Palaniswamy?)
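
          A minimal sketch of the SequenceFile idea, assuming the Hadoop 2 option-based writer API and treating serialized tuples as opaque bytes (the class name, path, and codec choice are illustrative; the attached PIG-3480 patches are the authority):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.BytesWritable;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.compress.GzipCodec;

          // Sketch only: a block-compressed SequenceFile holding serialized
          // tuples as opaque bytes, keyed by empty placeholder keys.
          public class SeqFileTmpStorageSketch {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  Path tmp = new Path("/tmp/pig-intermediate.seq"); // placeholder path
                  try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                          SequenceFile.Writer.file(tmp),
                          SequenceFile.Writer.keyClass(BytesWritable.class),
                          SequenceFile.Writer.valueClass(BytesWritable.class),
                          SequenceFile.Writer.compression(
                                  SequenceFile.CompressionType.BLOCK, new GzipCodec()))) {
                      byte[] tuple = "serialized tuple bytes".getBytes("UTF-8");
                      writer.append(new BytesWritable(new byte[0]), new BytesWritable(tuple));
                  }
              }
          }

          Other codecs (Snappy, LZO) could be swapped in via the same compression(...) option, which is the extra flexibility mentioned later in this thread.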

          Dmitriy V. Ryaboy made changes -
          Attachment PIG-3480.patch [ 12604855 ]
          Koji Noguchi added a comment -

          Dmitriy, isn't your stacktrace failing at mapred.IFile and not TFile?

          > This problem does happen consistently – 100% of the time on my script that shows this problem.
          >
          And this problem goes away once tmpcompression is turned off?
          (pig.tmpfilecompression=false)

          Dmitriy V. Ryaboy added a comment -

          Koji Noguchi, yeah – I'm not sure the stack trace is relevant; it's the only part of this that isn't consistent.

          The problem goes away when I set pig.tmpfilecompression to false, or when I replace TFile with SequenceFile.
          I've also seen stack traces that were inside TFile, having to do with some LZO decoding issues. The actual error is really hard to capture, beyond the fact that mappers fail consistently.

          Rohini Palaniswamy added a comment -

          Dmitriy V. Ryaboy,
          We have been running with pig.tmpfilecompression=true and pig.tmpfilecompression.codec=lzo as defaults since 2010 and have not encountered any issues so far. The problem might be elsewhere, and TFile might not be the issue.

          Dmitriy V. Ryaboy added a comment -

          Rohini, I suspect this might be something about complex data types, which AFAIK are pretty rare at Y! and extremely common at Twitter.

          Dmitriy V. Ryaboy added a comment -

          Rohini, do you guys use lzo or gz compression? Maybe it's just lzo that's breaking. I can test gz. That never actually occurred to me; I had just assumed this was completely busted because I could never get it to work (since 2010...).

          Olga Natkovich added a comment -

          Could this be related to Hadoop version?

          Rohini Palaniswamy added a comment -

          We do have complex types like bags of maps and bags of bags, with one or two levels of nesting, though I assume you have far more nesting than we do. Does that matter, though, given that what is written to TFile is just bytes for both key and value (see the sketch below)?

          We use lzo. It would be good to try gz and see if the problem is with lzo for you.

          2013-09-24 21:10:21,289 INFO [main] com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
          2013-09-24 21:10:21,291 INFO [main] com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library
          2013-09-24 21:10:21,293 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.lzo_deflate]

          I don't think the Hadoop version should matter, as we ran Hadoop 1.x until mid-2012.
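
          To make the byte[] point concrete, here is a minimal sketch of appending raw key/value bytes to a TFile via org.apache.hadoop.io.file.tfile (the path, block size, and empty key are illustrative assumptions, not Pig's actual writer code):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.file.tfile.TFile;

          // Sketch only: TFile stores keys and values as raw byte[], so the
          // tuple serialization format is opaque to it. "lzo" requires the
          // native-lzo library shown loading in the log lines above.
          public class TFileAppendSketch {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  FileSystem fs = FileSystem.get(conf);
                  Path tmp = new Path("/tmp/pig-intermediate.tfile"); // placeholder path
                  try (FSDataOutputStream out = fs.create(tmp);
                       TFile.Writer writer =
                               new TFile.Writer(out, 256 * 1024, "lzo", null, conf)) {
                      byte[] key = new byte[0];                        // placeholder key
                      byte[] value = "serialized tuple bytes".getBytes("UTF-8");
                      writer.append(key, value);
                  }
              }
          }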

          Aniket Mokashi added a comment -

          > evaluate effect on size of compressed data for TFile vs SeqFile when TFile does work

          https://issues.apache.org/jira/browse/HADOOP-3315?focusedCommentId=12631905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12631905 has some benchmark details for SequenceFile vs TFile.

          > add tests, make TFile tests pass (in this file they fail, because of course TFile is not being used)

          I will submit a patch for this.

          > make SeqFile the default method, since it doesn't break

          +1 for this, as the effect on size is not substantially worse.

          > allow TFile use by a switch, since current users may want to keep it. I would prefer to not do that, but might if the first step shows significant differences.

          Rohini Palaniswamy, what are your thoughts on this?

          Rohini Palaniswamy added a comment -

          Aniket Mokashi,
          Would prefer having TFile as the default and SequenceFile as an option. We have not had any issues with it for years, and it has been the default for others as well. Also, the performance numbers for it are better by 10-40% with compression, according to the HADOOP-3315 benchmarks you referred to.

          It would also be good, if possible, to investigate the actual cause of the failure with TFile, to see whether something is not being done right in Pig, since TFile just writes byte[] for keys and values.

          Olga Natkovich added a comment -

          Agree with Rohini. Changing the default just because we found a bug does not seem like a sound approach.

          Dmitriy V. Ryaboy added a comment -

          That is fine with me; let's make SequenceFile optional. It will let people avoid the bug I am encountering, and also do things like use Snappy compression.
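
          For illustration, an opt-in switch could be read along these lines; the property name below is hypothetical, and the committed patch is the authority for the actual name and default:

          import java.util.Properties;

          // Sketch only: "pig.tmpfilecompression.storage" is a hypothetical
          // property name for choosing the tmpfile format; TFile stays default.
          public class TmpStorageSwitchSketch {
              static boolean useSequenceFile(Properties props) {
                  String storage = props.getProperty("pig.tmpfilecompression.storage", "tfile");
                  return "seqfile".equalsIgnoreCase(storage);
              }
          }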

          Daniel Dai made changes -
          Fix Version/s 0.12.1 [ 12324970 ]
          Fix Version/s 0.12.0 [ 12323380 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-2.patch [ 12607474 ]
          Aniket Mokashi made changes -
          Assignee Dmitriy V. Ryaboy [ dvryaboy ]
          Aniket Mokashi made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-3.patch [ 12607478 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-3.patch [ 12607478 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-3.patch [ 12607480 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-4.patch [ 12607489 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-5.patch [ 12608073 ]
          Aniket Mokashi added a comment -

          https://reviews.apache.org/r/14552/
          Aniket Mokashi made changes -
          Attachment PIG-3480-6.patch [ 12608120 ]
          Aniket Mokashi made changes -
          Attachment PIG-3480-7.patch [ 12608235 ]
          Aniket Mokashi added a comment -

          Committed to trunk and 0.12 branch.
          Thanks Dmitriy V. Ryaboy and Julien Le Dem!

          Aniket Mokashi made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Daniel Dai made changes -
          Link This issue breaks PIG-3530 [ PIG-3530 ]
          Prashant Kommireddi made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee: Dmitriy V. Ryaboy
            • Reporter: Dmitriy V. Ryaboy
            • Votes: 1
            • Watchers: 5
