Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-11730

Regression: s3n read failure recovery broken

    XMLWordPrintableJSON

Details

    Description

      s3n attempts to read again when it encounters IOException during read. But the current logic does not reopen the connection, thus, it ends up with no-op, and committing the wrong(truncated) output.

      Here's a stack trace as an example.

      2015-03-13 20:17:24,835 [TezChild] INFO org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor - Starting output org.apache.tez.mapreduce.output.MROutput@52008dbd to vertex scope-12
      2015-03-13 20:17:24,866 [TezChild] DEBUG org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream - Released HttpMethod as its response data stream threw an exception
      org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 296587138; received: 155648
      at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:184)
      at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
      at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78)
      at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:146)
      at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:145)
      at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
      at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
      at java.io.DataInputStream.read(DataInputStream.java:100)
      at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
      at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
      at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
      at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:185)
      at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:259)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
      at org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116)
      at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:106)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:246)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNextTuple(POFilter.java:91)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
      at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:117)
      at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:313)
      at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:192)
      at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      2015-03-13 20:17:24,867 [TezChild] INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem - Received IOException while reading 'user/hadoop/tsato/readlarge/input/cloudian-s3.log.20141119', attempting to reopen.
      2015-03-13 20:17:24,867 [TezChild] DEBUG org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream - Released HttpMethod as its response data stream is fully consumed
      2015-03-13 20:17:24,868 [TezChild] INFO org.apache.tez.dag.app.TaskAttemptListenerImpTezDag - Commit go/no-go request from attempt_1426245338920_0001_1_00_000004_0
      2015-03-13 20:17:24,868 [TezChild] INFO org.apache.tez.dag.app.dag.impl.TaskImpl - attempt_1426245338920_0001_1_00_000004_0 given a go for committing the task output.

      It seems this is a regression, which was introduced by the following optimizations.

      https://issues.apache.org/jira/browse/HADOOP-10589
      https://issues.apache.org/jira/browse/HADOOP-10457

      Also, test cases should be reviewed so that it covers this scenario.

      Attachments

        1. HADOOP-11730-branch-2.6.0.001.patch
          5 kB
          Tadayoshi Sato
        2. HADOOP-11730-002.patch
          6 kB
          Steve Loughran

        Issue Links

          Activity

            People

              tsato Tadayoshi Sato
              tsato Tadayoshi Sato
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: