Hadoop Common / HADOOP-11959

WASB should configure client side socket timeout in storage client blob request options

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: tools
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      On clusters/jobs where mapred.task.timeout is set to a large value, we noticed that tasks can sometimes get stuck with the stack below.

      Thread 1: (state = IN_NATIVE)
      - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Interpreted frame)
      - java.net.SocketInputStream.read(byte[], int, int, int) @bci=87, line=152 (Interpreted frame)
      - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=122 (Interpreted frame)
      - java.io.BufferedInputStream.fill() @bci=175, line=235 (Interpreted frame)
      - java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=275 (Interpreted frame)
      - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
      - sun.net.www.MeteredStream.read(byte[], int, int) @bci=16, line=134 (Interpreted frame)
      - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Interpreted frame)
      - sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(byte[], int, int) @bci=4, line=3053 (Interpreted frame)
      - com.microsoft.azure.storage.core.NetworkInputStream.read(byte[], int, int) @bci=7, line=49 (Interpreted frame)
      - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, com.microsoft.azure.storage.blob.CloudBlob, com.microsoft.azure.storage.blob.CloudBlobClient, com.microsoft.azure.storage.OperationContext, java.lang.Integer) @bci=204, line=1691 (Interpreted frame)
      - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, java.lang.Object, java.lang.Object, com.microsoft.azure.storage.OperationContext, java.lang.Object) @bci=17, line=1613 (Interpreted frame)
      - com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(java.lang.Object, java.lang.Object, com.microsoft.azure.storage.core.StorageRequest, com.microsoft.azure.storage.RetryPolicyFactory, com.microsoft.azure.storage.OperationContext) @bci=352, line=148 (Interpreted frame)
      - com.microsoft.azure.storage.blob.CloudBlob.downloadRangeInternal(long, java.lang.Long, byte[], int, com.microsoft.azure.storage.AccessCondition, com.microsoft.azure.storage.blob.BlobRequestOptions, com.microsoft.azure.storage.OperationContext) @bci=131, line=1468 (Interpreted frame)
      - com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(int) @bci=31, line=255 (Interpreted frame)
      - com.microsoft.azure.storage.blob.BlobInputStream.readInternal(byte[], int, int) @bci=52, line=448 (Interpreted frame)
      - com.microsoft.azure.storage.blob.BlobInputStream.read(byte[], int, int) @bci=28, line=420 (Interpreted frame)
      - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
      - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
      - java.io.DataInputStream.read(byte[], int, int) @bci=7, line=149 (Interpreted frame)
      - org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(byte[], int, int) @bci=10, line=734 (Interpreted frame)
      - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
      - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
      - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)
      - org.apache.hadoop.util.LineReader.fillBuffer(java.io.InputStream, byte[], boolean) @bci=2, line=180 (Interpreted frame)
      - org.apache.hadoop.util.LineReader.readDefaultLine(org.apache.hadoop.io.Text, int, int) @bci=64, line=216 (Compiled frame)
      - org.apache.hadoop.util.LineReader.readLine(org.apache.hadoop.io.Text, int, int) @bci=19, line=174 (Interpreted frame)
      - org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue() @bci=108, line=185 (Interpreted frame)
      - org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue() @bci=13, line=553 (Interpreted frame)
      - org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue() @bci=4, line=80 (Interpreted frame)
      - org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue() @bci=4, line=91 (Interpreted frame)
      - org.apache.hadoop.mapreduce.Mapper.run(org.apache.hadoop.mapreduce.Mapper$Context) @bci=6, line=144 (Interpreted frame)
      - org.apache.hadoop.mapred.MapTask.runNewMapper(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapreduce.split.JobSplit$TaskSplitIndex, org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter) @bci=228, line=784 (Interpreted frame)
      - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=148, line=341 (Interpreted frame)
      - org.apache.hadoop.mapred.YarnChild$2.run() @bci=29, line=163 (Interpreted frame)
      - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
      - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted frame)
      - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1628 (Interpreted frame)
      - org.apache.hadoop.mapred.YarnChild.main(java.lang.String[]) @bci=514, line=158 (Interpreted frame)

      The issue is that the storage client does not, by default, set a socket timeout on its HTTP connections, so in some (rare) circumstances a read can block indefinitely (e.g. when the server on the other side dies unexpectedly).
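
      For context, the JDK's HTTP connections ship with a read timeout of 0, which means "block forever". A minimal sketch of the difference a client side socket timeout makes (not from the patch; the endpoint and the 30s value are purely illustrative):

      import java.io.InputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;

      public class ReadTimeoutSketch {
          public static void main(String[] args) throws Exception {
              // Hypothetical blob endpoint, for illustration only.
              URL url = new URL("https://example.blob.core.windows.net/container/blob");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();

              // Prints 0: the default read timeout, i.e. wait forever. This is
              // the state the storage client leaves its sockets in unless told
              // otherwise.
              System.out.println("read timeout = " + conn.getReadTimeout());

              // A client side socket timeout bounds the wait instead.
              conn.setReadTimeout(30_000);

              try (InputStream in = conn.getInputStream()) {
                  byte[] buf = new byte[8192];
                  // Without a read timeout, read() never returns if the server
                  // dies without closing the socket; with one, it throws
                  // java.net.SocketTimeoutException after 30s.
                  while (in.read(buf) != -1) {
                      // drain the response
                  }
              }
          }
      }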

      The fix is to configure the maximum operation time on the storage client request options.
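
      A minimal sketch of that approach, assuming a version of the Azure storage Java SDK that exposes RequestOptions.setMaximumExecutionTimeInMs (the committed patch may wire this differently and choose a different default):

      import com.microsoft.azure.storage.blob.BlobRequestOptions;
      import com.microsoft.azure.storage.blob.CloudBlobClient;

      public final class StorageTimeoutConfig {
          // Illustrative value, not necessarily what the patch uses.
          private static final int MAX_OPERATION_TIME_MS = 30 * 1000;

          static void configure(CloudBlobClient client) {
              BlobRequestOptions options = new BlobRequestOptions();
              // Bound the end-to-end time of each blob operation so a server
              // that dies mid-response cannot hang the task indefinitely.
              options.setMaximumExecutionTimeInMs(MAX_OPERATION_TIME_MS);
              // Make it the client-wide default so every request inherits it.
              client.setDefaultRequestOptions(options);
          }
      }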

        Attachments

        1. HADOOP-11959.2.patch (8 kB, Ivan Mitic)
        2. HADOOP-11959.patch (0.4 kB, Ivan Mitic)

        People

        • Assignee: Ivan Mitic (ivanmi)
        • Reporter: Ivan Mitic (ivanmi)
        • Votes: 0
        • Watchers: 3
