Hadoop Common / HADOOP-11959

WASB should configure client side socket timeout in storage client blob request options

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: tools
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      On clusters/jobs where mapred.task.timeout is set to a large value, we noticed that tasks can sometimes get stuck with the stack below.

      Thread 1: (state = IN_NATIVE)
      - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Interpreted frame)
      - java.net.SocketInputStream.read(byte[], int, int, int) @bci=87, line=152 (Interpreted frame)
      - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=122 (Interpreted frame)
      - java.io.BufferedInputStream.fill() @bci=175, line=235 (Interpreted frame)
      - java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=275 (Interpreted frame)
      - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
      - sun.net.www.MeteredStream.read(byte[], int, int) @bci=16, line=134 (Interpreted frame)
      - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Interpreted frame)
      - sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(byte[], int, int) @bci=4, line=3053 (Interpreted frame)
      - com.microsoft.azure.storage.core.NetworkInputStream.read(byte[], int, int) @bci=7, line=49 (Interpreted frame)
      - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, com.microsoft.azure.storage.blob.CloudBlob, com.microsoft.azure.storage.blob.CloudBlobClient, com.microsoft.azure.storage.OperationContext, java.lang.Integer) @bci=204, line=1691 (Interpreted frame)
      - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, java.lang.Object, java.lang.Object, com.microsoft.azure.storage.OperationContext, java.lang.Object) @bci=17, line=1613 (Interpreted frame)
      - com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(java.lang.Object, java.lang.Object, com.microsoft.azure.storage.core.StorageRequest, com.microsoft.azure.storage.RetryPolicyFactory, com.microsoft.azure.storage.OperationContext) @bci=352, line=148 (Interpreted frame)
      - com.microsoft.azure.storage.blob.CloudBlob.downloadRangeInternal(long, java.lang.Long, byte[], int, com.microsoft.azure.storage.AccessCondition, com.microsoft.azure.storage.blob.BlobRequestOptions, com.microsoft.azure.storage.OperationContext) @bci=131, line=1468 (Interpreted frame)
      - com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(int) @bci=31, line=255 (Interpreted frame)
      - com.microsoft.azure.storage.blob.BlobInputStream.readInternal(byte[], int, int) @bci=52, line=448 (Interpreted frame)
      - com.microsoft.azure.storage.blob.BlobInputStream.read(byte[], int, int) @bci=28, line=420 (Interpreted frame)
      - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
      - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
      - java.io.DataInputStream.read(byte[], int, int) @bci=7, line=149 (Interpreted frame)
      - org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(byte[], int, int) @bci=10, line=734 (Interpreted frame)
      - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
      - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
      - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)
      - org.apache.hadoop.util.LineReader.fillBuffer(java.io.InputStream, byte[], boolean) @bci=2, line=180 (Interpreted frame)
      - org.apache.hadoop.util.LineReader.readDefaultLine(org.apache.hadoop.io.Text, int, int) @bci=64, line=216 (Compiled frame)
      - org.apache.hadoop.util.LineReader.readLine(org.apache.hadoop.io.Text, int, int) @bci=19, line=174 (Interpreted frame)
      - org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue() @bci=108, line=185 (Interpreted frame)
      - org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue() @bci=13, line=553 (Interpreted frame)
      - org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue() @bci=4, line=80 (Interpreted frame)
      - org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue() @bci=4, line=91 (Interpreted frame)
      - org.apache.hadoop.mapreduce.Mapper.run(org.apache.hadoop.mapreduce.Mapper$Context) @bci=6, line=144 (Interpreted frame)
      - org.apache.hadoop.mapred.MapTask.runNewMapper(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapreduce.split.JobSplit$TaskSplitIndex, org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter) @bci=228, line=784 (Interpreted frame)
      - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=148, line=341 (Interpreted frame)
      - org.apache.hadoop.mapred.YarnChild$2.run() @bci=29, line=163 (Interpreted frame)
      - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
      - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted frame)
      - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1628 (Interpreted frame)
      - org.apache.hadoop.mapred.YarnChild.main(java.lang.String[]) @bci=514, line=158 (Interpreted frame)

      The issue is that the storage client does not, by default, set a socket timeout on its HTTP connections, so in some (rare) circumstances a read can block indefinitely (e.g. when the server on the other side dies unexpectedly).
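
      For context, the JDK's HTTP connections ship with a read timeout of 0, which means "block forever". A minimal sketch of the difference a client side socket timeout makes (not from the patch; the endpoint and the 30s value are purely illustrative):

      import java.io.InputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;

      public class ReadTimeoutSketch {
          public static void main(String[] args) throws Exception {
              // Hypothetical blob endpoint, for illustration only.
              URL url = new URL("https://example.blob.core.windows.net/container/blob");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();

              // Prints 0: the default read timeout, i.e. wait forever. This is
              // the state the storage client leaves its sockets in unless told
              // otherwise.
              System.out.println("read timeout = " + conn.getReadTimeout());

              // A client side socket timeout bounds the wait instead.
              conn.setReadTimeout(30_000);

              try (InputStream in = conn.getInputStream()) {
                  byte[] buf = new byte[8192];
                  // Without a read timeout, read() never returns if the server
                  // dies without closing the socket; with one, it throws
                  // java.net.SocketTimeoutException after 30s.
                  while (in.read(buf) != -1) {
                      // drain the response
                  }
              }
          }
      }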

      The fix is to configure the maximum operation time on the storage client request options.
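
      A minimal sketch of that approach, assuming a version of the Azure storage Java SDK that exposes RequestOptions.setMaximumExecutionTimeInMs (the committed patch may wire this differently and choose a different default):

      import com.microsoft.azure.storage.blob.BlobRequestOptions;
      import com.microsoft.azure.storage.blob.CloudBlobClient;

      public final class StorageTimeoutConfig {
          // Illustrative value, not necessarily what the patch uses.
          private static final int MAX_OPERATION_TIME_MS = 30 * 1000;

          static void configure(CloudBlobClient client) {
              BlobRequestOptions options = new BlobRequestOptions();
              // Bound the end-to-end time of each blob operation so a server
              // that dies mid-response cannot hang the task indefinitely.
              options.setMaximumExecutionTimeInMs(MAX_OPERATION_TIME_MS);
              // Make it the client-wide default so every request inherits it.
              client.setDefaultRequestOptions(options);
          }
      }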

        Attachments

        1. HADOOP-11959.2.patch (8 kB, Ivan Mitic)
        2. HADOOP-11959.patch (0.4 kB, Ivan Mitic)

        People

        • Assignee: Ivan Mitic (ivanmi)
        • Reporter: Ivan Mitic (ivanmi)
        • Votes: 0
        • Watchers: 3
