Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.1.3
-
None
-
None
-
None
Description
We get fatal failures from S3guard (that in turn fail our spark jobs) because of the inernal DynamoDB system errors.
com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG): Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG)
The DynamoDB has separate statistic for system errors:
I contacted the AWS Support and got an explanation that those 500 errors are returned to the client once DynamoDB gets overwhelmed with client requests.
So essentially the traffic should had been throttled but it didn't and got 500 system errors.
My point is that the client should handle those errors just like throttling exceptions -
with exponential backoff retries.
Here is more complete exception stack trace:
org.apache.hadoop.fs.s3a.AWSServiceIOException: get on s3a://rem-spark/persisted_step_data/15/0afb1ccb73854f1fa55517a77ec7cc5e_b67e2221-f0e3-4c89-90ab-f49618ea4557_SDTopology/parquet.all_ranges/topo_id=321: com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG): Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG) at org.apache.hadoop.fs.s3a.S3AUtils.translateDynamoDBException(S3AUtils.java:389) at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:181) at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111) at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.get(DynamoDBMetadataStore.java:438) at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2110) at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:1889) at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$9(S3AFileSystem.java:1868) at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1868) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:277) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:207) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:206) at scala.collection.immutable.Stream.map(Stream.scala:418) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:206) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:204) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2925) at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2901) at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1640) at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1616) at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.doLoadItem(GetItemImpl.java:77) at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.getItem(GetItemImpl.java:66) at com.amazonaws.services.dynamodbv2.document.Table.getItem(Table.java:608) at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.getConsistentItem(DynamoDBMetadataStore.java:423) at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.innerGet(DynamoDBMetadataStore.java:459) at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.lambda$get$2(DynamoDBMetadataStore.java:439) at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) ... 29 more
Attachments
Attachments
Issue Links
- relates to
-
HADOOP-15426 Make S3guard client resilient to DDB throttle events and network failures
- Resolved