Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: aws-connector-4.1.0
Description
The AWS Firehose connector uses an exception classification mechanism to decide if errors writing requests to AWS Firehose are fatal (i.e. non-retryable) or not (i.e. retryable).
private boolean isRetryable(Throwable err) {
    if (!FIREHOSE_FATAL_EXCEPTION_CLASSIFIER.isFatal(err, getFatalExceptionCons())) {
        return false;
    }
    if (failOnError) {
        getFatalExceptionCons()
                .accept(new KinesisFirehoseException.KinesisFirehoseFailFastException(err));
        return false;
    }
    return true;
}
(github)
This exception classification mechanism compares an exception's actual type with a set of known fatal exception types (using Flink's FatalExceptionClassifier.withExceptionClassifier). An exception is considered fatal if its type is assignable to one of the known fatal exception types (code).
The AWS Firehose SDK throws fatal IAM permission exceptions as FirehoseExceptions, e.g.
software.amazon.awssdk.services.firehose.model.FirehoseException: User: arn:aws:sts::000000000000:assumed-role/example-role/kiam-kiam is not authorized to perform: firehose:PutRecordBatch on resource: arn:aws:firehose:us-east-1:000000000000:deliverystream/example-stream because no identity-based policy allows the firehose:PutRecordBatch action
At the same time, certain subtypes of FirehoseException are retryable and non-fatal (e.g. https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/firehose/model/LimitExceededException.html).
The AWS Firehose connector currently misclassifies the fatal IAM permission exception as non-fatal. The current exception classification mechanism does not easily handle a case where a super-type should be considered fatal but one of its subtypes should not.
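The super-type/subtype conflict can be illustrated with a minimal sketch. It does not use Flink's FatalExceptionClassifier or the AWS SDK: SketchFirehoseException, SketchLimitExceededException and isFatalByType are hypothetical stand-ins, and isFatalByType returns true for fatal errors directly, which is a simplification for illustration.

```java
import java.util.List;

// Hypothetical stand-in for the SDK super-type (not the real FirehoseException).
class SketchFirehoseException extends RuntimeException {
    SketchFirehoseException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a retryable subtype such as LimitExceededException.
class SketchLimitExceededException extends SketchFirehoseException {
    SketchLimitExceededException(String message) {
        super(message);
    }
}

final class AssignabilityClassifierSketch {

    // Returns true when err is an instance of any of the listed fatal types.
    static boolean isFatalByType(Throwable err, List<Class<? extends Throwable>> fatalTypes) {
        for (Class<? extends Throwable> fatalType : fatalTypes) {
            if (fatalType.isAssignableFrom(err.getClass())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable iamDenied =
                new SketchFirehoseException("not authorized to perform: firehose:PutRecordBatch");
        Throwable throttled = new SketchLimitExceededException("limit exceeded");

        // Registering the super-type also marks the retryable subtype as fatal.
        List<Class<? extends Throwable>> withSuperType = List.of(SketchFirehoseException.class);
        System.out.println(isFatalByType(iamDenied, withSuperType)); // true
        System.out.println(isFatalByType(throttled, withSuperType)); // true

        // Leaving the super-type out lets the IAM error pass as retryable.
        List<Class<? extends Throwable>> withoutSuperType = List.of(IllegalStateException.class);
        System.out.println(isFatalByType(iamDenied, withoutSuperType)); // false
        System.out.println(isFatalByType(throttled, withoutSuperType)); // false
    }
}
```

With a purely type-based check, neither list gives the desired result of "IAM error fatal, limit error retryable".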
To address this issue, note that AWS services and the AWS SDK use error codes (see e.g. Firehose's error codes or S3's error codes; API docs here and here) that uniquely identify error conditions and can be used to handle errors by type.
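A classification based on error codes might look roughly like the sketch below. This is not the connector's actual fix. CodedServiceException is a hypothetical stand-in that avoids an SDK dependency (in the AWS SDK for Java 2.x the code is exposed via AwsServiceException.awsErrorDetails().errorCode()), and the entries in FATAL_ERROR_CODES are assumed values chosen for illustration.

```java
import java.util.Set;

// Hypothetical stand-in for an SDK service exception that carries an error code.
class CodedServiceException extends RuntimeException {
    private final String errorCode;

    CodedServiceException(String errorCode, String message) {
        super(message);
        this.errorCode = errorCode;
    }

    String errorCode() {
        return errorCode;
    }
}

final class ErrorCodeClassifierSketch {

    // Assumed set of fatal error codes, for illustration only.
    static final Set<String> FATAL_ERROR_CODES =
            Set.of("AccessDeniedException", "ResourceNotFoundException");

    // Returns true if any exception in the cause chain carries a fatal error code.
    static boolean isFatalByErrorCode(Throwable err) {
        for (Throwable t = err; t != null; t = t.getCause()) {
            if (t instanceof CodedServiceException) {
                String code = ((CodedServiceException) t).errorCode();
                // Set.of(...) rejects null lookups, so guard against a missing code.
                if (code != null && FATAL_ERROR_CODES.contains(code)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable denied = new CodedServiceException("AccessDeniedException", "not authorized");
        Throwable throttled = new CodedServiceException("LimitExceededException", "slow down");

        System.out.println(isFatalByErrorCode(denied)); // true
        System.out.println(isFatalByErrorCode(throttled)); // false
    }
}
```

Because the decision depends on the code rather than the class hierarchy, the same exception type can be fatal for one error condition and retryable for another.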
The AWS Firehose connector (and other AWS connectors) currently log at debug level when retrying fully failed records (code). This makes it difficult for users to root-cause the above issue without enabling debug logs.