[FLINK-31492] AWS Firehose Connector misclassifies IAM permission exceptions as retryable - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: aws-connector-4.1.0
Fix Version/s: aws-connector-4.2.0
Component/s: Connectors / AWS, Connectors / Firehose
Labels:
- pull-request-available

Description

The AWS Firehose connector uses an exception classification mechanism to decide if errors writing requests to AWS Firehose are fatal (i.e. non-retryable) or not (i.e. retryable).

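private boolean isRetryable(Throwable err) {
    // Despite its name, isFatal returns false when err matched a known fatal type;
    // in that case the classifier has already handed it to the fatal-exception consumer.
    if (!FIREHOSE_FATAL_EXCEPTION_CLASSIFIER.isFatal(err, getFatalExceptionCons())) {
        return false;
    }
    // With failOnError set, any remaining error is surfaced as fatal instead of retried.
    if (failOnError) {
        getFatalExceptionCons()
                .accept(new KinesisFirehoseException.KinesisFirehoseFailFastException(err));
        return false;
    }

    return true;
}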

(github)

This exception classification mechanism compares an exception's actual type with known, fatal exception types (by using Flink's FatalExceptionClassifier.withExceptionClassifier). An exception is considered fatal if it is assignable to a given known fatal exception (code).
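
The assignability check described above can be illustrated with a minimal, self-contained sketch. It does not use Flink's FatalExceptionClassifier or the AWS SDK; ServiceException, ThrottlingException and isFatalByType are hypothetical names introduced only for illustration. The point it shows is that registering a type as fatal also matches every subtype of that type.

```java
import java.util.List;

public class AssignabilityClassifierSketch {

    // Hypothetical stand-ins for an SDK exception hierarchy (not the real AWS classes).
    static class ServiceException extends RuntimeException {
        ServiceException(String message) {
            super(message);
        }
    }

    static class ThrottlingException extends ServiceException {
        ThrottlingException(String message) {
            super(message);
        }
    }

    // An error is treated as fatal if its class is the same as, or a subclass of,
    // any registered fatal type.
    static boolean isFatalByType(Throwable err, List<Class<? extends Throwable>> fatalTypes) {
        for (Class<? extends Throwable> fatalType : fatalTypes) {
            if (fatalType.isAssignableFrom(err.getClass())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Class<? extends Throwable>> fatalTypes = List.of(ServiceException.class);

        // The registered base type matches itself: prints true.
        System.out.println(isFatalByType(new ServiceException("not authorized"), fatalTypes));
        // Every subtype matches too, including one that should be retried: prints true.
        System.out.println(isFatalByType(new ThrottlingException("slow down"), fatalTypes));
        // Unrelated exception types do not match: prints false.
        System.out.println(isFatalByType(new IllegalStateException("other"), fatalTypes));
    }
}
```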

The AWS Firehose SDK throws fatal IAM permission exceptions as FirehoseExceptions, e.g.

software.amazon.awssdk.services.firehose.model.FirehoseException: User: arn:aws:sts::000000000000:assumed-role/example-role/kiam-kiam is not authorized to perform: firehose:PutRecordBatch on resource: arn:aws:firehose:us-east-1:000000000000:deliverystream/example-stream because no identity-based policy allows the firehose:PutRecordBatch action

At the same time, certain subtypes of FirehoseException are retryable and non-fatal, e.g. LimitExceededException (https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/firehose/model/LimitExceededException.html).

The AWS Firehose connector currently misclassifies the fatal IAM permission exception as non-fatal and therefore retries it. The current exception classification mechanism cannot easily handle a case where a supertype should be considered fatal but one of its subtypes should not.

AWS services and the AWS SDK use error codes (see e.g. Firehose's error codes or S3's error codes, API docs here and here) to uniquely identify error conditions. These codes can be used to handle errors by type, which would address this issue.
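
A minimal sketch of classifying by error code rather than by exception type might look as follows. It is self-contained and does not depend on the AWS SDK: ServiceException, FATAL_ERROR_CODES and isFatalByErrorCode are hypothetical names, and the code string "AccessDeniedException" is an assumption about what the service returns for IAM denials. In the AWS SDK for Java v2, the corresponding value is exposed through AwsServiceException.awsErrorDetails().errorCode().

```java
import java.util.Set;

public class ErrorCodeClassifierSketch {

    // Hypothetical stand-in for an SDK service exception that carries a service error code.
    static class ServiceException extends RuntimeException {
        private final String errorCode;

        ServiceException(String errorCode, String message) {
            super(message);
            this.errorCode = errorCode;
        }

        String errorCode() {
            return errorCode;
        }
    }

    // Assumed set of error codes that should never be retried.
    // The exact code string for IAM permission failures is an assumption.
    static final Set<String> FATAL_ERROR_CODES = Set.of("AccessDeniedException");

    // Fatal only if the error is a service exception whose code is in the fatal set.
    static boolean isFatalByErrorCode(Throwable err) {
        if (!(err instanceof ServiceException)) {
            return false;
        }
        String code = ((ServiceException) err).errorCode();
        // Set.of(...) rejects null lookups, so guard against a missing code.
        return code != null && FATAL_ERROR_CODES.contains(code);
    }

    public static void main(String[] args) {
        // Same exception class, different codes: prints true, then false.
        System.out.println(isFatalByErrorCode(
                new ServiceException("AccessDeniedException", "not authorized")));
        System.out.println(isFatalByErrorCode(
                new ServiceException("LimitExceededException", "slow down")));
    }
}
```

Because the decision is keyed on the code and not on the class, a supertype instance can be fatal while other errors surfaced through the same hierarchy stay retryable.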

The AWS Firehose connector (and other AWS connectors) currently log at debug level when retrying fully failed records (code). This makes it difficult for users to find the root cause of the above issue without enabling debug logs.

People

Assignee: Samuel Siebenmann

Reporter: Samuel Siebenmann

Votes: 0

Watchers: 2

Dates

Created: 16/Mar/23 16:40

Updated: 28/Mar/23 17:11

Resolved: 28/Mar/23 17:11