  1. Flink
  2. FLINK-31492

AWS Firehose Connector misclassifies IAM permission exceptions as retryable


Details

    Description

      The AWS Firehose connector uses an exception classification mechanism to decide whether errors that occur while writing requests to AWS Firehose are fatal (i.e. non-retryable) or non-fatal (i.e. retryable).

      private boolean isRetryable(Throwable err) {
          if (!FIREHOSE_FATAL_EXCEPTION_CLASSIFIER.isFatal(err, getFatalExceptionCons())) {
              return false;
          }
          if (failOnError) {
              getFatalExceptionCons()
                      .accept(new KinesisFirehoseException.KinesisFirehoseFailFastException(err));
              return false;
          }
      
          return true;
      } 

      (github)

      This exception classification mechanism compares an exception's actual type with known fatal exception types (by using Flink's FatalExceptionClassifier.withExceptionClassifier). An exception is considered fatal if it is assignable to one of the known fatal exception types (code).

      The AWS Firehose SDK throws fatal IAM permission exceptions as FirehoseExceptions, e.g.

      software.amazon.awssdk.services.firehose.model.FirehoseException: User: arn:aws:sts::000000000000:assumed-role/example-role/kiam-kiam is not authorized to perform: firehose:PutRecordBatch on resource: arn:aws:firehose:us-east-1:000000000000:deliverystream/example-stream because no identity-based policy allows the firehose:PutRecordBatch action

      At the same time, certain subtypes of FirehoseException are retryable and non-fatal (e.g. LimitExceededException, see https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/firehose/model/LimitExceededException.html).

      The AWS Firehose connector currently misclassifies the fatal IAM permission exception as non-fatal and therefore treats it as retryable. The current exception classification mechanism cannot easily handle a case where a supertype should be considered fatal but one of its subtypes should not.
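
      The supertype/subtype problem can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not the real AWS SDK or Flink classes, and the classifier is simplified to a plain instance check, which is an assumption about how type-based matching behaves rather than a copy of Flink's implementation.

```java
import java.util.List;

// Hypothetical stand-in for the SDK's generic service exception (not the real AWS class).
class FirehoseExceptionStandIn extends RuntimeException {
    FirehoseExceptionStandIn(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a retryable subtype such as LimitExceededException.
class LimitExceededStandIn extends FirehoseExceptionStandIn {
    LimitExceededStandIn(String message) {
        super(message);
    }
}

class TypeBasedClassifierSketch {
    // Simplified type-based check: fatal if the error is an instance of any registered type.
    static boolean isFatalByType(Throwable err, List<Class<? extends Throwable>> fatalTypes) {
        for (Class<? extends Throwable> type : fatalTypes) {
            if (type.isInstance(err)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable accessDenied =
                new FirehoseExceptionStandIn("is not authorized to perform: firehose:PutRecordBatch");
        Throwable throttled = new LimitExceededStandIn("limit exceeded");

        // Supertype not registered: the permission error is not recognized as fatal.
        List<Class<? extends Throwable>> withoutSuperType = List.of(IllegalStateException.class);
        System.out.println(isFatalByType(accessDenied, withoutSuperType)); // prints false

        // Supertype registered: the retryable subtype is now treated as fatal too.
        List<Class<? extends Throwable>> withSuperType = List.of(FirehoseExceptionStandIn.class);
        System.out.println(isFatalByType(throttled, withSuperType)); // prints true
    }
}
```

      Neither registration gives the desired result for both errors, which is why matching on the exception type alone is not sufficient here.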

      To distinguish such cases, AWS services and the AWS SDK use error codes (see e.g. Firehose's error codes or S3's error codes, see API docs here and here) that uniquely identify error conditions and can be used to handle errors by type.
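
      A classification based on error codes might look like the sketch below. It is an illustration under stated assumptions, not the connector's code: ServiceErrorStandIn is a hypothetical stand-in for an SDK service exception (in the AWS SDK for Java v2 the code is available via AwsServiceException#awsErrorDetails().errorCode()), and the set of fatal codes is assumed for the example and would need to be checked against the Firehose API docs.

```java
import java.util.Set;

// Hypothetical stand-in for an SDK service exception that carries an error code.
class ServiceErrorStandIn extends RuntimeException {
    private final String errorCode;

    ServiceErrorStandIn(String errorCode, String message) {
        super(message);
        this.errorCode = errorCode;
    }

    String errorCode() {
        return errorCode;
    }
}

class ErrorCodeClassifierSketch {
    // Assumed fatal codes, for illustration only.
    static final Set<String> FATAL_ERROR_CODES = Set.of("AccessDeniedException");

    // Fatal only if the error carries a code that is in the fatal set,
    // independent of where the exception sits in the class hierarchy.
    static boolean isFatalByErrorCode(Throwable err) {
        return err instanceof ServiceErrorStandIn
                && FATAL_ERROR_CODES.contains(((ServiceErrorStandIn) err).errorCode());
    }

    public static void main(String[] args) {
        Throwable accessDenied =
                new ServiceErrorStandIn("AccessDeniedException", "not authorized");
        Throwable throttled =
                new ServiceErrorStandIn("LimitExceededException", "limit exceeded");

        System.out.println(isFatalByErrorCode(accessDenied)); // prints true
        System.out.println(isFatalByErrorCode(throttled));    // prints false
    }
}
```

      Because the decision depends on the code rather than the class, both errors can share the same exception type and still be classified differently.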

      The AWS Firehose connector (and other AWS connectors) currently log at debug level when retrying fully failed records (code). This makes it difficult for users to find the root cause of the above issue without enabling debug logs.


          People

            samuelsiebenmann Samuel Siebenmann