Apache Arrow / ARROW-14325

[C++] S3FileSystem enable automatic temporary credential refreshing for AWS Instance Profile

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.0.0
    • Fix Version/s: None
    • Component/s: C++

    Description

      Context: I am running pyarrow==5.0.0 on an AWS EC2 instance that is set up with an IAM role. AWS S3 credentials are provided via an Instance Profile, so my Python application code (e.g. pyarrow) receives temporary S3 credentials with a limited lifetime (e.g. 4 hours).

      For more info on this credential setup, see: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html#roles-usingrole-ec2instance-roles

       

      Problem: I am running a long-running pyarrow script on my EC2 instance (e.g. one that runs for more than 4 hours) that streams data from S3 the entire time. After ~4 hours, the script fails with a token expiration error:

       

      ...
      File "pyarrow/_dataset.pyx", line 3042, in _iterator
      File "pyarrow/_dataset.pyx", line 2813, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
      File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
      OSError: Could not open Parquet input source 'some-bucket/some/path/to/some_file.parquet': AWS Error [code 100]: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
      

       

      Digging into the source code, I suspect that pyarrow's S3FileSystem is currently doing the following:

       

      # Highly simplified pseudocode (Python-style; the real implementation is C++)
      class S3FileSystem:
          def __init__(self):
              # Aws::Auth::DefaultAWSCredentialsProviderChain in the AWS C++ SDK
              self.credentials_provider = DefaultAWSCredentialsProviderChain()
          ...
      
      # in pyarrow.dataset code
      def create_dataset(s3_path: str, s3fs: S3FileSystem) -> pyarrow.dataset.Dataset:
          # Fetches TEMPORARY credentials that will expire in ~4 hours.
          # Notably, pyarrow never tries to REFRESH these temp creds, which means
          #   that the returned Dataset will start failing after cred expiration,
          #   e.g. after ~4 hours
          aws_session_token, aws_secret_access_key, aws_access_key_id = s3fs.credentials_provider.get_credentials()
          return create_dataset_from_s3(s3_path, s3fs, aws_session_token, aws_secret_access_key, aws_access_key_id)
      
      

       

      Feature request: it'd be really great if pyarrow.fs.S3FileSystem could auto-refresh temporary credentials.

      It's worth noting that with "typical" usage of the AWS S3 SDK/tools, the temporary S3 credentials are refreshed transparently when using IAM roles + Instance Profiles: when the temp credentials expire, the AWS SDK should automatically fetch new ones from the IMDS, e.g. see:

      https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials
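
To make the requested behaviour concrete, here is a minimal Python sketch of refresh-before-expiry caching. It is not the AWS SDK's or Arrow's implementation: `TemporaryCredentials`, `RefreshingCredentialsProvider`, the `fetch` callable (standing in for a call to the IMDS) and the 5-minute refresh margin are all hypothetical, chosen only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TemporaryCredentials:
    """Hypothetical container for one set of temporary credentials."""
    access_key_id: str
    secret_access_key: str
    session_token: str
    expires_at: float  # Unix timestamp, in seconds


class RefreshingCredentialsProvider:
    """Caches credentials and re-fetches them shortly before they expire."""

    def __init__(
        self,
        fetch: Callable[[], TemporaryCredentials],  # assumption: e.g. an IMDS lookup
        refresh_margin_s: float = 300.0,            # assumption: refresh 5 min early
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._refresh_margin_s = refresh_margin_s
        self._clock = clock
        self._cached: Optional[TemporaryCredentials] = None

    def get_credentials(self) -> TemporaryCredentials:
        # Re-fetch when nothing is cached yet or the cached credentials are
        # within the refresh margin of their expiry time.
        now = self._clock()
        if (
            self._cached is None
            or now >= self._cached.expires_at - self._refresh_margin_s
        ):
            self._cached = self._fetch()
        return self._cached
```

The important property is that every request goes through `get_credentials()`, so a long-running job never holds on to one set of credentials for longer than their lifetime.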

       

      Additional notes:

      I did some initial digging into how S3FileSystem uses the AWS SDK credential providers, and I'm 99% sure that the current default credential provider does NOT support auto credential refreshing:

      1. pyarrow by default will use this DefaultAWSCredentialsProviderChain, which will (in my case) fall back to EC2ContainerCredentialsProviderWrapper
        https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L208
        https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
        https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.java#L46
      2. which uses this InstanceProfileCredentialsProvider
        https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceProfileCredentialsProvider.java#L34
      3. I THINK that this does NOT implement temp cred refreshing, which could explain why my job died after a few hours:
        https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceMetadataServiceCredentialsFetcher.java#L70
      4. on the other hand, pyarrow's role_arn option follows a different chain: it uses the StsAssumeRoleCredentialsProvider, notably passes a `load_frequency` arg, and does seem to have temp cred refresh enabled
        https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L227
        https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L43
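
The "fall back" behaviour of a provider chain described in the list above can be sketched as follows. This is a simplified illustration in Python, not the SDK's code; `resolve_credentials` and the provider names used in the usage note are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# (access key id, secret access key, session token or None)
Credentials = Tuple[str, str, Optional[str]]
# A provider returns credentials, or None when it has nothing to offer.
Provider = Callable[[], Optional[Credentials]]


def resolve_credentials(
    providers: Sequence[Tuple[str, Provider]],
) -> Tuple[str, Credentials]:
    """Try each (name, provider) pair in order and return the first hit."""
    for name, provider in providers:
        creds = provider()
        if creds is not None:
            return name, creds
    raise LookupError("no provider in the chain returned credentials")
```

For example, `resolve_credentials([("environment", env_provider), ("instance_profile", imds_provider)])` would return the instance profile's credentials whenever the hypothetical `env_provider` returns `None`. Whether the winning provider later refreshes those credentials is up to that provider, which is the question raised in item 3.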

       

      Finally: this PR did add support for automatic temporary credential refreshing, but ONLY for the "role_arn" (assume an IAM role by ARN) code path: https://github.com/apache/arrow/pull/7803

      Sadly, for my use case I can't use the "role_arn" code path since my EC2 instance has already assumed the required IAM role, and AWS does not play nicely with assuming the same role you already have.

       

      I'm not aware of any workarounds, other than possibly "hot swapping" out the S3FileSystem credential provider instance with a "fresh" one when my user code detects that the temporary credentials have expired. Not sure if that's even possible though.
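
A rough sketch of what that "hot swapping" workaround could look like in user code is shown below. It rests on an unverified assumption: that constructing a new filesystem object picks up fresh credentials. `RebuildingFileSystem`, `is_expired_token_error` and `make_fs` are hypothetical names, and matching on the string "ExpiredToken" is based only on the error message quoted above.

```python
from typing import Callable, Generic, TypeVar

FS = TypeVar("FS")
T = TypeVar("T")


def is_expired_token_error(exc: BaseException) -> bool:
    """Heuristic: detect the ExpiredToken failure from the OSError message."""
    return isinstance(exc, OSError) and "ExpiredToken" in str(exc)


class RebuildingFileSystem(Generic[FS]):
    """Runs operations against a filesystem, rebuilding it on token expiry."""

    def __init__(self, make_fs: Callable[[], FS]) -> None:
        self._make_fs = make_fs  # factory that builds a fresh filesystem
        self._fs = make_fs()

    def call(self, operation: Callable[[FS], T], max_rebuilds: int = 1) -> T:
        rebuilds = 0
        while True:
            try:
                return operation(self._fs)
            except OSError as exc:
                # Re-raise anything that is not a token expiry, and give up
                # once the rebuild budget is spent.
                if not is_expired_token_error(exc) or rebuilds >= max_rebuilds:
                    raise
                self._fs = self._make_fs()
                rebuilds += 1
```

With pyarrow this might be used as `wrapper = RebuildingFileSystem(lambda: pyarrow.fs.S3FileSystem())` followed by `wrapper.call(lambda fs: fs.open_input_file("some-bucket/some/path/to/some_file.parquet"))`. It only helps for operations that are started through the wrapper; a Dataset scan that is already in flight keeps its original filesystem and would still fail.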

       

          People

            Assignee: Unassigned
            Reporter: Eric Kim (erickim555)
            Votes: 0
            Watchers: 2
