Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 5.0.0
Fix Version/s: None
Description
Context: I am running pyarrow==5.0.0 on an AWS EC2 instance that is set up with an IAM role. AWS S3 credentials are provided via an instance profile, so my Python application code (e.g. pyarrow) receives temporary S3 credentials with a limited lifetime (e.g. 4 hours).
For more info on this credential setup, see: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html#roles-usingrole-ec2instance-roles
Problem: I am running a long-running pyarrow script on my EC2 instance (e.g. one that runs for more than 4 hours) that streams data from S3 the entire time. After ~4 hours, the script fails with a token expiration error:
    ...
      File "pyarrow/_dataset.pyx", line 3042, in _iterator
      File "pyarrow/_dataset.pyx", line 2813, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
      File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
    OSError: Could not open Parquet input source 'some-bucket/some/path/to/some_file.parquet': AWS Error [code 100]: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
Digging into the source code, I suspect that pyarrow's S3FileSystem is currently doing the following:
    # Highly simplified pseudocode
    class S3FileSystem:
        def __init__(self):
            # C++ SDK class, shown here in Python-like form
            self.credentials_provider = Aws::Auth::DefaultAWSCredentialsProviderChain()
            ...

    # in pyarrow.dataset code
    def create_dataset(s3_path: str, s3fs: S3FileSystem) -> pyarrow.dataset.Dataset:
        # Creates TEMPORARY credentials that will expire in ~4 hours.
        # Notably, pyarrow never tries to REFRESH these temp creds, which means
        # that the returned Dataset will start failing after cred expiration,
        # e.g. after ~4 hours.
        aws_session_token, aws_secret_access_key, aws_access_key_id = s3fs.credentials_provider.get_credentials()
        return create_dataset_from_s3(s3_path, s3fs, aws_session_token, aws_secret_access_key, aws_access_key_id)
Feature request: it'd be really great if pyarrow.fs.S3FileSystem could auto-refresh temporary credentials.
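To make the request concrete, here is a rough sketch of what "auto-refresh" means: cache the temporary credentials and fetch new ones shortly before they expire. This is only an illustration, not how pyarrow or the AWS SDK implements it. `TemporaryCredentials`, `RefreshingCredentials`, `fetch`, and `refresh_margin` are hypothetical names; `fetch` stands in for whatever actually retrieves fresh credentials (e.g. a call to the instance metadata service).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class TemporaryCredentials:
    """Hypothetical container for one set of temporary credentials."""
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: datetime  # timezone-aware timestamp


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


class RefreshingCredentials:
    """Caches credentials and re-fetches them shortly before they expire."""

    def __init__(
        self,
        fetch: Callable[[], TemporaryCredentials],  # hypothetical fetcher
        refresh_margin: timedelta = timedelta(minutes=5),
        now: Callable[[], datetime] = _utc_now,
    ) -> None:
        self._fetch = fetch
        self._refresh_margin = refresh_margin
        self._now = now
        self._cached: Optional[TemporaryCredentials] = None

    def get(self) -> TemporaryCredentials:
        current = self._cached
        # Refresh on first use, or once we are inside the safety margin.
        if current is None or self._now() >= current.expiration - self._refresh_margin:
            current = self._fetch()
            self._cached = current
        return current
```

The key point is that every request asks the provider for credentials via `get()` instead of holding on to the first set it was given.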
It's worth noting that with "typical" usage of the AWS S3 SDK/tools, the S3 temporary credentials are transparently refreshed when using IAM roles + instance profiles: when the temp credentials expire, the AWS SDK should automatically regenerate them from the IMDS.
Additional notes:
I did some initial digging into how S3FileSystem uses the AWS SDK credential providers, and I'm 99% sure that the current default credential provider does NOT support auto credential refreshing:
- pyarrow by default will use this DefaultAWSCredentialsProviderChain, which will (in my case) fall back to EC2ContainerCredentialsProviderWrapper
https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L208
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.java#L46
- which uses this InstanceProfileCredentialsProvider
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceProfileCredentialsProvider.java#L34
- I THINK that this does NOT implement temp cred refreshing, which could explain why my job died after a few hours:
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceMetadataServiceCredentialsFetcher.java#L70
- on the other hand, pyarrow's role_arn option follows a different chain: it uses the StsAssumeRoleCredentialsProvider, notably passes a `load_frequency` arg, and does seem to have temp cred refresh enabled
https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L227
https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L43
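For reference, the assume-role path is selected from Python through the `role_arn`, `session_name`, and `load_frequency` keyword arguments of `pyarrow.fs.S3FileSystem`. A minimal sketch follows; the helper names `assume_role_kwargs` and `make_assume_role_filesystem` are made up for this example, and the role ARN in the usage note is a placeholder.

```python
from typing import Any, Dict


def assume_role_kwargs(role_arn: str, session_name: str, load_frequency: int = 900) -> Dict[str, Any]:
    """Keyword arguments that select S3FileSystem's assume-role code path."""
    return {
        "role_arn": role_arn,
        "session_name": session_name,
        # Seconds between refreshes of the assumed-role credentials.
        "load_frequency": load_frequency,
    }


def make_assume_role_filesystem(role_arn: str, session_name: str, load_frequency: int = 900):
    # Imported lazily so assume_role_kwargs can be used without pyarrow installed.
    from pyarrow.fs import S3FileSystem

    return S3FileSystem(**assume_role_kwargs(role_arn, session_name, load_frequency))
```

For example, `make_assume_role_filesystem("arn:aws:iam::123456789012:role/example-role", "my-session")` would build a filesystem whose credentials are refreshed every 900 seconds (the default).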
Finally: this PR did add support for automatic temporary credential refreshing, but ONLY for the "role_arn" (assume an IAM role by ARN) code path: https://github.com/apache/arrow/pull/7803
Sadly, for my use case I can't use the "role_arn" code path since my EC2 instance has already assumed the required IAM role, and AWS does not play nicely with assuming the same role you already have.
I'm not aware of any workarounds, other than possibly "hot swapping" out the S3FileSystem credential provider instance with a "fresh" one when my user code detects that the temporary credentials have expired. Not sure if that's even possible though.
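A rough sketch of what that "hot swapping" could look like in user code: wrap the filesystem in a small holder that rebuilds it when an operation fails with an expired-token error, then retries once. `HotSwappingFileSystem`, `is_expired_token_error`, and `factory` are hypothetical names. Whether a freshly constructed `S3FileSystem` actually picks up new credentials from the instance profile is an assumption I have not verified.

```python
from typing import Callable, Generic, TypeVar

FS = TypeVar("FS")
R = TypeVar("R")


def is_expired_token_error(exc: OSError) -> bool:
    # Matches the "ExpiredToken" marker from the error message in this report.
    return "ExpiredToken" in str(exc)


class HotSwappingFileSystem(Generic[FS]):
    """Holds a filesystem and rebuilds it once when its token has expired."""

    def __init__(self, factory: Callable[[], FS]) -> None:
        # factory is a hypothetical zero-argument callable that builds a
        # fresh filesystem, e.g. `lambda: pyarrow.fs.S3FileSystem()`.
        self._factory = factory
        self._fs = factory()

    def run(self, operation: Callable[[FS], R]) -> R:
        try:
            return operation(self._fs)
        except OSError as exc:
            if not is_expired_token_error(exc):
                raise
            # Swap in a new filesystem and retry the operation exactly once.
            self._fs = self._factory()
            return operation(self._fs)
```

Usage would be something like `holder.run(lambda fs: pyarrow.parquet.read_table(path, filesystem=fs))`. Note that this only helps for operations that can be retried from scratch; a Dataset iterator that is already in flight would still be holding the old filesystem.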