Details
-
Improvement
-
Status: Resolved
-
P3
-
Resolution: Fixed
-
None
-
None
Description
We should cache the credential used within a Pipeline and re-use it instead of generating a new one on each GCS call. When executing (against 2.0.0 RC2):
python -m apache_beam.examples.wordcount --input "gs://dataflow-samples/shakespeare/*" --output local_counts
Note that we seemingly generate a new access token each time instead of when a refresh is required.
super(GcsIO, cls).__new__(cls, storage_client))
INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:root:Finished the size estimation of the input at 1 files. Estimation took 0.286200046539 seconds
INFO:root:Running pipeline with DirectRunner.
INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:root:Finished the size estimation of the input at 43 files. Estimation took 0.205624818802 seconds
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
... many more times ...