Details
Type: Improvement
Status: Resolved
Priority: P3
Resolution: Won't Fix
None
Description
When I run a pipeline, I believe everything (params, lexically scoped variables) must be pickleable for the individual processing stages.
I have to load a dependent datastore record in one of my processing pipelines. (Horribly inefficient, I know, but it's my DB design for now...)
A google.cloud.datastore.Client() is not serializable because the google.cloud.datastore._http.Connection it contains uses gRPC:
  File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
    self.args = pickler.loads(pickler.dumps(self.args))
  File "lib/apache_beam/internal/pickler.py", line 212, in loads
    return dill.loads(s)
  File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
    obj = cls.__new__(cls, *args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
TypeError: __cinit__() takes at least 2 positional arguments (0 given)
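For anyone triaging this, the same failure shape can be reproduced without Beam or gRPC. This is only a minimal sketch: `FakeChannel` and `FakeClient` are hypothetical stand-ins I made up (not the real google-cloud-datastore or grpcio classes), and it uses the standard `pickle` module rather than dill. It assumes the root cause is what the traceback suggests, namely that unpickling calls `cls.__new__(cls)` with no arguments on a type whose constructor requires some.

```python
import pickle


class FakeChannel(object):
    """Hypothetical stand-in for a channel type whose constructor needs arguments."""

    def __new__(cls, target, credentials):
        # Like the Cython channel in the traceback, creation requires arguments.
        self = super(FakeChannel, cls).__new__(cls)
        self.target = target
        return self


class FakeClient(object):
    """Hypothetical stand-in for a client that owns such a channel."""

    def __init__(self):
        self.channel = FakeChannel("example.invalid:443", None)


def round_trip_error(obj):
    """Return the exception raised by a pickle round trip, or None if it succeeds."""
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception as exc:
        return exc
    return None


if __name__ == "__main__":
    # Serializing works; rebuilding the channel on load is what fails.
    print(repr(round_trip_error(FakeClient())))
```

In this sketch `pickle.dumps` succeeds and the error only surfaces in `pickle.loads`, which matches the `load_newobj` frame in the traceback above.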
So instead, I construct a Client inside my pipeline. That appears to jump through hoops to recreate the Client every time, since each execution of my pipeline prints:
DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token
I'm sure Google SRE would be very unhappy if I scaled up this mapreduce.
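One possible workaround is to keep the client out of the pickled state and build it lazily, so each deserialized copy of a stage creates it at most once instead of once per record. The following is a rough sketch under that assumption, not a tested Beam integration: `FakeDatastoreClient` and `LookupStage` are hypothetical names, and the fake client merely counts how often it is constructed. In Beam the same idea could presumably live in a `DoFn`, for example by creating the client in `start_bundle`, but that is not shown here, and I have not verified that it removes the repeated token requests.

```python
import pickle


class FakeDatastoreClient(object):
    """Hypothetical stand-in for an expensive, unpicklable client."""

    instances_created = 0

    def __init__(self):
        FakeDatastoreClient.instances_created += 1

    def __reduce__(self):
        # Mimic a client that refuses to be serialized.
        raise TypeError("FakeDatastoreClient cannot be pickled")

    def get(self, key):
        # Pretend to fetch a dependent record.
        return {"key": key}


class LookupStage(object):
    """Processing stage that creates its client on first use."""

    def __init__(self):
        self._client = None

    def __getstate__(self):
        # Drop the live client so the stage itself stays picklable.
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def _get_client(self):
        if self._client is None:
            self._client = FakeDatastoreClient()
        return self._client

    def process(self, key):
        return self._get_client().get(key)


if __name__ == "__main__":
    stage = pickle.loads(pickle.dumps(LookupStage()))
    for key in ("a", "b", "c"):
        stage.process(key)
    print(FakeDatastoreClient.instances_created)
```

The trade-off is that every deserialized copy still pays the construction cost once, so this reduces client creation from per record to per copy rather than eliminating it.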
This is a tricky cross-team interaction issue (it only occurs for those using both google-cloud-datastore and apache-beam google-dataflow), so I'm not sure of the proper place to file it. I've cross-posted it at https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3191