Beam / BEAM-1788

Using google-cloud-datastore in Beam requires re-authentication

Details

    Description

      When I run a pipeline, I believe everything a processing stage uses (params, lexically scoped variables) must be picklable.

      I have to load a dependent datastore record in one of my processing pipelines. (Horribly inefficient, I know, but it's my DB design for now...)

      A google.cloud.datastore.Client() is not serializable because of the google.cloud.datastore._http.Connection it contains, which uses gRPC:

        File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
          self.args = pickler.loads(pickler.dumps(self.args))
        File "lib/apache_beam/internal/pickler.py", line 212, in loads
          return dill.loads(s)
        File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
          return load(file)
        File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
          obj = pik.load()
        File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
          dispatch[key](self)
        File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
          obj = cls.__new__(cls, *args)
        File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
      TypeError: __cinit__() takes at least 2 positional arguments (0 given)
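
The same class of failure can be reproduced without gRPC. Below is a minimal sketch using only the standard library: FakeConnection is a stand-in for the unpicklable channel (it is not the real Connection class), and LazyClient is a hypothetical wrapper that pickles only its settings and rebuilds the connection on first use after unpickling.

```python
import pickle
import threading


class FakeConnection(object):
    """Stand-in for an unpicklable handle such as a gRPC channel."""

    def __init__(self):
        self._lock = threading.Lock()  # lock objects cannot be pickled


class LazyClient(object):
    """Hypothetical wrapper: holds picklable settings, builds the connection lazily."""

    def __init__(self, project):
        self.project = project
        self._conn = None

    @property
    def conn(self):
        # Create the connection the first time it is needed.
        if self._conn is None:
            self._conn = FakeConnection()
        return self._conn

    def __getstate__(self):
        # Drop the live connection so only the settings are pickled.
        state = self.__dict__.copy()
        state['_conn'] = None
        return state


def can_pickle(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False
```

With this shape, the wrapper can be passed as a parameter to a stage even after its connection has been opened, because the connection never enters the pickled state.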
      

      So instead I construct a Client inside my pipeline. That appears to jump through hoops to recreate the Client each time, since each execution of my pipeline prints:
      DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token
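
One way to limit how often the client is rebuilt is to cache it once per worker process instead of constructing it per element. The sketch below only illustrates that caching pattern: make_client is a hypothetical factory returning a placeholder (in real code it would presumably construct the Datastore client), and whether this actually avoids the repeated token request depends on how the runner reuses worker processes, which is an assumption here.

```python
import functools

# Counts how many times the (hypothetical) factory actually runs.
calls = {'created': 0}


def make_client(project):
    """Hypothetical factory; stands in for building a real Datastore client."""
    calls['created'] += 1
    return {'project': project}  # placeholder object, not a real client


@functools.lru_cache(maxsize=None)
def get_client(project):
    """Return one shared client per project within this process."""
    return make_client(project)


def process(element, project='my-project'):
    """Per-element work that reuses the cached client."""
    client = get_client(project)
    return (element, client['project'])


results = [process(e) for e in range(5)]
```

Because the cache lives at module level, nothing unpicklable is captured by the processing function itself; only the project name travels with it.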

      I'm sure Google SRE would be very unhappy if I scaled up this mapreduce.

      This is a tricky cross-team interaction issue (it only occurs for those using both google-cloud-datastore and apache-beam google-dataflow), so I'm not sure of the proper place to file it. I've cross-posted it at https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3191

          People

            altay Ahmet Altay
            mlambert Mike Lambert