Beam / BEAM-12879

Downloading GCS objects suddenly requires storage.buckets.get permission

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version: 2.32.0
    • Fix Version: 2.37.0
    • Component: io-py-gcp

    Description

      With PR https://github.com/apache/beam/pull/14770, downloading GCS objects requires an additional IAM permission, `storage.buckets.get`, which is used to look up the project_number from the bucket name.

      If the service account or user does not have this permission, the following error is raised:

      Traceback (most recent call last):
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 651, in do_work
          work_executor.execute()
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
          op.start()
        File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
        File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
        File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
        File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
        File "apache_beam/runners/worker/operations.py", line 353, in apache_beam.runners.worker.operations.Operation.output
        File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
        File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.process
        File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.process
        File "apache_beam/runners/common.py", line 1234, in apache_beam.runners.common.DoFnRunner.process
        File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
        File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
        File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process
        File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
        File "/usr/local/lib/python3.7/site-packages/xyz/package/file.py", line 112, in process
          with FileSystems.open(element["gcs_uri"]) as file:
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystems.py", line 244, in open
          return filesystem.open(path, mime_type, compression_type)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 177, in open
          return self._path_open(path, 'rb', mime_type, compression_type)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 138, in _path_open
          raw_file = gcsio.GcsIO().open(path, mode, mime_type=mime_type)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 227, in open
          get_project_number=self.get_project_number)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in __init__
          project_number = self._get_project_number(self._bucket)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 166, in get_project_number
          self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
      AttributeError: 'NoneType' object has no attribute 'projectNumber' [while running 'read from GCS']
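      For reference, the permission requirement can be reproduced outside of Beam. The snippet below is only an illustration and uses the google-cloud-storage client (not the apitools-based client Beam uses internally); the bucket and object names are placeholders.

          # Illustration only: shows which GCS call needs storage.buckets.get.
          # Uses google-cloud-storage, not the apitools client Beam uses internally.
          from google.api_core.exceptions import Forbidden
          from google.cloud import storage

          client = storage.Client()

          # Downloading an object only needs storage.objects.get ...
          blob = client.bucket('my-bucket').blob('path/to/object')  # placeholder names
          data = blob.download_as_bytes()

          # ... but fetching the bucket metadata (where projectNumber comes from)
          # issues a buckets.get call and therefore needs storage.buckets.get.
          try:
              bucket_metadata = client.get_bucket('my-bucket')
              print(bucket_metadata.project_number)
          except Forbidden as e:
              print('Missing storage.buckets.get:', e)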
      

       

      The error message does not hint at what exactly goes wrong. After some digging, my assumption is that when `get_project_number` tries to fetch the `bucket_metadata`, the underlying `get_bucket` call fails with an HTTP error because of the missing permission; that error is caught and `None` is returned, so `bucket_metadata` ends up being `None`.
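      To make that concrete, here is a simplified, self-contained sketch of the failure path as I understand it from the traceback; it is not the exact Beam source, and `HttpError` here is a stand-in for the apitools exception.

          # Simplified sketch (not the exact Beam source) of the failure path in
          # apache_beam/io/gcp/gcsio.py, reconstructed from the traceback above.

          class HttpError(Exception):
              """Stand-in for the apitools HttpError."""

          class GcsIOSketch:
              def __init__(self):
                  self.bucket_to_project_number = {}

              def _fetch_bucket_metadata(self, bucket_name):
                  # Hypothetical stand-in for the real buckets.get API request;
                  # without storage.buckets.get the server answers 403.
                  raise HttpError('403 Forbidden')

              def get_bucket(self, bucket_name):
                  try:
                      return self._fetch_bucket_metadata(bucket_name)
                  except HttpError:
                      # The permission error is swallowed: the caller gets None.
                      return None

              def get_project_number(self, bucket):
                  if bucket not in self.bucket_to_project_number:
                      bucket_metadata = self.get_bucket(bucket_name=bucket)
                      # bucket_metadata is None here, so the next line raises
                      # AttributeError: 'NoneType' object has no attribute 'projectNumber'
                      self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
                  return self.bucket_to_project_number[bucket]

          GcsIOSketch().get_project_number('some-bucket')  # reproduces the AttributeError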

      The problem is that the required permission (`storage.buckets.get`) is, as far as I can tell, only covered by the predefined role Storage Admin (`roles/storage.admin`), which I believe should not be necessary just to read objects from GCS.

      I am not sure what the solution should look like: we want the metadata, including the project number, but on the other hand it seems excessive to have to grant Storage Admin (or to create custom roles) just to work with GCS objects. In any case, this situation needs a clearer error message: `get_project_number` should handle getting `None` back from `get_bucket` gracefully rather than failing with the `AttributeError` shown above.
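      As one possible direction (just a sketch on my side, not a concrete proposal for the fix), `get_project_number` could check for `None` and either raise a descriptive error pointing at the missing `storage.buckets.get` permission, or log a warning and skip the project-number lookup. Continuing the toy sketch from above:

          import logging

          class GcsIOSketchWithGuard:
              def __init__(self):
                  self.bucket_to_project_number = {}

              def get_bucket(self, bucket_name):
                  # As in the sketch above: the HTTP error from buckets.get is swallowed.
                  return None

              def get_project_number(self, bucket):
                  if bucket not in self.bucket_to_project_number:
                      bucket_metadata = self.get_bucket(bucket_name=bucket)
                      if bucket_metadata is None:
                          # Instead of crashing with an AttributeError, surface an
                          # actionable message and skip the project-number lookup.
                          logging.warning(
                              'Could not fetch metadata for bucket %s; the credentials '
                              'in use probably lack the storage.buckets.get permission. '
                              'The project number will be unavailable.', bucket)
                          return None
                      self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
                  return self.bucket_to_project_number[bucket]

          GcsIOSketchWithGuard().get_project_number('some-bucket')  # logs a warning, returns None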

      Note: This issue probably does not only affect the Python SDK, but I believe I checked the Java implementation, and there at least a more precise error should be raised.

      First issue, don't eat me alive

People

    Assignee: Ning (ningk)
    Reporter: Robert Jany (rjany)
