Apache Airflow / AIRFLOW-2009

DataFlowHook does not use correct service account


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 1.10.3
    • Component/s: gcp, hooks
    • Labels:
      None

      Description

      We have been using the DataFlowOperator to schedule DataFlow jobs.

      We found that the DataFlowHook used by the DataFlowOperator doesn't actually use the passed `gcp_conn_id` to schedule the DataFlow job; it only uses it to read the job's results afterwards.

      code (https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py#L158):

          _Dataflow(cmd).wait_for_done()
          _DataflowJob(self.get_conn(), variables['project'],
                       name, self.poll_sleep).wait_for_done()

      The first line here should also be using self.get_conn().

      For this reason, our tasks using the DataFlowOperator have actually been scheduling DataFlow jobs with the default Google Compute Engine service account (which has DataFlow permissions). The permissions error only surfaces when our provided service account (which does not have DataFlow permissions) is used in the second line.

      I would like to fix this bug, but have to work around it at the moment due to time constraints.
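
      The workaround we relied on can be sketched as follows. Since the first line launches the job through a subprocess rather than an API client, one way to make it honor the connection's credentials is to export GOOGLE_APPLICATION_CREDENTIALS for that subprocess, which the Dataflow launcher and the Google client libraries pick up. This is a minimal sketch, not the eventual upstream fix; `run_dataflow_cmd` and `key_file_path` are hypothetical names, with `key_file_path` standing in for the key file configured on the `gcp_conn_id` connection.

      ```python
      import os
      import subprocess


      def run_dataflow_cmd(cmd, key_file_path=None):
          """Launch a Dataflow command with explicit credentials.

          Instead of inheriting the ambient Compute Engine service
          account, point the subprocess at the connection's key file via
          GOOGLE_APPLICATION_CREDENTIALS.
          """
          env = os.environ.copy()
          if key_file_path:
              # Google client libraries and the Dataflow launcher read
              # credentials from this environment variable.
              env["GOOGLE_APPLICATION_CREDENTIALS"] = key_file_path
          # check=True raises CalledProcessError if the launch fails.
          return subprocess.run(cmd, env=env, check=True)
      ```

      With this, the scheduling step and the polling step would at least agree on which service account is in play, so a missing DataFlow permission fails fast at launch time instead of only at polling time.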


              People

              • Assignee: Feng Lu (fenglu)
              • Reporter: Jessica Laughlin (jldlaughlin)
              • Votes: 0
              • Watchers: 6
