Beam / BEAM-11905

GCP DataFlow not cleaning up GCP BigQuery temporary datasets

Details

    • Type: Improvement
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 2.27.0
    • Fix Version/s: None
    • Component/s: io-py-gcp
    • Labels: None
    • Environment: GCP DataFlow

    Description

      I'm running a number of GCP DataFlow jobs to transform tables within GCP BigQuery, and they create a bunch of temporary datasets that are not deleted when the jobs complete successfully. I'm launching the DataFlow jobs via Airflow / GCP Cloud Composer.
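
      For reference, the jobs boil down to roughly the following shape (a minimal sketch, not the actual code; the project, region, bucket, and table names are placeholders). As far as I can tell, reading query results like this is what creates the temporary datasets in question:

      ```python
      # Minimal sketch of the pipeline shape; all names below are placeholders.
      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(
          runner='DataflowRunner',
          project='my-project',                # placeholder
          region='us-central1',                # placeholder
          temp_location='gs://my-bucket/tmp',  # placeholder
      )

      with beam.Pipeline(options=options) as pipeline:
          # Reading a query exports its results via a temporary
          # dataset/table pair before the pipeline consumes them.
          rows = pipeline | 'ReadSource' >> beam.io.ReadFromBigQuery(
              query='SELECT * FROM `my-project.my_dataset.my_table`',  # placeholder
              use_standard_sql=True)
          # ... transforms and the write back to BigQuery elided ...
      ```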

      The Composer environment's Airflow UI does not reveal anything. When I go into GCP DataFlow, click on a job named $BATCH_JOB marked "Status: Succeeded" and "SDK version: 2.27.0", drill into a step within that job and a stage within that step, and then open the Logs window, filter for "LogLevel: Error", and click on a log message, I get this:


      ```
      Error message from worker: Traceback (most recent call last):
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
          work_executor.execute()
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
          self._split_task)
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
          desired_bundle_size)
        File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
          for split in source.split(desired_bundle_size):
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
          schema, metadata_list = self._export_files(bq)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
          bq.wait_for_bq_job(job_ref)
        File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
          job_reference.jobId, job.status.errorResult))
      RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed. Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
      ```
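
      As a stopgap I've been deleting the leaked datasets by hand. Here is a sketch of that cleanup, assuming every leaked dataset matches the temp_dataset_ prefix above; the project ID is a placeholder, and you should verify a dataset is truly orphaned before deleting it:

      ```python
      # Stopgap cleanup sketch: drop leftover temp_dataset_* datasets.
      # 'my-project' is a placeholder; confirm no in-flight jobs still
      # reference a dataset before deleting it.
      from google.cloud import bigquery

      client = bigquery.Client(project='my-project')  # placeholder

      for dataset in client.list_datasets():
          if dataset.dataset_id.startswith('temp_dataset_'):
              # delete_contents=True also drops the temp_table_* inside.
              client.delete_dataset(dataset.reference,
                                    delete_contents=True,
                                    not_found_ok=True)
              print('Deleted', dataset.dataset_id)
      ```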


      I would provide the equivalent REST description of the batch job, but I'm not sure whether it would be helpful or whether it contains sensitive information.


      I'm not sure whether Beam v2.27.0 is affected by https://issues.apache.org/jira/browse/BEAM-6514 or https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am using the Python 3.7 SDK v2.27.0 and not the Java SDK.


      I'd appreciate any help with this issue.

    People

      Assignee: Unassigned
      Reporter: Ying Wang (yingw787)
      Votes: 1
      Watchers: 5
