Beam / BEAM-14146

Python Streaming job failing to drain with BigQueryIO write errors

Details

    • Type: Bug
    • Status: Open
    • Priority: P1
    • Resolution: Unresolved
    • Affects Version/s: 2.37.0
    • Fix Version/s: 2.40.0
    • Component/s: io-py-gcp, sdk-py-core
    • Labels: None

    Description

      We have a Python streaming Dataflow job that writes to BigQuery using the FILE_LOADS method with with_auto_sharding enabled. When we try to drain the job, it fails with the following error:

      File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1000, in perform_load_job
      ValueError: Either a non-empty list of fully-qualified source URIs must be provided via the source_uris parameter or an open file object must be provided via the source_stream parameter.
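
      The message says the load step was called with neither a non-empty list of source URIs nor a source stream. Below is a minimal, self-contained sketch of that kind of guard, to make the failure condition concrete. The function name `perform_load_job_sketch`, its return values, and the `gs://` path in the usage note are hypothetical stand-ins, not Beam's actual implementation.

```python
from typing import IO, Optional, Sequence


def perform_load_job_sketch(
    source_uris: Optional[Sequence[str]] = None,
    source_stream: Optional[IO[bytes]] = None,
) -> str:
    """Hypothetical stand-in for a load-job entry point (not Beam's code).

    Returns which input would be used, or raises ValueError when neither
    a non-empty URI list nor a stream is supplied.
    """
    # An empty list is falsy, so source_uris=[] is treated as "nothing to load".
    if not source_uris and source_stream is None:
        raise ValueError(
            "Either a non-empty list of fully-qualified source URIs must be "
            "provided via the source_uris parameter or an open file object "
            "must be provided via the source_stream parameter."
        )
    return "uris" if source_uris else "stream"
```

      Under these assumptions, `perform_load_job_sketch(source_uris=["gs://example-bucket/tmp/file-0"])` returns `"uris"`, while `perform_load_job_sketch(source_uris=[])` raises the same ValueError text as above. That would be consistent with the load step receiving an empty file list during drain, although the report does not confirm the cause.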
      

      Our WriteToBigQuery configuration:

      beam.io.WriteToBigQuery(
        table=options.output_table,
        schema=bq_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        additional_bq_parameters={
          "timePartitioning": {
            "type": "HOUR",
            "field": "bq_insert_timestamp",
          },
          "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"],
        },
        triggering_frequency=120,
        with_auto_sharding=True,
      )
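
      For context, `timePartitioning` and `schemaUpdateOptions` are fields of a BigQuery load job configuration. The sketch below shows, under stated assumptions, how the extra parameters above could end up in such a configuration as a plain dictionary. The helper `build_load_config_sketch`, the table reference, and the source URI are hypothetical and only illustrate the shape of the request. This is not how Beam builds the job internally.

```python
from typing import Any, Dict, List


def build_load_config_sketch(
    table_ref: Dict[str, str],
    source_uris: List[str],
    additional_bq_parameters: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a load-job configuration as a dict."""
    load: Dict[str, Any] = {
        "destinationTable": table_ref,
        "sourceUris": list(source_uris),
        "createDisposition": "CREATE_IF_NEEDED",
        "writeDisposition": "WRITE_APPEND",
    }
    # Extra parameters are merged on top of the base load configuration.
    load.update(additional_bq_parameters)
    return {"load": load}


config = build_load_config_sketch(
    # Hypothetical table reference and source file, for illustration only.
    table_ref={"projectId": "p", "datasetId": "d", "tableId": "t"},
    source_uris=["gs://example-bucket/tmp/file-0"],
    additional_bq_parameters={
        "timePartitioning": {"type": "HOUR", "field": "bq_insert_timestamp"},
        "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"],
    },
)
```

      In this sketch the schema update options travel with each load job, so a job submitted with an empty `sourceUris` list would still carry them.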
      

      We have also noticed that the job fails to drain only when there are actual schema updates. If there are no schema updates, the job drains without the above error.

          People

            Assignee: Heejong Lee
            Reporter: Rahul Iyer
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 40m