Beam / BEAM-13795

`beam.CombineValues` on Dataflow runner causes ambiguous failure with the Python SDK

Details

    • Type: Bug
    • Status: Open
    • Priority: P2
    • Resolution: Unresolved
    • Affects Version: 2.35.0
    • Fix Version: None
    • Environment: Can provide Dockerfile, pyproject.toml, poetry.lock files on request.
      Using Apache Beam 2.35.0 with GCP extras, on Python 3.8.10.

    Description

       

      The following beam pipeline works correctly using `DirectRunner` but fails with a very vague error when using `DataflowRunner`.

      (
          pipeline
          | beam.io.ReadFromPubSub(input_topic, with_attributes=True)
          | beam.Map(pubsub_message_to_row)
          | beam.WindowInto(beam.transforms.window.FixedWindows(5))
          | beam.GroupBy(<beam.Row col name>)
          | beam.CombineValues(<instance of beam.CombineFn subclass>)
          | beam.Values()
          | beam.io.gcp.bigquery.WriteToBigQuery( . . . )
      )
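For context, the `<instance of beam.CombineFn subclass>` placeholder above follows Beam's CombineFn lifecycle (create_accumulator / add_input / merge_accumulators / extract_output). The report does not show the real CombineFn, so here is a plain-Python sketch of that lifecycle with a hypothetical mean combiner standing in for it (no Beam dependency, just the protocol a runner drives):

```python
# Plain-Python sketch of Beam's CombineFn lifecycle. `MeanCombineFn` is a
# hypothetical stand-in for the CombineFn used in the failing pipeline.
class MeanCombineFn:
    def create_accumulator(self):
        # (running sum, element count)
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


# Driving the lifecycle by hand, roughly the way a runner would per key/window:
fn = MeanCombineFn()
acc = fn.create_accumulator()
for v in (1.0, 2.0, 3.0):
    acc = fn.add_input(acc, v)
merged = fn.merge_accumulators([acc, fn.create_accumulator()])
print(fn.extract_output(merged))  # → 2.0
```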

      Stacktrace:

      Traceback (most recent call last):
        File "src/read_quality_pipeline/__init__.py", line 128, in <module>
          (
        File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py", line 597, in __exit__
          self.result.wait_until_finish()
        File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1633, in wait_until_finish
          raise DataflowRuntimeException(
      apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
      Error processing pipeline. 

      Log output:

      2022-02-01T16:54:43.645Z: JOB_MESSAGE_WARNING: Autoscaling is enabled for Dataflow Streaming Engine. Workers will scale between 1 and 100 unless maxNumWorkers is specified.
      2022-02-01T16:54:43.736Z: JOB_MESSAGE_DETAILED: Autoscaling is enabled for job 2022-02-01_08_54_40-8791019287477103665. The number of workers will be between 1 and 100.
      2022-02-01T16:54:43.757Z: JOB_MESSAGE_DETAILED: Autoscaling was automatically enabled for job 2022-02-01_08_54_40-8791019287477103665.
      2022-02-01T16:54:44.624Z: JOB_MESSAGE_ERROR: Error processing pipeline. 

      With the `CombineValues` step removed, this pipeline successfully starts in Dataflow.

       

      I initially thought this was a server-side Dataflow issue, since the Dataflow API (v1b3.projects.locations.jobs.messages) returns only the textPayload "Error processing pipeline." But I then found BEAM-12636, where a Go SDK user reports the same error message, seemingly caused by bugs in the Go SDK.

People

    Assignee: Unassigned
    Reporter: Jake Zuliani

Dates

    Created:
    Updated:
