Beam / BEAM-3757

Shuffle read failed using python 2.2.0

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version: 2.2.0
    • Fix Version: Not applicable
    • Component: runner-dataflow
    • Labels: None
    • Environment: gcp, macos

    Description

      Hi,

      The first issue is that the Beam 2.3.0 Python SDK apparently does not work on GCP: the job gets stuck:

      Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support. 
      

      I tried twice.

      Reverting to 2.2.0 usually works, but today, after more than an hour of processing with 30 workers in use, the job failed with these entries in the logs:

      Traceback (most recent call last):
        File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
          work_executor.execute()
        File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
          op.start()
        File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
          def start(self):
        File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
          with self.scoped_start_state:
        File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
          with self.shuffle_source.reader() as reader:
        File "dataflow_worker/shuffle_operations.py", line 67, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
          for key_values in reader:
        File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 406, in __iter__
          for entry in entries_iterator:
        File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 248, in next
          return next(self.iterator)
        File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 206, in __iter__
          chunk, next_position = self.reader.Read(start_position, end_position)
        File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 138, in shuffle_client.PyShuffleReader.Read
      IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2  talking to my-dataflow-02271107-756f-harness-2p65:12346
      

      I also get this informational message:

      Refusing to split <dataflow_worker.shuffle.GroupedShuffleRangeTracker object at 0x7f03a00fe790> at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
      

      As for the pipeline: I extract data from BigQuery, clean it with pandas, export it as a CSV file, gzip it, and upload the compressed file to a bucket using decompressive transcoding (the CSV export, gzip compression, and upload all happen in the same beam.DoFn).
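For context, the export step described above can be sketched as follows. This is a minimal, hypothetical reconstruction (the function name `rows_to_gzipped_csv` is mine, not from the reporter's pipeline): it builds a CSV in memory and gzips it so the object can be uploaded with `Content-Encoding: gzip` and `Content-Type: text/csv`, which is what GCS decompressive transcoding requires.

```python
import csv
import gzip
import io


def rows_to_gzipped_csv(rows, header):
    """Serialize rows to CSV in memory and gzip the result."""
    text_buf = io.StringIO()
    writer = csv.writer(text_buf)
    writer.writerow(header)
    writer.writerows(rows)
    # mtime=0 keeps the gzip output byte-identical across runs.
    return gzip.compress(text_buf.getvalue().encode("utf-8"), mtime=0)


# The upload step (assumed, using the google-cloud-storage client; shown as a
# comment so the sketch stays self-contained):
#
#   blob = bucket.blob("export.csv")
#   blob.content_encoding = "gzip"  # enables decompressive transcoding
#   blob.upload_from_string(
#       rows_to_gzipped_csv(rows, header), content_type="text/csv")
```

In the reporter's setup these steps run inside a single beam.DoFn, so the gzipped bytes would be produced and uploaded per bundle rather than via a separate sink transform.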

      PS: I can't find a reasonable way to export the logs from GCP, but I can privately send the log file I have from the run on my machine (the pipeline log).

      People

        Assignee: Unassigned
        Reporter: jonathan_d (Jonathan Delfour)
        Votes: 0
        Watchers: 3