BEAM-11098

Running Apache Beam to distribute the cleaning of a dataset in Google Cloud Dataflow


    Details

      Description

      I am trying to download C4 by following [these instructions](https://github.com/google-research/text-to-text-transfer-transformer#c4). Three hours into my Dataflow job, it fails with the traceback below. I can't find any help on Google for this error.

       

      Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
      work_executor.execute()
      File "/usr/local/lib/python3.6/site-packages/dataflow_worker/executor.py", line 179, in execute
      op.start()
      File "dataflow_worker/shuffle_operations.py", line 63, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
      File "dataflow_worker/shuffle_operations.py", line 64, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
      File "dataflow_worker/shuffle_operations.py", line 79, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
      File "dataflow_worker/shuffle_operations.py", line 80, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
      File "dataflow_worker/shuffle_operations.py", line 84, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
      File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
      File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
      File "dataflow_worker/shuffle_operations.py", line 261, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
      File "dataflow_worker/shuffle_operations.py", line 268, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
      File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
      File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
      File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
      File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
      File "apache_beam/runners/common.py", line 1215, in apache_beam.runners.common.DoFnRunner.process
      File "apache_beam/runners/common.py", line 1279, in apache_beam.runners.common.DoFnRunner._reraise_augmented
      File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
      File "apache_beam/runners/common.py", line 569, in apache_beam.runners.common.SimpleInvoker.invoke_process
      File "apache_beam/runners/common.py", line 1371, in apache_beam.runners.common._OutputProcessor.process_outputs
      File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
      File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
      File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
      File "apache_beam/runners/common.py", line 1215, in apache_beam.runners.common.DoFnRunner.process
      File "apache_beam/runners/common.py", line 1294, in apache_beam.runners.common.DoFnRunner._reraise_augmented
      File "/usr/local/lib/python3.6/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
      raise exc.with_traceback(traceback)
      File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
      File "apache_beam/runners/common.py", line 570, in apache_beam.runners.common.SimpleInvoker.invoke_process
      File "/mnt/pccfs/backed_up/crytting/persuasion/createc4/lib/python3.6/site-packages/apache_beam/transforms/core.py", line 815, in <lambda>
      self.process = lambda element: fn(element)
      TypeError: clean_page() got an unexpected keyword argument 'badwords_regex' [while running 'clean_pages']
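
The final TypeError means that the `clean_pages` step called `clean_page` with a `badwords_regex` keyword that the function running on the worker does not accept. One possible cause, which is an assumption and not confirmed in this report, is that the code building the pipeline and the copy of `clean_page` installed on the workers come from different versions. The minimal sketch below reproduces the same kind of error in plain Python and shows a signature check that could be run locally before launching a long job. The `clean_page` body, its `min_words` parameter, and the `unsupported_kwargs` helper are hypothetical stand-ins, not code from Beam or from the C4 scripts.

```python
import inspect


def clean_page(page, min_words=3):
    """Hypothetical stand-in for a clean_page version without badwords_regex."""
    text = page["text"]
    return page if len(text.split()) >= min_words else None


def unsupported_kwargs(fn, kwargs):
    """Return the names in kwargs that fn cannot accept as keyword arguments."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter means any keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return [name for name in kwargs if name not in accepted]


page = {"text": "a short example page"}
extra = {"badwords_regex": None}

# Calling the function directly reproduces the same kind of TypeError.
try:
    clean_page(page, **extra)
    message = ""
except TypeError as err:
    message = str(err)

# The check reports the offending keyword without calling the function.
problems = unsupported_kwargs(clean_page, extra)
```

If a check like this reports `badwords_regex` locally, the mismatch is in the local environment; if it reports nothing, the workers are probably running a different copy of the module than the one used to build the pipeline.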

            People

            • Assignee: Unassigned
            • Reporter: Chris Rytting (crytting)

                Time Tracking

                • Original Estimate: 1h
                • Remaining Estimate: 1h
                • Time Spent: Not Specified