Beam / BEAM-14070

Beam worker closing gRPC connection with many workers and large shuffle sizes


    • Type: Bug
    • Status: Open
    • Priority: P2
    • Resolution: Unresolved
    • Affects Version/s: 2.36.0
    • Fix Version/s: None
    • Component/s: sdk-py-core
    • Labels: None


      When I run a job with many workers (100 or more) and large shuffle sizes (millions of records and/or several GB), my workers fail unexpectedly with the following error:

      python -m apache_beam.runners.worker.sdk_worker_main 
      E0308 12:59:18.067442934     724 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings" 
      Traceback (most recent call last): 
       File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main 
         return _run_code(code, main_globals, None, 
       File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code 
         exec(code, run_globals) 
       File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module> 
       File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main 
       File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run 
         for work_request in self._control_stub.Control(get_responses()): 
       File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__ 
         return self._next() 
       File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next 
         raise self 
      grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: 
             status = StatusCode.UNAVAILABLE 
             details = "Socket closed" 
             debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket 

      This is probably related to or even the same as BEAM-12448 or BEAM-6258, but since one of them is already marked as fixed in a previous version and both reports have large tails of unreadable auto-generated comments, I decided to create a new issue.

      There is not much more information I can give, since this is all the error output I get. The problem is really hard to debug, and with the large number of workers I don't even know whether the worker reporting the error is actually the one experiencing it.
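
      One possible way to narrow down which worker saw the GOAWAY first is to scan every worker's log for the gRPC core line shown above and compare timestamps. The following is only a minimal sketch: it assumes the logs have been collected per worker as plain text, that all lines use the fixed-width format of the sample above (with the third field being a thread ID), and that worker clocks are roughly in sync. The names find_goaways, earliest_goaway and the worker labels are hypothetical helpers for illustration, not part of Beam or gRPC.

```python
import re
from typing import Iterable, Iterator, Mapping, NamedTuple, Optional

# Matches the gRPC core log line from the report, e.g.
# E0308 12:59:18.067442934     724 chttp2_transport.cc:1103]   Received a GOAWAY ...
GOAWAY_RE = re.compile(
    r'^\s*(?P<severity>[DIWE])(?P<month>\d{2})(?P<day>\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+'
    r'(?P<thread>\d+)\s+'          # assumed to be a thread ID
    r'(?P<source>\S+:\d+)\]\s+'
    r'Received a GOAWAY with error code (?P<code>\w+) '
    r'and debug data equal to "(?P<debug>[^"]*)"'
)


class GoAway(NamedTuple):
    worker: str   # hypothetical label for the worker whose log was scanned
    date: str     # "MM-DD", taken from the log prefix
    time: str     # "HH:MM:SS.nnnnnnnnn"
    thread: int
    code: str
    debug: str


def find_goaways(worker: str, lines: Iterable[str]) -> Iterator[GoAway]:
    """Yield one GoAway record per matching log line of a single worker."""
    for line in lines:
        m = GOAWAY_RE.match(line)
        if m:
            yield GoAway(
                worker,
                f"{m['month']}-{m['day']}",
                m['time'],
                int(m['thread']),
                m['code'],
                m['debug'],
            )


def earliest_goaway(logs: Mapping[str, Iterable[str]]) -> Optional[GoAway]:
    """Return the earliest GOAWAY across all workers, or None if there is none.

    Dates and times are compared as strings, which only works because the
    log prefix is fixed-width and all lines fall within the same year.
    """
    events = [e for w, lines in logs.items() for e in find_goaways(w, lines)]
    return min(events, key=lambda e: (e.date, e.time), default=None)
```

      Called as earliest_goaway({"worker-1": open("worker-1.log"), ...}) with hypothetical file names, this returns the first recorded GOAWAY together with the worker it was logged on. That does not prove which side caused the disconnect, but it gives a starting point for looking at that worker's surrounding log lines.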




            Assignee: Unassigned
            Reporter: Janek Bevendorff (phoerious)