Details
- Type: Bug
- Status: Open
- Priority: P2
- Resolution: Unresolved
- Affects Version: 2.36.0
- Fix Version: None
- Components: None
Description
When I run a job with many workers (100 or more) and large shuffle sizes (millions of records and/or several GB), my workers fail unexpectedly with the following output:
python -m apache_beam.runners.worker.sdk_worker_main
E0308 12:59:18.067442934   724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
    main(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
    sdk_harness.run()
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
>
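For context on the first line of that output: a GOAWAY with ENHANCE_YOUR_CALM and "too_many_pings" debug data is how a gRPC server rejects a peer whose HTTP/2 keepalive pings arrive more often than the server's policy allows, after which the connection is torn down. Below is a minimal sketch, using standard gRPC Python channel arguments, of how a client's keepalive behavior can be relaxed to test whether keepalive pings are the trigger. The endpoint is a placeholder (it just mirrors the peer port in the error above), and the values are illustrative assumptions, not Beam's actual configuration or a confirmed fix.

    import grpc

    # Sketch only: standard gRPC channel arguments controlling the HTTP/2
    # keepalive pings that a "too_many_pings" GOAWAY complains about.
    # Endpoint and values are placeholders, not a verified workaround.
    channel = grpc.insecure_channel(
        "localhost:34305",  # placeholder; taken from the peer address in the error
        options=[
            # Send a keepalive ping at most once every 5 minutes.
            ("grpc.keepalive_time_ms", 300_000),
            # Wait up to 20 seconds for a ping ack before dropping the connection.
            ("grpc.keepalive_timeout_ms", 20_000),
            # Do not ping while the connection has no active calls.
            ("grpc.keepalive_permit_without_calls", 0),
            # Cap pings sent while no data frames are in flight.
            ("grpc.http2.max_pings_without_data", 2),
        ],
    )

If the server side is the stricter party, the mirror-image server arguments (e.g. grpc.http2.max_ping_strikes) would be the knobs to look at instead; I have not verified which side is misconfigured here.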
This is probably related to, or even the same as, BEAM-12448 or BEAM-6258, but since one of those is already marked as fixed in an earlier version, and both reports have accumulated long tails of unreadable auto-generated comments, I decided to open a new issue.
There is not much more information I can give, since this is all the error output I get. It is hard to debug, and with this many workers I cannot even tell whether the worker reporting the error is the one actually experiencing it.