Details
- Type: Bug
- Status: Open
- Priority: P2
- Resolution: Unresolved
- Affects Version: 2.36.0
- Fix Version: None
- Components: None
Description
When I run a job with many workers (100 or more) and large shuffle sizes (millions of records and/or several GB), my workers fail unexpectedly with the following output:
python -m apache_beam.runners.worker.sdk_worker_main
E0308 12:59:18.067442934   724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
    main(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
    sdk_harness.run()
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
>
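For context on the first line of that output: a GOAWAY with ENHANCE_YOUR_CALM and "too_many_pings" debug data is how a gRPC server rejects a peer whose HTTP/2 keepalive pings arrive more often than the server's policy allows, after which the connection is torn down. Below is a minimal sketch, using standard gRPC Python channel arguments, of how a client's keepalive behavior can be relaxed to test whether keepalive pings are the trigger. The endpoint is a placeholder (it just mirrors the peer port in the error above), and the values are illustrative assumptions, not Beam's actual configuration or a confirmed fix.

    import grpc

    # Sketch only: standard gRPC channel arguments controlling the HTTP/2
    # keepalive pings that a "too_many_pings" GOAWAY complains about.
    # Endpoint and values are placeholders, not a verified workaround.
    channel = grpc.insecure_channel(
        "localhost:34305",  # placeholder; taken from the peer address in the error
        options=[
            # Send a keepalive ping at most once every 5 minutes.
            ("grpc.keepalive_time_ms", 300_000),
            # Wait up to 20 seconds for a ping ack before dropping the connection.
            ("grpc.keepalive_timeout_ms", 20_000),
            # Do not ping while the connection has no active calls.
            ("grpc.keepalive_permit_without_calls", 0),
            # Cap pings sent while no data frames are in flight.
            ("grpc.http2.max_pings_without_data", 2),
        ],
    )

If the server side is the stricter party, the mirror-image server arguments (e.g. grpc.http2.max_ping_strikes) would be the knobs to look at instead; I have not verified which side is misconfigured here.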
This is probably related to, or even the same as, BEAM-12448 or BEAM-6258, but since one of those is already marked as fixed in an earlier version, and both reports have accumulated long tails of unreadable auto-generated comments, I decided to open a new issue.
There is not much more information I can give, since this is all the error output I get. It is hard to debug, and with this many workers I cannot even tell whether the worker reporting the error is the one actually experiencing it.