Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
We discovered a Flink job, which uses Celeborn as its remote shuffle service, that had been suspended for over 20 days.
After a thorough investigation involving heap dumps of both the Flink TaskManagers and the Celeborn workers, as well as examining the log files, we found that a Celeborn worker encountered an OOM exception in ReadBufferDispatcher#run. The exception was caught, but it was never reported to the listener. Consequently, the MapPartitionDataReader made no further progress because it was still waiting for buffer resources, which led to the Flink job's suspension.
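The failure mode can be illustrated with a minimal, self-contained sketch. It is not Celeborn's code: the names DispatcherSketch, BufferListener, Request, Dispatcher and outcome are hypothetical, and the OutOfMemoryError is simulated by a size cap rather than by real memory pressure. The sketch assumes that a waiting reader is only released when the dispatcher thread forwards a caught Throwable to the request's listener instead of swallowing it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DispatcherSketch {

  /** Hypothetical callback; not Celeborn's actual listener interface. */
  interface BufferListener {
    void onBuffer(byte[] buffer);
    void onError(Throwable cause);
  }

  /** Hypothetical buffer request carrying the listener to notify. */
  static final class Request {
    final int size;
    final BufferListener listener;

    Request(int size, BufferListener listener) {
      this.size = size;
      this.listener = listener;
    }
  }

  /** Hypothetical dispatcher loop, loosely modelled on the description above. */
  static final class Dispatcher implements Runnable {
    private final BlockingQueue<Request> requests = new LinkedBlockingQueue<>();
    private final int maxBufferSize;

    Dispatcher(int maxBufferSize) {
      this.maxBufferSize = maxBufferSize;
    }

    void submit(Request request) {
      requests.add(request);
    }

    private byte[] allocate(int size) {
      // Assumption: simulate an allocation failure instead of exhausting memory.
      if (size > maxBufferSize) {
        throw new OutOfMemoryError("simulated: " + size + " bytes requested");
      }
      return new byte[size];
    }

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        Request request;
        try {
          request = requests.take();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        try {
          request.listener.onBuffer(allocate(request.size));
        } catch (Throwable t) {
          // Catching without this call would leave the requester waiting forever.
          request.listener.onError(t);
        }
      }
    }
  }

  /** Submits one request and reports what the waiting side observes. */
  public static String outcome(int maxBufferSize, int requestedSize) {
    Dispatcher dispatcher = new Dispatcher(maxBufferSize);
    Thread thread = new Thread(dispatcher, "dispatcher-sketch");
    thread.setDaemon(true);
    thread.start();

    CompletableFuture<byte[]> result = new CompletableFuture<>();
    dispatcher.submit(new Request(requestedSize, new BufferListener() {
      @Override
      public void onBuffer(byte[] buffer) {
        result.complete(buffer);
      }

      @Override
      public void onError(Throwable cause) {
        result.completeExceptionally(cause);
      }
    }));

    try {
      return "buffer:" + result.get(2, TimeUnit.SECONDS).length;
    } catch (ExecutionException e) {
      return "error:" + e.getCause().getClass().getSimpleName();
    } catch (Exception e) {
      // Timeout or interruption: the requester was never notified.
      return "stalled:" + e.getClass().getSimpleName();
    } finally {
      thread.interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(outcome(16, 8));
    System.out.println(outcome(16, 1024));
  }
}
```

In this sketch, removing the onError call from the catch block makes the second request time out instead of failing fast, which mirrors the stalled reader described above on a small scale.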