Apache Celeborn / CELEBORN-1580

Job hangs because ReadBufferDispatcher does not notify the listener of exceptions

Details

    Description

      We discovered a Flink job, using Celeborn as its remote shuffle service, that had been hanging for over 20 days.

      After a thorough investigation involving heap dumps of both the Flink TaskManagers and the Celeborn workers, as well as examining log files, we found that a Celeborn worker encountered an OOM exception in ReadBufferDispatcher#run. The exception was caught, but the listener was never notified of it. Consequently, the MapPartitionDataReader made no further progress because it was still waiting for buffer resources, which led to the Flink job hanging.
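The failure mode described above can be illustrated with a minimal sketch. This is not Celeborn's actual code: `DispatcherSketch`, `BufferListener`, `BufferRequest`, `Dispatcher`, and the injected allocator are all hypothetical names chosen for illustration. It only shows the general pattern of a dispatcher that catches an allocation failure and forwards it to the waiting listener instead of swallowing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.IntFunction;

// Hypothetical sketch; none of these types are Celeborn's real classes.
final class DispatcherSketch {

  /** Callback a reader registers when it asks for buffers. */
  interface BufferListener {
    void onBuffers(List<byte[]> buffers);

    void onError(Throwable cause);
  }

  /** One pending request: how many buffers, how large, and who is waiting. */
  static final class BufferRequest {
    final int count;
    final int size;
    final BufferListener listener;

    BufferRequest(int count, int size, BufferListener listener) {
      this.count = count;
      this.size = size;
      this.listener = listener;
    }
  }

  static final class Dispatcher {
    private final BlockingQueue<BufferRequest> requests = new LinkedBlockingQueue<>();

    void submit(BufferRequest request) {
      requests.add(request);
    }

    /** Handles one pending request; returns false if the queue was empty. */
    boolean processOne(IntFunction<byte[]> allocator) {
      BufferRequest request = requests.poll();
      if (request == null) {
        return false;
      }
      try {
        List<byte[]> buffers = new ArrayList<>();
        for (int i = 0; i < request.count; i++) {
          buffers.add(allocator.apply(request.size));
        }
        request.listener.onBuffers(buffers);
      } catch (Throwable t) {
        // If this block only logged the failure, the reader behind the
        // listener would wait for buffers forever. Forwarding the error
        // lets the reader fail fast instead of hanging.
        request.listener.onError(t);
      }
      return true;
    }
  }
}
```

The allocator is passed in as a parameter only so that an allocation failure can be simulated in a test, for example with a lambda that throws `OutOfMemoryError`. With the error forwarded, the listener's `onError` runs and the waiting side can release its resources and fail the task.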

            People

              yyhx xuhuang
              yyhx xuhuang
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m