Apache Celeborn / CELEBORN-1580

Job hangs because ReadBufferDispatcher does not notify the listener of exceptions


Details

    Description

      We discovered a Flink job, using Celeborn as its remote shuffle service, that had been hanging for over 20 days.

      After a thorough investigation, including heap dumps of both the Flink TaskManagers and the Celeborn workers and an examination of the log files, we found that a Celeborn worker had hit an out-of-memory (OOM) exception in ReadBufferDispatcher#run. The exception was caught but never passed on to the listener. As a result, the MapPartitionDataReader made no further progress because it was still waiting for buffer resources, and this caused the Flink job to hang.
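The failure mode described above can be illustrated with a minimal sketch. This is not Celeborn's actual code: `DispatcherSketch`, `BufferListener`, `BufferRequest`, `Dispatcher`, and the injected `allocator` are all hypothetical names chosen for illustration, and the OOM is simulated by an allocator that throws. The sketch only shows the general pattern of forwarding an allocation failure to the waiting listener instead of swallowing it in the dispatcher loop.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.IntFunction;

// Hypothetical sketch; all names here are assumptions, not Celeborn classes.
public class DispatcherSketch {

  /** Callback registered by a reader that is waiting for buffers. */
  interface BufferListener {
    void onBuffers(byte[][] buffers);
    void onError(Throwable cause);
  }

  /** One pending request: how many buffers, of what size, and who is waiting. */
  static final class BufferRequest {
    final int count;
    final int size;
    final BufferListener listener;

    BufferRequest(int count, int size, BufferListener listener) {
      this.count = count;
      this.size = size;
      this.listener = listener;
    }
  }

  static final class Dispatcher implements Runnable {
    private final BlockingQueue<BufferRequest> requests = new LinkedBlockingQueue<>();
    private final IntFunction<byte[]> allocator; // injected so a failure can be simulated
    private volatile boolean running = true;

    Dispatcher(IntFunction<byte[]> allocator) {
      this.allocator = allocator;
    }

    void submit(BufferRequest request) {
      requests.add(request);
    }

    void stop() {
      running = false;
    }

    /** Serves one request; every outcome, success or failure, reaches the listener. */
    void handle(BufferRequest request) {
      byte[][] buffers;
      try {
        buffers = new byte[request.count][];
        for (int i = 0; i < request.count; i++) {
          buffers[i] = allocator.apply(request.size);
        }
      } catch (Throwable t) {
        // Logging alone would leave the reader waiting forever,
        // so the failure is forwarded to the listener.
        request.listener.onError(t);
        return;
      }
      request.listener.onBuffers(buffers);
    }

    @Override
    public void run() {
      while (running) {
        try {
          BufferRequest request = requests.poll(100, TimeUnit.MILLISECONDS);
          if (request != null) {
            handle(request);
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }

  public static void main(String[] args) {
    // Simulate the OOM with an allocator that always throws.
    Dispatcher dispatcher = new Dispatcher(size -> {
      throw new OutOfMemoryError("simulated");
    });
    dispatcher.handle(new BufferRequest(2, 1024, new BufferListener() {
      public void onBuffers(byte[][] buffers) {
        System.out.println("got " + buffers.length + " buffers");
      }

      public void onError(Throwable cause) {
        System.out.println("reader can fail fast: " + cause);
      }
    }));
  }
}
```

With this shape, a reader whose request cannot be served is told about the failure and can fail its stream rather than wait indefinitely. Whether notification is still reliable after a real OutOfMemoryError depends on how much heap remains, so a production fix may need additional safeguards.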


          People

            yyhx xuhuang
            yyhx xuhuang

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 40m
