FLINK-25027

Allow GC of a finished job's JobMaster before the slot timeout is reached


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0, 1.12.5, 1.13.3
    • Fix Version/s: 1.16.0
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      In a session cluster, after a (batch) job has finished, its JobMaster seems to stay reachable for another couple of minutes before it becomes eligible for garbage collection.

      A heap dump suggests that the JobMaster is retained by a PhysicalSlotRequestBulkCheckerImpl action that is enqueued in the underlying Akka executor, which keeps the JM from being GC’d. By default, this action is scheduled with a delay of slot.request.timeout, which defaults to 5 min (thanks to Till Rohrmann for helping out here).

      With this setting, you have to provision enough metaspace for every job that finishes within a 5-minute window, which may be a couple of jobs, needlessly!

      The problem seems to be that Flink uses the main thread executor for this scheduling, which in turn uses the ActorSystem's scheduler, and a task scheduled with Akka can (probably) not be cancelled easily.
      One idea would be to use a dedicated thread pool per JM that is shut down when the JM terminates. That way, pending scheduled tasks would not keep the JM from being GC’d.
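
      The dedicated-pool idea can be sketched with plain JDK classes. This is only an illustration under assumptions, not Flink code: JobScopedScheduler, scheduleTimeoutCheck and pendingTasks are hypothetical names, and a ScheduledThreadPoolExecutor stands in for whatever per-JM executor Flink would actually use. It shows that a task scheduled with a 5 minute delay sits in the executor's queue (and would keep anything it references reachable), and that shutting the pool down on termination drops it right away.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PerJobSchedulerSketch {

    /** Hypothetical per-job scheduler; the name and API are assumptions, not Flink classes. */
    static final class JobScopedScheduler implements AutoCloseable {
        private final ScheduledThreadPoolExecutor executor;

        JobScopedScheduler(String jobName) {
            executor = new ScheduledThreadPoolExecutor(1, runnable -> {
                Thread thread = new Thread(runnable, "job-scheduler-" + jobName);
                thread.setDaemon(true); // do not keep the JVM alive for this pool
                return thread;
            });
            // Remove cancelled tasks from the queue immediately instead of
            // leaving them there until their delay has elapsed.
            executor.setRemoveOnCancelPolicy(true);
        }

        /** Schedules a delayed check, e.g. after the slot request timeout. */
        ScheduledFuture<?> scheduleTimeoutCheck(Runnable check, Duration delay) {
            return executor.schedule(check, delay.toMillis(), TimeUnit.MILLISECONDS);
        }

        /** Number of tasks still queued, i.e. still strongly referenced by the pool. */
        int pendingTasks() {
            return executor.getQueue().size();
        }

        boolean isShutDown() {
            return executor.isShutdown();
        }

        /** Called when the job terminates: drains the queue and stops the worker thread. */
        @Override
        public void close() {
            List<Runnable> dropped = executor.shutdownNow();
            System.out.println("dropped " + dropped.size() + " pending task(s)");
        }
    }

    public static void main(String[] args) {
        JobScopedScheduler scheduler = new JobScopedScheduler("example-job");
        // The 5 minute delay mirrors the slot.request.timeout default from the description.
        scheduler.scheduleTimeoutCheck(() -> { }, Duration.ofMinutes(5));
        System.out.println("pending before close: " + scheduler.pendingTasks());
        scheduler.close();
        System.out.println("pending after close: " + scheduler.pendingTasks());
    }
}
```

      The trade-off of this approach is one extra thread per running job, in exchange for not depending on whether individual scheduled tasks can be cancelled in the shared scheduler.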

      (The concrete example we investigated was a DataSet job)


          People

            Assignee: Shammon (zjureel)
            Reporter: Nico Kruber (nkruber)
