  1. Flink
  2. FLINK-25027

Allow GC of a finished job's JobMaster before the slot timeout is reached

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0, 1.12.5, 1.13.3
    • Fix Version/s: 1.20.0
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      In a session cluster, after a (batch) job has finished, its JobMaster seems to stick around for another couple of minutes before becoming eligible for garbage collection.

      A heap dump suggests that the JobMaster is tied to a PhysicalSlotRequestBulkCheckerImpl that is enqueued in the underlying Akka executor, which keeps the JM from being GC’d. By default, the action is scheduled after slot.request.timeout, which defaults to 5 minutes (thanks to trohrmann for helping out here).

      With this setting, you have to provision enough metaspace to cover 5 minutes of finished jobs, which may span a couple of jobs, for no benefit.

      The problem seems to be that Flink schedules this action through the main thread executor, which uses the ActorSystem's scheduler, and a task scheduled with Akka can (probably) not be cancelled easily.
      One idea is to use a dedicated thread pool per JM that is shut down when the JM terminates, so that pending tasks no longer keep the JM from being GC’d.
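To illustrate the dedicated-pool idea, here is a minimal sketch that uses only java.util.concurrent. It is not Flink code: FakeJobMaster, scheduleTimeoutCheck, pendingTasks and shutDown are hypothetical names, and the sketch makes no claim about how Flink's main thread executor or the Akka scheduler work internally. It only shows that a delayed task which captures its owner sits in the executor's queue until the delay expires, and that shutting down an executor owned by the JM drops that task right away.

```java
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class DedicatedJobExecutorSketch {

    /** Hypothetical stand-in for a JobMaster; NOT Flink's JobMaster class. */
    static final class FakeJobMaster {

        // Assumption: one small scheduler owned by this job only.
        private final ScheduledThreadPoolExecutor timeoutExecutor =
                new ScheduledThreadPoolExecutor(1);

        /** Schedules a delayed check, similar in spirit to a slot request timeout. */
        ScheduledFuture<?> scheduleTimeoutCheck(long delay, TimeUnit unit) {
            // The method reference captures `this`, so the queued task
            // keeps this object strongly reachable while it is pending.
            return timeoutExecutor.schedule(this::checkPendingRequests, delay, unit);
        }

        private void checkPendingRequests() {
            // Placeholder for the actual timeout handling.
        }

        /** Number of delayed tasks still waiting in the executor's queue. */
        int pendingTasks() {
            return timeoutExecutor.getQueue().size();
        }

        /** Stops the executor and returns how many pending tasks were dropped. */
        int shutDown() {
            // shutdownNow() drains the queue and interrupts the worker thread,
            // so nothing in the executor references this object afterwards.
            List<Runnable> dropped = timeoutExecutor.shutdownNow();
            return dropped.size();
        }
    }

    public static void main(String[] args) {
        FakeJobMaster jobMaster = new FakeJobMaster();
        jobMaster.scheduleTimeoutCheck(5, TimeUnit.MINUTES);

        System.out.println("pending before shutdown: " + jobMaster.pendingTasks());
        System.out.println("dropped by shutdown: " + jobMaster.shutDown());
        System.out.println("pending after shutdown: " + jobMaster.pendingTasks());
    }
}
```

The design choice here is ownership: because the executor belongs to a single job, terminating the job can discard all of its pending timeouts at once, without having to track and cancel each scheduled future on a shared scheduler.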

      (The concrete example we investigated was a DataSet job)

            People

              Assignee: zjureel Fang Yong
              Reporter: nkruber Nico Kruber
              Votes: 0
              Watchers: 16
