FLINK-11205

Task Manager Metaspace Memory Leak

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.5.5, 1.6.2, 1.7.0
    • Fix Version/s: None
    • Component/s: Runtime / Coordination
    • Labels: None
    • Flags: Important

    Description

      Job restarts cause the TaskManager to dynamically load duplicate classes. Metaspace is unbounded and grows with every restart. YARN aggressively kills such containers, but the effect is immediately seen on a different TaskManager, which results in a death spiral.
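
      The duplicate-class effect can be reproduced outside Flink. The following is a minimal sketch, not Flink's actual classloader: `DuplicateClassDemo`, `UserFunction`, `PerJobLoader`, and `loadForRestart` are hypothetical names invented for illustration. It assumes each restart creates a fresh loader that defines the user class itself, and shows that the JVM then holds two distinct `Class` objects with the same name. Each such class occupies Metaspace until its defining loader becomes unreachable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DuplicateClassDemo {

    /** Trivial stand-in for a user-code class shipped in a job jar (hypothetical). */
    public static class UserFunction {}

    /** Hypothetical per-job loader that defines UserFunction itself instead of delegating. */
    static class PerJobLoader extends ClassLoader {
        private final String target = UserFunction.class.getName();

        PerJobLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(target)) {
                // Everything else (java.lang.Object etc.) goes to the parent as usual.
                return super.loadClass(name, resolve);
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    // Defining the class here creates new class metadata in Metaspace.
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Simulates one job (re)start: a fresh loader loads the same user class again. */
    public static Class<?> loadForRestart() throws ClassNotFoundException {
        ClassLoader parent = DuplicateClassDemo.class.getClassLoader();
        return new PerJobLoader(parent).loadClass(UserFunction.class.getName());
    }

    public static void main(String[] args) throws Exception {
        Class<?> first = loadForRestart();
        Class<?> second = loadForRestart();
        System.out.println("same name:  " + first.getName().equals(second.getName()));
        System.out.println("same class: " + (first == second));
    }
}
```

      If anything keeps an old loader reachable after a restart (for example a thread or a static reference created by user code), its classes cannot be unloaded, which would be consistent with the growth described above.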

      The TaskManager uses dynamic classloading as described in https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html, which states:

      YARN

      YARN classloading differs between single job deployments and sessions:

      • When submitting a Flink job/application directly to YARN (via bin/flink run -m yarn-cluster ...), dedicated TaskManagers and JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in the Java classpath. That means that there is no dynamic classloading involved in that case.
      • When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the classpath. The classes from all jobs that are submitted against the session are loaded dynamically.

      The above is not entirely true, especially when you set -yD classloader.resolve-order=parent-first. We also observed the behaviour described above when submitting a Flink job/application directly to YARN (via bin/flink run -m yarn-cluster ...).
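
      To check whether a given deployment mode shows this growth, the Metaspace pool can be sampled from inside the JVM. The sketch below uses the standard `java.lang.management` API. The class name `MetaspaceProbe` is hypothetical, and the pool name "Metaspace" is an assumption that holds on HotSpot JVMs (Java 8 and later); other JVMs may name the pool differently, in which case the method returns -1.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbe {

    /** Returns the bytes currently used by the Metaspace pool, or -1 if no such pool is found. */
    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // "Metaspace" is the HotSpot pool name (assumption, see above).
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used (bytes): " + metaspaceUsedBytes());
    }
}
```

      Logging this value before and after each job restart would show whether usage keeps climbing or returns to a baseline once old classloaders are collected.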

      Attachments

        1. Screenshot 2018-12-18 at 15.47.55.png
          166 kB
          Nawaid Shamim
        2. Screenshot 2018-12-18 at 12.14.11.png
          176 kB
          Nawaid Shamim

            People

              Assignee: Unassigned
              Reporter: Nawaid Shamim
              Votes: 1
              Watchers: 25
