  1. Flink
  2. FLINK-21986

TaskManager native memory is not released promptly after restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.11.3, 1.12.1, 1.13.0
    • Fix Version/s: 1.11.4, 1.13.0, 1.12.3
    • Environment:

      Flink version: 1.12.1
      Deployment: YARN session
      Job type: mock source -> regular join

      Checkpoint interval: 3 min
      TaskManager memory: 16 GB
       

      Description

      I ran a regular join job on Flink 1.12.1 and found that TaskManager native memory is not released promptly after a restart caused by exceeding the tolerable checkpoint failure threshold.

      Details of the problem job:

      1. The job first restarted because it exceeded the tolerable checkpoint failure threshold.
      2. After that, TaskManagers were killed by YARN many times.
      3. The TM heap is set to 7.68 GB, but heap usage on every TM stayed below 4.2 GB.
      4. Non-heap usage increased after the restart but stayed below 160 MB.
      5. TaskManager process memory increased by 3-4 GB after the restart (the attached figure shows one of the TaskManagers).
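The gap between process memory and the JVM-tracked heap and non-heap figures above is what points at native memory. A minimal sketch of how such a comparison could be taken from inside a JVM on Linux is shown here. The class `NativeMemoryEstimate` and its methods are hypothetical helpers, not part of Flink, and the result is only a rough estimate because resident set size also covers thread stacks, code cache pages, and other JVM internals.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper (not part of Flink) that compares RSS with JVM-tracked memory. */
final class NativeMemoryEstimate {

    /** Returns the VmRSS value in kB from /proc/<pid>/status-style lines, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Typical form: "VmRSS:     123456 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1L;
    }

    /** RSS minus JVM-reported heap and non-heap usage: a rough figure for untracked memory. */
    static long untrackedBytes(long rssBytes, long heapUsedBytes, long nonHeapUsedBytes) {
        return rssBytes - heapUsedBytes - nonHeapUsedBytes;
    }

    public static void main(String[] args) throws IOException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long heap = mem.getHeapMemoryUsage().getUsed();
        long nonHeap = mem.getNonHeapMemoryUsage().getUsed();
        System.out.println("heap used (bytes):     " + heap);
        System.out.println("non-heap used (bytes): " + nonHeap);

        Path status = Paths.get("/proc/self/status"); // only available on Linux
        if (!Files.exists(status)) {
            System.out.println("RSS not available on this platform");
            return;
        }
        long rssKb = parseVmRssKb(Files.readAllLines(status));
        if (rssKb >= 0) {
            long rss = rssKb * 1024L;
            System.out.println("process RSS (bytes):   " + rss);
            System.out.println("untracked (bytes):     " + untrackedBytes(rss, heap, nonHeap));
        }
    }
}
```

A steadily growing "untracked" value while heap and non-heap stay flat would be consistent with the pattern reported in the list above.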

       

      My guess:

      The RocksDB wiki says: "Many of the Java Objects used in the RocksJava API will be backed by C++ objects for which the Java Objects have ownership. As C++ has no notion of automatic garbage collection for its heap in the way that Java does, we must explicitly free the memory used by the C++ objects when we are finished with them."

      So, is it possible that RocksDBStateBackend does not call AbstractNativeReference#close() to release the memory used by RocksDB C++ objects?
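To illustrate the ownership rule quoted from the RocksDB wiki, here is a small sketch using a stand-in class. `FakeNativeHandle` is hypothetical and only simulates a native allocation with a counter; it is not RocksJava code and says nothing about what RocksDBStateBackend actually does. It shows why a wrapper that is dropped without `close()` keeps its native memory until something else (such as a finalizer triggered by GC) releases it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a Java object that owns a native (C++) allocation. */
final class FakeNativeHandle implements AutoCloseable {

    /** Simulated native bytes currently held by all open handles. */
    static final AtomicLong OUTSTANDING_BYTES = new AtomicLong();

    private final long size;
    private final AtomicBoolean owning = new AtomicBoolean(true);

    FakeNativeHandle(long size) {
        this.size = size;
        OUTSTANDING_BYTES.addAndGet(size); // simulate the native allocation
    }

    long size() {
        return size;
    }

    /** Releases the simulated native memory exactly once, even if called repeatedly. */
    @Override
    public void close() {
        if (owning.compareAndSet(true, false)) {
            OUTSTANDING_BYTES.addAndGet(-size);
        }
    }
}

final class NativeHandleDemo {

    /** Deterministic release: try-with-resources calls close() when the block exits. */
    static long useAndClose(long size) {
        try (FakeNativeHandle handle = new FakeNativeHandle(size)) {
            return handle.size();
        }
    }

    /** Leaky pattern: the wrapper becomes garbage, but close() is never called. */
    static long useWithoutClose(long size) {
        FakeNativeHandle handle = new FakeNativeHandle(size);
        return handle.size();
    }

    public static void main(String[] args) {
        useAndClose(1024);
        System.out.println("after useAndClose:     " + FakeNativeHandle.OUTSTANDING_BYTES.get());
        useWithoutClose(1024);
        System.out.println("after useWithoutClose: " + FakeNativeHandle.OUTSTANDING_BYTES.get());
    }
}
```

The tiny Java wrapper puts almost no pressure on the heap, so the garbage collector may not run for a long time even while the native side holds a large allocation.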

      I made a change:

              Actively call System.gc() and System.runFinalization() every minute.

      Then I ran the test again:

      1. TaskManager process memory showed no obvious increase.
      2. The job ran for several days and restarted many times, but no TaskManager was killed by YARN as before.
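The diagnostic change described above could look roughly like the following sketch. The class name `PeriodicGc` and the way it is wired in are assumptions; the report does not show the actual patch. Note that `System.runFinalization()` is deprecated for removal in recent JDKs, so this is only suitable as a debugging aid, not as a fix.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical debugging aid: forces GC and finalization at a fixed interval. */
final class PeriodicGc {

    /** Number of completed forced collections, kept only for observability. */
    static final AtomicInteger RUNS = new AtomicInteger();

    /** Requests a GC and runs pending finalizers once. */
    @SuppressWarnings("removal") // runFinalization() is deprecated for removal in newer JDKs
    static void collectOnce() {
        System.gc();
        System.runFinalization();
        RUNS.incrementAndGet();
    }

    /** Starts a daemon thread that calls collectOnce() every period; caller shuts it down. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "periodic-gc");
                    thread.setDaemon(true); // do not keep the JVM alive for this
                    return thread;
                });
        scheduler.scheduleAtFixedRate(PeriodicGc::collectOnce, period, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Every minute, matching the interval mentioned in the report.
        ScheduledExecutorService scheduler = start(1, TimeUnit.MINUTES);
        Thread.sleep(TimeUnit.MINUTES.toMillis(3));
        scheduler.shutdownNow();
        System.out.println("forced collections: " + RUNS.get());
    }
}
```

If forcing finalization makes the memory growth disappear, as in the test results above, that suggests the native memory is reachable only through finalizers of unreferenced Java objects rather than being released by explicit `close()` calls.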

       

      Summary:

      1. Some native memory is not released promptly after a restart in this situation.
      2. I guess it may be RocksDB C++ objects, but I have not checked this in the RocksDBStateBackend source code.

       

        Attachments

        1. 82544.svg
          92 kB
          Feifan Wang
        2. image-2021-03-25-15-53-44-214.png
          708 kB
          Feifan Wang
        3. image-2021-03-25-16-07-29-083.png
          104 kB
          Feifan Wang
        4. image-2021-03-26-11-46-06-828.png
          107 kB
          Feifan Wang
        5. image-2021-03-26-11-47-21-388.png
          107 kB
          Feifan Wang

              People

              • Assignee: Feifan Wang
              • Reporter: Feifan Wang
              • Votes: 0
              • Watchers: 7
