Flink / FLINK-21986

TaskManager native memory not released promptly after restart


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.11.3, 1.12.1, 1.13.0
    • Fix Version/s: 1.11.4, 1.13.0, 1.12.3
    • Environment:
      Flink version: 1.12.1
      Run mode: YARN session
      Job type: mock source -> regular join
      Checkpoint interval: 3m
      TaskManager memory: 16G

    Description

      I ran a regular join job on Flink 1.12.1 and found that TaskManager native memory is not released promptly after a restart caused by exceeding the tolerable checkpoint failure threshold.

      Information about the problem job:

      1. The first restart was caused by exceeding the tolerable checkpoint failure threshold.
      2. After that, TaskManagers were killed by YARN many times.
      3. In this case the TM heap is set to 7.68G, but the heap size of every TM stays under 4.2G.
      4. Non-heap size increases after the restart but stays under 160M.
      5. TaskManager process memory increases by 3-4G after the restart (the figure shows one of the TaskManagers).
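The observations above imply a gap between what the JVM reports and what the process actually holds. A rough sketch of that arithmetic follows. `NativeMemoryEstimate` and `unaccountedBytes` are hypothetical names, not part of Flink, and the resident set size used in `main` is an illustrative placeholder rather than a measurement from this job. The real value would have to come from the operating system or from YARN.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper (not part of Flink): estimates process memory
// that JVM heap and non-heap accounting does not explain.
final class NativeMemoryEstimate {

    /** Bytes of process memory not covered by JVM heap or non-heap memory. */
    static long unaccountedBytes(long processRssBytes, long heapBytes, long nonHeapBytes) {
        return processRssBytes - heapBytes - nonHeapBytes;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // Committed memory is what the JVM has reserved from the OS.
        long heap = memory.getHeapMemoryUsage().getCommitted();
        long nonHeap = memory.getNonHeapMemoryUsage().getCommitted();

        // Placeholder only: substitute the RSS reported by the OS for this process.
        long assumedRss = heap + nonHeap + 1024L * 1024 * 1024;

        System.out.println("heap committed:     " + heap);
        System.out.println("non-heap committed: " + nonHeap);
        System.out.println("unaccounted:        " + unaccountedBytes(assumedRss, heap, nonHeap));
    }
}
```

If the unaccounted figure keeps growing across restarts while heap and non-heap stay flat, native allocations are the likely source, which is consistent with the numbers reported here.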

       

      My guess:

      The RocksDB wiki says: "Many of the Java Objects used in the RocksJava API will be backed by C++ objects for which the Java Objects have ownership. As C++ has no notion of automatic garbage collection for its heap in the way that Java does, we must explicitly free the memory used by the C++ objects when we are finished with them."

      So is it possible that RocksDBStateBackend does not call AbstractNativeReference#close() to release the memory used by RocksDB C++ objects?
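To illustrate the ownership rule quoted from the RocksDB wiki, here is a minimal sketch of the explicit-close pattern. `FakeNativeHandle` is a hypothetical stand-in for a native-backed object. It does not use the RocksJava API and does not show what RocksDBStateBackend does. The only assumption about RocksJava is that `AbstractNativeReference` implements `AutoCloseable`, so try-with-resources applies to it in the same way.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a Java object that owns a C++ object.
final class FakeNativeHandle implements AutoCloseable {
    private final AtomicBoolean released = new AtomicBoolean(false);

    boolean isReleased() {
        return released.get();
    }

    @Override
    public void close() {
        // A real native-backed object would free its C++ memory here.
        released.set(true);
    }
}

final class ExplicitCloseDemo {

    /** Uses a handle inside try-with-resources and returns it after the block. */
    static FakeNativeHandle useAndClose() {
        FakeNativeHandle handle = new FakeNativeHandle();
        try (FakeNativeHandle h = handle) {
            // The handle is still open while the body runs.
            if (h.isReleased()) {
                throw new IllegalStateException("released too early");
            }
        }
        // close() has run by now, without waiting for the garbage collector.
        return handle;
    }

    public static void main(String[] args) {
        System.out.println("released after block: " + useAndClose().isReleased());
    }
}
```

A handle that is merely dropped, without `close()`, keeps its native memory until the collector finalizes the Java wrapper, and a small wrapper puts little pressure on the heap to trigger that.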

      I made a change:

              Actively call System.gc() and System.runFinalization() once every minute.
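The change described above could look roughly like the following sketch. `PeriodicGcWorkaround` and its method names are hypothetical, and this is not the patch that was applied to Flink. It is a diagnostic workaround under the assumption that a daemon thread inside the TaskManager JVM is acceptable. Note that `System.runFinalization()` is deprecated for removal in recent JDKs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical workaround: periodically force GC and finalization so that
// unreachable wrappers of native objects are cleaned up sooner.
final class PeriodicGcWorkaround {

    /** Starts a daemon thread that runs collect() at a fixed rate. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "periodic-gc");
                    thread.setDaemon(true); // must not keep the JVM alive
                    return thread;
                });
        scheduler.scheduleAtFixedRate(PeriodicGcWorkaround::collect, period, period, unit);
        return scheduler;
    }

    @SuppressWarnings("removal") // runFinalization() is deprecated in newer JDKs
    static void collect() {
        System.gc();              // a request, not a guarantee
        System.runFinalization(); // run pending finalizers
    }

    public static void main(String[] args) {
        // Once per minute, as in the experiment described in this issue.
        ScheduledExecutorService scheduler = start(1, TimeUnit.MINUTES);
        collect(); // one immediate pass for demonstration
        scheduler.shutdownNow();
    }
}
```

Forcing collections this way only hides the symptom. If memory use stabilizes, that points at native resources being freed by finalization instead of by an explicit close.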

      I then ran the test again:

      1. TaskManager process memory shows no obvious increase.
      2. The job ran for several days and restarted many times, but no TaskManager was killed by YARN as before.

       

      Summary:

      1. First, some native memory is not released promptly after a restart in this situation.
      2. I guess it may be RocksDB C++ objects, but I have not checked this against the source code of RocksDBStateBackend.

       

      Attachments

        1. 82544.svg
          92 kB
          Feifan Wang
        2. image-2021-03-26-11-47-21-388.png
          107 kB
          Feifan Wang
        3. image-2021-03-26-11-46-06-828.png
          107 kB
          Feifan Wang
        4. image-2021-03-25-16-07-29-083.png
          104 kB
          Feifan Wang
        5. image-2021-03-25-15-53-44-214.png
          708 kB
          Feifan Wang


          People

            Assignee: Feifan Wang
            Reporter: Feifan Wang
            Votes: 0
            Watchers: 11

            Dates

              Created:
              Updated:
              Resolved:
