FLINK-9693

Possible memory leak in jobmanager retaining archived checkpoints


    Description

      First, some context about the job

      • Flink 1.4.1
      • stand-alone deployment mode
      • embarrassingly parallel: all operators are chained together
      • parallelism is over 1,000
      • stateless except for the Kafka source operators; checkpoint size is 8.4 MB
      • "state.backend.fs.memory-threshold" is set so that only the jobmanager writes checkpoint data to S3
      • internal checkpoint with 10 checkpoints retained in history
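
To illustrate why raising "state.backend.fs.memory-threshold" moves all S3 writes to the jobmanager, here is a minimal sketch of the size check. It is a simplified model, not Flink's code: the function `choose_handle`, the round parallelism of 1,000, the default of 1024 bytes, and the raised value of 1 MiB are all assumptions made for illustration.

```python
def choose_handle(state_bytes: int, memory_threshold: int) -> str:
    """Hypothetical model: decide where one subtask's state ends up."""
    if state_bytes <= memory_threshold:
        # Small state is sent inline with the checkpoint acknowledgement,
        # so the jobmanager holds the bytes and writes them out itself.
        return "inline"
    # Larger state is written to a file by the task; only a path is reported.
    return "file"


checkpoint_bytes = 8_400_000   # 8.4 MB total, from the report
parallelism = 1000             # report says "over 1,000"; round figure assumed
per_subtask = checkpoint_bytes // parallelism  # average bytes per subtask

default_threshold = 1024       # assumed default, in bytes
raised_threshold = 1024 * 1024 # hypothetical raised value (1 MiB)

print(per_subtask, choose_handle(per_subtask, default_threshold))
print(per_subtask, choose_handle(per_subtask, raised_threshold))
```

Under these assumptions the average per-subtask state is above the default threshold, so each task would write its own file unless the threshold is raised. With the raised value every subtask's state travels inline to the jobmanager, which is also why that state can end up held in jobmanager memory.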

       

      Summary of the observations

      • 41,567 ExecutionVertex objects retained 9+ GB of memory
      • Expanding one ExecutionVertex shows that it appears to be storing the Kafka offsets for the source operator
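
A quick back-of-the-envelope calculation puts these numbers in proportion. This is only a sketch: it takes "9+ GB" as a 9 GiB lower bound and assumes a round parallelism of 1,000, so the results are rough bounds rather than measurements.

```python
retained_vertices = 41_567       # from the heap dump observation
retained_bytes = 9 * 1024**3     # "9+ GB" treated as a 9 GiB lower bound
parallelism = 1000               # report says "over 1,000"; round figure assumed

# Average retained memory per ExecutionVertex (a lower bound).
bytes_per_vertex = retained_bytes / retained_vertices

# With all operators chained, one copy of the job graph should need roughly
# one ExecutionVertex per parallel subtask (assumption), so this ratio is an
# upper bound on how many full sets of vertices are being retained.
vertex_sets = retained_vertices / parallelism

print(f"{bytes_per_vertex / 1024:.0f} KiB per ExecutionVertex")
print(f"{vertex_sets:.1f} full sets of vertices")
```

Under these assumptions each ExecutionVertex retains a couple of hundred KiB on average, and the heap holds several dozen times more vertices than a single running copy of the job would need.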

      Attachments

        1. ExecutionVertexZoomIn.png
          86 kB
          Steven Zhen Wu
        2. 41K_ExecutionVertex_objs_retained_9GB.png
          176 kB
          Steven Zhen Wu
        3. 20180725_jm_mem_leak.png
          337 kB
          Steven Zhen Wu


            People

              trohrmann Till Rohrmann
              stevenz3wu Steven Zhen Wu
