FLINK-9693

Possible memory leak in jobmanager retaining archived checkpoints

    Details

      Description

      First, some context about the job

      • Flink 1.4.1
      • stand-alone deployment mode
      • embarrassingly parallel: all operators are chained together
      • parallelism is over 1,000
      • stateless except for the Kafka source operators; checkpoint size is 8.4 MB
      • "state.backend.fs.memory-threshold" is set so that only the jobmanager writes checkpoint data to S3
      • internal checkpoints, with 10 checkpoints retained in history

      Summary of the observations

      • 41,567 ExecutionVertex objects retained 9+ GB of memory
      • Expanding one ExecutionVertex shows that it appears to be storing the Kafka offsets for the source operator
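
      To put the numbers above in perspective, here is a rough back-of-envelope calculation. It is only a sketch: it reads "9+ GB" as a lower bound of 9 GiB and "8.4 MB" as 8.4 MiB, both of which are assumptions, and the variable names are made up for illustration.

```python
# Back-of-envelope estimate based on the figures reported in this issue.
# Assumptions: "9+ GB" is treated as at least 9 GiB, "8.4 MB" as 8.4 MiB.
RETAINED_BYTES_LOWER_BOUND = 9 * 1024**3   # memory retained by ExecutionVertex objects
EXECUTION_VERTEX_COUNT = 41_567            # ExecutionVertex instances in the heap dump
CHECKPOINT_SIZE_BYTES = 8.4 * 1024**2      # size of a single checkpoint


def average_retained_per_vertex(total_bytes: float, vertex_count: int) -> float:
    """Average retained bytes per ExecutionVertex."""
    return total_bytes / vertex_count


avg_bytes = average_retained_per_vertex(
    RETAINED_BYTES_LOWER_BOUND, EXECUTION_VERTEX_COUNT
)

# How many single checkpoints' worth of data the retained memory corresponds to.
checkpoint_multiple = RETAINED_BYTES_LOWER_BOUND / CHECKPOINT_SIZE_BYTES

print(f"average retained per ExecutionVertex: {avg_bytes / 1024:.0f} KiB")
print(f"retained memory / checkpoint size:    {checkpoint_multiple:.0f}x")
```

      Under these assumptions, each ExecutionVertex retains a few hundred KiB on average, and the total is roughly three orders of magnitude larger than one checkpoint. That is far more than the 10 checkpoints retained in history would account for.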

        Attachments

        1. ExecutionVertexZoomIn.png
          86 kB
          Steven Zhen Wu
        2. 41K_ExecutionVertex_objs_retained_9GB.png
          176 kB
          Steven Zhen Wu
        3. 20180725_jm_mem_leak.png
          337 kB
          Steven Zhen Wu

              People

              • Assignee: Till Rohrmann (till.rohrmann)
              • Reporter: Steven Zhen Wu (stevenz3wu)