Flink / FLINK-19293

RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore


Details

    Description

      Hi Guys,

I am seeing some strange behaviour that may be a bug, or may simply be intended behaviour.

We are running a Flink job on a 1.10.1 cluster with 1 JobManager and 2 TaskManagers, at parallelism 4. The job itself is simple:

      1. Source: kinesis connector reading from a single shard stream
      2. CEP: ~25 CEP Keyed Pattern operators watching the event stream for different kinds of behaviour. Every pattern has ".withinSeconds(xxxx)" applied, so nothing is configured to accumulate state indefinitely.
      3. Sink: Single operator writing messages to SQS (custom code)

      We are seeing the checkpoint size grow constantly until the job is restarted using a savepoint/restore. The size continues to grow past the point that the ".withinSeconds(xxxx)" limits should cause old data to be discarded. The growth is also out of proportion to the general platform growth (which is actually trending down at the moment due to COVID).

      I've attached a snapshot from our monitoring dashboard below. You can see the huge drops in state_size on a savepoint/restore.

      Our state configuration is as follows:

      Backend: RocksDB
      Mode: EXACTLY_ONCE
      Max Concurrent: 1
      Externalised Checkpoints: RETAIN_ON_CANCELLATION
      Async: TRUE
      Incremental: TRUE
      TTL Compaction Filter enabled: TRUE

      We are worried that the CEP library may be leaking state somewhere, leaving some objects not cleaned up. Unfortunately I can't share one of these checkpoints with the community due to the sensitive nature of the data contained within, but if anyone has any suggestions for how I could analyse the checkpoints to look for leaks, please let me know.
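Since the checkpoints themselves cannot be shared, one low-tech way to narrow down where the growth lives is to compare size breakdowns of the externalized checkpoint directory over time. Below is a minimal sketch; the path and the assumption that incremental RocksDB checkpoints split into a `shared/` directory (SST files referenced across checkpoints) plus per-checkpoint `chk-N/` directories are mine, not from this issue:

```python
import os
from collections import defaultdict

def checkpoint_size_breakdown(checkpoint_root):
    """Sum file sizes under each top-level entry of a checkpoint directory.

    For an incremental RocksDB checkpoint the interesting split is usually
    'shared/' (SST files reused across checkpoints) versus the per-checkpoint
    'chk-N/' directories: a 'shared/' total that only ever grows suggests SST
    files that are never dereferenced, i.e. state that is never cleaned up.
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(checkpoint_root):
        rel = os.path.relpath(dirpath, checkpoint_root)
        top = rel.split(os.sep)[0] if rel != "." else "."
        for name in filenames:
            totals[top] += os.path.getsize(os.path.join(dirpath, name))
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical path -- replace with your externalized checkpoint directory.
    for entry, size in sorted(checkpoint_size_breakdown("/tmp/flink-checkpoints").items()):
        print(f"{entry}: {size / 1024 / 1024:.1f} MiB")
```

Running this against successive checkpoints and diffing the per-directory totals would at least show whether the growth is concentrated in shared SST files (pointing at state never being released) or in the per-checkpoint metadata.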

      Thanks in advance for the help

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Thomas Wozniakowski (Jamalarm)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
