  1. Flink
  2. FLINK-27396

Reduce the Heartbeat timeout after zookeeper suspended


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0, 1.15.0
    • Fix Version/s: None
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      After FLINK-10052, Flink tolerates a suspended ZooKeeper (zk) connection if `high-availability.zookeeper.client.tolerate-suspended-connections` is enabled. This feature is very useful: it avoids unnecessary Flink job failovers when some zk server nodes crash or during a zk rolling restart.

      Two cases put the zk connection into the SUSPENDED state:

      • The zk server to which the TM/JM is connected is stopped.
      • The TM is cut off by a network partition.

      For the first case, we want Flink to tolerate the suspension. For the second case, we want the TM to fail fast: the JM may already have started a new TM, and if the partitioned TM does not fail, it may process duplicate data (network partitions are complicated). Currently, however, in the second case the TM keeps running until zk is lost (`high-availability.zookeeper.client.session-timeout`, default 60s) or the heartbeat with the JM times out (`heartbeat.timeout`, default 50s).
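To make the timing concrete, here is a minimal sketch of how long a partitioned TM may keep running under the defaults quoted above. The class and method names are hypothetical and not part of Flink. The sketch assumes that both timers start roughly when the partition begins and that the TM stops at whichever one fires first.

```java
import java.time.Duration;

/** Hypothetical helper, not part of Flink: illustrates the timing described above. */
final class PartitionedTmLifetime {

    /**
     * Returns the shorter of the two timeouts, i.e. the point at which the
     * partitioned TM is assumed to stop running.
     */
    static Duration worstCaseLifetime(Duration zkSessionTimeout, Duration heartbeatTimeout) {
        return zkSessionTimeout.compareTo(heartbeatTimeout) <= 0
                ? zkSessionTimeout
                : heartbeatTimeout;
    }
}
```

With the defaults from the description, `worstCaseLifetime(Duration.ofSeconds(60), Duration.ofSeconds(50))` yields 50 seconds, which is the window in which the old TM and a replacement TM may run side by side.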

      Can we reduce `heartbeat.timeout` to 20s while zk is suspended? If zk is suspended and the heartbeat times out, the TM would execute the zk-lost logic.
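The proposal could be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink code: `SuspendedHeartbeatPolicy`, `ZkState` and `Action` are hypothetical names, and the 20s value is the one suggested in this issue. The idea is that a TM whose zk server was merely stopped still receives heartbeats from the JM and keeps running, while a partitioned TM loses both and fails after the shorter timeout.

```java
import java.time.Duration;

/** Hypothetical sketch of the proposal; none of these types exist in Flink. */
final class SuspendedHeartbeatPolicy {

    /** Simplified view of the zk connection as seen by the TM. */
    enum ZkState { CONNECTED, SUSPENDED, LOST }

    /** What the TM does after some time without a heartbeat from the JM. */
    enum Action { KEEP_RUNNING, HANDLE_HEARTBEAT_TIMEOUT, HANDLE_ZK_LOST }

    private final Duration normalTimeout;    // heartbeat.timeout, e.g. 50s
    private final Duration suspendedTimeout; // assumed shorter value, e.g. 20s

    SuspendedHeartbeatPolicy(Duration normalTimeout, Duration suspendedTimeout) {
        if (suspendedTimeout.compareTo(normalTimeout) > 0) {
            throw new IllegalArgumentException("suspended timeout must not exceed normal timeout");
        }
        this.normalTimeout = normalTimeout;
        this.suspendedTimeout = suspendedTimeout;
    }

    /** Uses the shorter timeout only while the zk connection is suspended. */
    Duration effectiveTimeout(ZkState state) {
        return state == ZkState.SUSPENDED ? suspendedTimeout : normalTimeout;
    }

    /** Decides what to do given the zk state and the time since the last heartbeat. */
    Action decide(ZkState state, Duration sinceLastHeartbeat) {
        if (state == ZkState.LOST) {
            return Action.HANDLE_ZK_LOST;
        }
        if (sinceLastHeartbeat.compareTo(effectiveTimeout(state)) < 0) {
            return Action.KEEP_RUNNING;
        }
        // Suspended zk plus a missed heartbeat is treated like a lost zk connection.
        return state == ZkState.SUSPENDED
                ? Action.HANDLE_ZK_LOST
                : Action.HANDLE_HEARTBEAT_TIMEOUT;
    }
}
```

For example, with timeouts of 50s and 20s, a TM in the `SUSPENDED` state that last saw a heartbeat 10s ago keeps running, whereas one that has seen none for 20s runs the zk-lost logic.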

          People

            Assignee: Unassigned
            Reporter: fanrui
