FLINK-23209

Timeout heartbeat if the heartbeat target is no longer reachable

Details

      Flink can now detect dead TaskManagers by counting consecutive failed heartbeat RPCs. The number of failures after which a TaskManager is marked as unreachable can be configured via `heartbeat.rpc-failure-threshold`. This can significantly speed up the detection of dead TaskManagers.

    Description

      With FLINK-23202, it should now be possible to detect when a remote RPC endpoint is no longer reachable. The HeartbeatManager can use this to mark a heartbeat target as no longer reachable. That way, Flink can react faster to the loss of a component without having to wait for the heartbeat timeout to expire. This will result in faster recoveries (e.g. if a TaskExecutor dies).
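The idea of marking a target unreachable after a number of consecutive failed heartbeat RPCs could be sketched roughly as follows. This is a minimal illustration only: `RpcFailureCounter` and its methods are hypothetical names, not Flink's actual HeartbeatManager code, and the sketch assumes a single-threaded caller and a threshold of at least 1.

```java
/** Hypothetical sketch for illustration; this is not Flink's implementation. */
final class RpcFailureCounter {

    private final int failureThreshold;
    private int consecutiveFailures;
    private boolean unreachable;

    RpcFailureCounter(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    /** A successful heartbeat RPC ends the current streak of failures. */
    void reportSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records one failed heartbeat RPC.
     *
     * @return true only for the failure that newly marks the target unreachable
     */
    boolean reportFailure() {
        if (unreachable) {
            return false; // already marked; do not notify twice
        }
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            unreachable = true;
            return true;
        }
        return false;
    }

    boolean isUnreachable() {
        return unreachable;
    }
}
```

Because a success resets the counter, isolated RPC failures on a flaky connection do not accumulate; only an unbroken run of failures reaches the threshold.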

      This change gives us better control over the trade-off between detecting dead TaskManagers quickly and tolerating an unstable or overloaded network where heartbeat messages are delayed.
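To make the trade-off concrete, here is a rough back-of-the-envelope calculation. It assumes one heartbeat RPC per heartbeat interval and that each failed RPC is reported immediately, which will not hold for every kind of network failure. The interval and timeout values are illustrative numbers picked for this example, and `HeartbeatDetectionEstimate` is a hypothetical helper, not part of Flink.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; it is not part of Flink. */
final class HeartbeatDetectionEstimate {

    /**
     * Rough detection time via consecutive RPC failures: one heartbeat is
     * sent per interval, and every failed RPC is assumed to be reported
     * immediately (for example, a refused connection).
     */
    static Duration viaFailureThreshold(Duration heartbeatInterval, int failureThreshold) {
        return heartbeatInterval.multipliedBy(failureThreshold);
    }

    public static void main(String[] args) {
        // Illustrative values chosen for this example, not read from any configuration.
        Duration interval = Duration.ofSeconds(10);
        Duration timeout = Duration.ofSeconds(50);

        for (int threshold : new int[] {1, 2, 5}) {
            Duration viaRpc = viaFailureThreshold(interval, threshold);
            System.out.println(
                    "threshold=" + threshold
                            + " -> roughly " + viaRpc.getSeconds() + "s via RPC failures"
                            + " vs " + timeout.getSeconds() + "s via heartbeat timeout");
        }
    }
}
```

Under these assumptions, a low threshold reacts within a few heartbeat intervals but risks marking a healthy TaskManager as unreachable after a brief network hiccup, while a higher threshold tolerates more transient failures at the cost of slower detection.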

            People

              trohrmann Till Rohrmann