Flink / FLINK-8541

Mesos RM should recover from failover timeout

Details

    Description

      When a framework disconnects unexpectedly from Mesos, its Mesos tasks continue to run for a configurable period known as the failover timeout. If the framework reconnects after the timeout has expired, Mesos rejects the connection attempt. The framework is then expected to discard its previous framework ID and connect as a new framework.

      When Flink is in this situation, the only recourse is to manually delete the ZooKeeper (ZK) state where the framework ID is kept. Let's improve the logic of the Mesos RM to automate that.
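
      A minimal sketch of the proposed recovery flow, written in Python for brevity rather than in Flink's own languages. Every name here is hypothetical: `FakeMaster` is a toy model of the Mesos master's failover-timeout behavior, `InMemoryIdStore` stands in for the ZK node that holds the framework ID, and `FrameworkRemovedError` represents the rejection described above. None of these are Flink or Mesos APIs; the point is only the control flow in `connect_with_recovery`: on rejection, discard the stored ID and register as a new framework.

```python
from typing import Dict, Optional


class FrameworkRemovedError(Exception):
    """Hypothetical error: the master rejected an expired framework ID."""


class InMemoryIdStore:
    """Stand-in for the ZK state that persists the framework ID."""

    def __init__(self) -> None:
        self._framework_id: Optional[str] = None

    def get(self) -> Optional[str]:
        return self._framework_id

    def set(self, framework_id: str) -> None:
        self._framework_id = framework_id

    def clear(self) -> None:
        self._framework_id = None


class FakeMaster:
    """Toy model of the master; maps framework ID -> disconnect time (None if connected)."""

    def __init__(self, failover_timeout_s: float) -> None:
        self.failover_timeout_s = failover_timeout_s
        self._frameworks: Dict[str, Optional[float]] = {}
        self._next_id = 0

    def disconnect(self, framework_id: str, now: float) -> None:
        self._frameworks[framework_id] = now

    def register(self, framework_id: Optional[str], now: float) -> str:
        if framework_id is None:
            # No previous ID: register as a brand-new framework.
            self._next_id += 1
            framework_id = f"framework-{self._next_id}"
            self._frameworks[framework_id] = None
            return framework_id
        if framework_id not in self._frameworks:
            raise FrameworkRemovedError(framework_id)
        disconnected_at = self._frameworks[framework_id]
        if disconnected_at is not None and now - disconnected_at > self.failover_timeout_s:
            # Failover timeout expired: the old ID is gone for good.
            del self._frameworks[framework_id]
            raise FrameworkRemovedError(framework_id)
        self._frameworks[framework_id] = None  # reconnected in time
        return framework_id


def connect_with_recovery(master: FakeMaster, store: InMemoryIdStore, now: float) -> str:
    """Try the stored ID first; if it was rejected, discard it and start fresh."""
    try:
        framework_id = master.register(store.get(), now)
    except FrameworkRemovedError:
        store.clear()  # automates the manual "delete the ZK state" step
        framework_id = master.register(None, now)
    store.set(framework_id)
    return framework_id
```

      In this sketch, a reconnect inside the failover timeout keeps the same framework ID, while a reconnect after it clears the stored ID and obtains a new one. A real implementation would also have to decide what happens to tasks and workers recorded under the old framework ID, which the sketch does not model.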

          People

            Assignee: Unassigned
            Reporter: Eron Wright (eronwright)
            Votes: 0
            Watchers: 3
