Mesos / MESOS-8353

Duplicate task for same framework on multiple agents crashes out master after failover


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      We have seen a Mesos master enter a crash loop after a leader failover. On further investigation, it appears that the same task ID was created on multiple Mesos agents in the cluster.

      One possible sequence of events that could lead to this problem:

      1. Task T1 was launched through master M1 on agent A1 for framework F;
      2. Master M1 failed over to M2;
      3. Before A1 reregistered with M2, the same T1 was launched on agent A2: M2 did not yet know about the previous T1, so it accepted the task and sent it to A2;
      4. A1 reregistered: this probably crashed M2 (because the same task cannot be added twice);
      5. When M3 tried to come up after M2, it crashed as well, because both A1 and A2 tried to add T1 to the framework.

      (Right now I only have logs that confirm the last step.)
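
      The sequence above can be replayed with a toy model. This is only a sketch under stated assumptions, not Mesos code: `ToyMaster`, `DuplicateTaskError`, and `replay` are hypothetical names, and the exception merely stands in for whatever fatal check aborts the real master when a task ID is added twice.

```python
class DuplicateTaskError(Exception):
    """Stands in for the fatal failure that aborts the master (assumption)."""


class ToyMaster:
    """Toy model of a master's in-memory task bookkeeping (not Mesos code)."""

    def __init__(self, name):
        self.name = name
        # (framework_id, task_id) -> agent_id; empty right after a failover.
        self.tasks = {}

    def add_task(self, framework_id, task_id, agent_id):
        key = (framework_id, task_id)
        if key in self.tasks:
            raise DuplicateTaskError(
                f"{self.name}: task {task_id} of framework {framework_id} "
                f"already on {self.tasks[key]}, cannot add from {agent_id}"
            )
        self.tasks[key] = agent_id

    def launch(self, framework_id, task_id, agent_id):
        # A launch is accepted whenever the master has no record of the ID.
        self.add_task(framework_id, task_id, agent_id)

    def reregister_agent(self, agent_id, agent_tasks):
        # The agent reports its running tasks; the master re-adds each one.
        for framework_id, task_id in agent_tasks:
            self.add_task(framework_id, task_id, agent_id)


def replay():
    # Steps 1-2: T1 runs on A1 under M1, then M2 takes over with empty state.
    a1_tasks = [("F", "T1")]
    m2 = ToyMaster("M2")

    # Step 3: the framework retries T1 before A1 has reregistered.
    m2.launch("F", "T1", "A2")
    a2_tasks = [("F", "T1")]

    # Step 4: A1 reregisters and reports the original T1.
    try:
        m2.reregister_agent("A1", a1_tasks)
        m2_crashed = False
    except DuplicateTaskError:
        m2_crashed = True

    # Step 5: M3 starts empty, and both agents report T1.
    m3 = ToyMaster("M3")
    m3.reregister_agent("A2", a2_tasks)
    try:
        m3.reregister_agent("A1", a1_tasks)
        m3_crashed = False
    except DuplicateTaskError:
        m3_crashed = True

    return m2_crashed, m3_crashed
```

      In this model every newly elected master hits the same failure as soon as the second agent reregisters, which would be consistent with a crash loop that persists until one of the duplicate tasks goes away.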

      This happened on 1.4.0 masters.

      Although this was probably triggered by incorrect retry logic on the framework side, I wonder whether the Mesos master should add extra protection to prevent this from happening. One possible idea is to instruct one of the agents carrying tasks with a duplicate ID to terminate the corresponding tasks, or simply to refuse to reregister such agents and instruct them to shut down.
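
      A minimal sketch of the two options, again as a toy model and not a proposal for the actual master code: `plan_reregistration` and its `policy` values are hypothetical, and the function assumes the master can see which agent already holds each (framework ID, task ID) pair.

```python
def plan_reregistration(known_tasks, agent_id, reported_tasks, policy="kill"):
    """Decide how to handle a reregistering agent without aborting.

    known_tasks:    dict mapping (framework_id, task_id) -> agent_id
    reported_tasks: list of (framework_id, task_id) reported by the agent
    policy:         "kill" terminates only the duplicate tasks on this agent;
                    "shutdown" refuses the whole agent if any duplicate exists

    Returns (accepted, to_kill, shutdown_agent).
    """
    if policy not in ("kill", "shutdown"):
        raise ValueError(f"unknown policy: {policy}")

    # A task is a duplicate if another agent already holds the same ID.
    duplicates = [
        key for key in reported_tasks
        if known_tasks.get(key, agent_id) != agent_id
    ]

    if duplicates and policy == "shutdown":
        # Refuse the agent entirely; nothing from it is recorded.
        return [], [], True

    accepted = [key for key in reported_tasks if key not in duplicates]
    for key in accepted:
        known_tasks[key] = agent_id
    return accepted, duplicates, False
```

      With `policy="kill"` the agent keeps its other tasks and only the conflicting copies are terminated; with `policy="shutdown"` the agent is rejected as a whole, which is simpler but loses every task running on it.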

            People

              Assignee: Unassigned
              Reporter: Zhitao Li (zhitao)