Mesos / MESOS-8202

Eliminate agent failover after resource checkpointing failure

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Currently, when the agent encounters an error while checkpointing its resources to disk, the agent process exits. Now that the master sends an ApplyOperationMessage to the agent to apply operations, we can implement operation feedback on the agent, and the agent no longer needs to terminate unconditionally when checkpointing fails.

      For backward compatibility with older masters, the agent should still terminate if it receives a CheckpointResourcesMessage from the master and an error is encountered while checkpointing.

      However, when checkpointing is attempted in the handler for ApplyOperationMessage, the agent can handle errors by sending a terminal operation update to the master.
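
The two failure paths described above can be sketched roughly as follows. This is only an illustration of the intended decision logic, not Mesos code: `Source`, `Outcome`, `Result`, `Checkpointer`, and `handleCheckpoint` are all hypothetical names introduced here, and the sketch returns a decision instead of actually exiting the process or sending a message to the master.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration; these are NOT real Mesos types.

// Which master message triggered the checkpoint attempt.
enum class Source { CheckpointResourcesMessage, ApplyOperationMessage };

// What the agent decides to do after the checkpoint attempt.
enum class Outcome { Applied, AgentExit, TerminalUpdateSent };

struct Result {
  Outcome outcome;
  std::string detail;  // Error text; empty on success.
};

// Returns std::nullopt on success, or an error message on failure.
using Checkpointer = std::function<std::optional<std::string>()>;

Result handleCheckpoint(Source source, const Checkpointer& checkpoint) {
  const std::optional<std::string> error = checkpoint();

  if (!error) {
    return {Outcome::Applied, ""};
  }

  if (source == Source::CheckpointResourcesMessage) {
    // Legacy path (older master): the agent still terminates on failure.
    // A real agent would exit here; the sketch only reports the decision.
    return {Outcome::AgentExit, *error};
  }

  // ApplyOperationMessage path: stay alive and report the failure to the
  // master through a terminal operation update.
  return {Outcome::TerminalUpdateSent, *error};
}
```

Returning a decision value keeps the branching testable without killing the process. In a real agent the two failure branches would presumably map to the existing exit path and to a new operation status update, respectively.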

          People

            greggomann Greg Mann
            gkleiman Gastón Kleiman
