Description
Currently, when the agent detects that the host was rebooted, it does not recover agent info. However, the new agent info is not checkpointed until the agent successfully registers with a master. If the agent crashes before registering, then on restart it no longer sees a boot ID mismatch (the new boot ID has already been checkpointed) and recovers the old agent info that was checkpointed before the host reboot.
This can lead to problems. For example, the agent may flap due to incompatible agent info if its resources change after the reboot, or the use of the old agent ID during reregistration may cause crashes like MESOS-7432.
One option is to remove the "latest" symlink when we detect that the current boot ID differs from the checkpointed one, so that the agent cannot recover stale info after the new boot ID is checkpointed. Alternatively, we can postpone checkpointing the boot ID until the new agent info has been checkpointed.
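The first option can be sketched as follows. This is a simplified illustration in Python, not the agent's actual C++ code: the function name `recover_after_boot`, the `boot_id` file, and the `slaves/latest` path under the meta directory are assumptions made for the example. The point it shows is the ordering: the symlink is removed before the new boot ID is written, so a crash between the two steps still leaves a detectable mismatch on the next start.

```python
import os


def recover_after_boot(meta_dir, current_boot_id):
    """Return True if checkpointed agent info may be recovered.

    Hypothetical layout (an assumption for this sketch):
      <meta_dir>/boot_id        checkpointed boot ID
      <meta_dir>/slaves/latest  symlink to the latest agent directory
    """
    boot_id_path = os.path.join(meta_dir, "boot_id")
    latest = os.path.join(meta_dir, "slaves", "latest")

    checkpointed = None
    if os.path.exists(boot_id_path):
        with open(boot_id_path) as f:
            checkpointed = f.read().strip()

    if checkpointed == current_boot_id:
        # Same boot: recover only if the symlink is still there.
        return os.path.lexists(latest)

    # Reboot detected (or no boot ID checkpointed yet).
    # Step 1: drop the symlink so stale agent info cannot be recovered.
    if os.path.lexists(latest):
        os.unlink(latest)

    # Step 2: only now checkpoint the new boot ID, via write-then-rename
    # so a crash never leaves a partially written file.
    tmp_path = boot_id_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(current_boot_id)
    os.replace(tmp_path, boot_id_path)

    return False
```

With this ordering, a restart after a crash that happened before registration sees a matching boot ID but no "latest" symlink, so it starts as a fresh agent instead of recovering the pre-reboot info. The old agent directory itself is left in place; only the symlink is removed.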
Issue Links
- relates to MESOS-7432 Agent state can become corrupted (Open)