Affects Version/s: None
Fix Version/s: None
Consider this scenario:
- Mesos cluster with 3 masters and 1 agent.
- 2 of the masters (including the leader) are upgraded to Mesos 1.4; remaining master stays at Mesos 1.3 (e.g., due to operator error).
- Agent is upgraded to Mesos 1.4.
- Framework creates a reservation refinement on the agent.
- Leading master fails; the Mesos 1.3 master is elected as the new leader.
In this scenario, the agent sends its resources to the new leading master in the new (post-refinement) format, but the Mesos 1.3 master does not understand the new fields. As a result, the master's view of the agent's resources is inconsistent with the agent's actual resources. This could lead to various problems: in effect, the reservation the framework previously made has been "forgotten" during master failover. Similarly, if an unreserve operation is attempted using the master's version of the resource, the agent will reject it.
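The mismatch can be illustrated with a toy model. This is only a sketch: the field names (`role`, `reservations`) and the `old_master_view` helper are illustrative assumptions, not the actual Mesos protobuf schema or master code.

```python
# Toy model of an older master that only interprets the fields it knows.
# Field names here are illustrative assumptions, not the real Mesos schema.
KNOWN_FIELDS_OLD_MASTER = {"name", "value", "role"}


def old_master_view(resource: dict) -> dict:
    """Return the subset of a resource that the older master can interpret."""
    return {k: v for k, v in resource.items() if k in KNOWN_FIELDS_OLD_MASTER}


# Resource as reported by the upgraded agent, carrying a refined reservation.
agent_resource = {
    "name": "cpus",
    "value": 2,
    "role": "*",
    "reservations": [{"role": "eng"}, {"role": "eng/dev"}],
}

master_resource = old_master_view(agent_resource)

# The two sides now disagree about what the resource is, so an operation
# built from the master's view would not match the agent's own resource.
mismatch = master_resource != agent_resource
```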
To fix this, it seems we need an explicit negotiation between the agent and the master as part of registration/re-registration. The agent would examine its resources and determine which capabilities it requires of the master (not just the capabilities the agent itself supports); if the master does not support all of those capabilities, the agent cannot safely register.
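A minimal sketch of the agent-side check, under stated assumptions: the capability name, the dict-based resource shape, and both function names are hypothetical, and a reservation is treated as "refined" when its reservation stack has more than one entry.

```python
from typing import Iterable, Set

# Hypothetical capability name used for illustration only.
RESERVATION_REFINEMENT = "RESERVATION_REFINEMENT"


def required_master_capabilities(resources: Iterable[dict]) -> Set[str]:
    """Derive the capabilities the master must have from the agent's resources."""
    required: Set[str] = set()
    for resource in resources:
        # Assumption: more than one entry in the stack means a refinement.
        if len(resource.get("reservations", [])) > 1:
            required.add(RESERVATION_REFINEMENT)
    return required


def can_register(resources: Iterable[dict], master_capabilities: Iterable[str]) -> bool:
    """The agent registers only if the master supports everything it requires."""
    return required_master_capabilities(resources) <= set(master_capabilities)
```

With this shape, an agent that holds no refined reservations requires nothing of the master and can still register with an older master.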
We could implement this in one of two ways. With master capabilities, the agent computes the master capabilities it requires and declines to register if the master isn't new enough. With agent capabilities, the agent tells the master which capabilities it is "actively using", and the master refuses to register any agent that is using a capability the master doesn't recognize/support. The former is probably safer/cleaner.
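For comparison, the agent-capabilities variant might look like the sketch below. The `admit_agent` function and the capability strings are hypothetical. Note that this variant only helps if the older master already contains such a check, since the rejection happens on the master side.

```python
from typing import Iterable, Set, Tuple


def admit_agent(active_capabilities: Iterable[str],
                supported_capabilities: Set[str]) -> Tuple[bool, Set[str]]:
    """Master-side check: reject an agent that is actively using any
    capability this master does not recognize.

    Returns (admitted, unsupported_capabilities).
    """
    unsupported = set(active_capabilities) - supported_capabilities
    return (not unsupported, unsupported)


# An older master (no known capabilities) facing an agent that is
# actively using a hypothetical RESERVATION_REFINEMENT capability.
admitted, unsupported = admit_agent({"RESERVATION_REFINEMENT"}, set())
```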