We have observed the following in our environment.
This happens after the registration of an agent whose agent ID has changed because it lost its local state.
The failing CHECK is in Master::removeOperation.
The masters enter a crash loop unless the operation checkpoint state (i.e., resources_and_operations.state) on the offending agent is deleted.
Even though we try to minimize the cases where an agent loses its state, this can still happen when the latest symlink is removed, either by an operator or automatically in certain cases.