Description
As currently implemented, on recovery the Mesos agent determines whether an executor was generated for a command task by comparing the executor command with the current path to the Mesos executor:
https://github.com/apache/mesos/blob/1.7.x/src/slave/slave.cpp#L9635
During an upgrade of a production cluster, we observed this check break because the new launcher_dir differed from the one recorded for the checkpointed executor.
This can cause problems of various kinds: for example, after such an upgrade, the Mesos master can begin to treat the checkpointed command executors as subject to resource quota.
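
To illustrate the failure mode, here is a minimal standalone sketch of a path-based check of this kind. It is not the actual code in slave.cpp: the helper names, the `kExecutorBinary` constant, and the example paths in the note that follows are all hypothetical, and the real check may differ in detail.

```cpp
#include <string>

// Hypothetical stand-in for the file name of the Mesos executor binary.
const std::string kExecutorBinary = "mesos-executor";

// Returns true if `value` ends with `suffix`.
bool endsWith(const std::string& value, const std::string& suffix)
{
  return value.size() >= suffix.size() &&
         value.compare(
             value.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Simplified recovery-time heuristic (an assumption, not the Mesos source):
// an executor is treated as generated for a command task only if its
// checkpointed command ends with "<launcher_dir>/<executor binary>", where
// `launcherDir` is the value the agent is running with NOW.
bool looksLikeCommandExecutor(
    const std::string& checkpointedCommand,
    const std::string& launcherDir)
{
  return endsWith(checkpointedCommand, launcherDir + "/" + kExecutorBinary);
}
```

With a checkpointed command of `/opt/mesos-1.6/libexec/mesos/mesos-executor`, the check passes while launcher_dir is `/opt/mesos-1.6/libexec/mesos`, but fails once the upgraded agent starts with `/opt/mesos-1.7/libexec/mesos`, even though the executor itself has not changed.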
Design considerations:
- The proper solution is to checkpoint a flag indicating whether the executor is a command/docker executor.
- To upgrade correctly from older Mesos versions, we will need some kind of workaround to detect command executors whose checkpoints predate the flag; the workaround logic should be skipped whenever a checkpointed flag is present.
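
The two considerations above could fit together roughly as follows. This is only a sketch under stated assumptions: `CheckpointedExecutor`, its fields, and the `legacyHeuristic` callback are hypothetical names rather than existing Mesos types, and `std::optional` (C++17) is used for brevity. The workaround itself is deliberately left as a pluggable callback, since its exact form is not decided here.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical view of an executor's checkpointed state; this is not the
// actual Mesos data structure.
struct CheckpointedExecutor
{
  std::string command;

  // Empty for executors checkpointed by an agent that predates the flag.
  std::optional<bool> generatedForCommandTask;
};

// Prefers the checkpointed flag. The legacy workaround (passed in as
// `legacyHeuristic`) is consulted only when no flag was checkpointed.
bool isGeneratedForCommandTask(
    const CheckpointedExecutor& executor,
    const std::function<bool(const std::string&)>& legacyHeuristic)
{
  if (executor.generatedForCommandTask.has_value()) {
    return *executor.generatedForCommandTask;
  }

  return legacyHeuristic(executor.command);
}
```

Keeping the flag optional lets the agent distinguish "checkpointed as false" from "never checkpointed", which is what allows the workaround to be skipped for executors launched after the upgrade.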