Currently, subscription (and re-subscription) is not atomic.
It consists of three steps performed by two actors:
- Validating the supplied FrameworkInfo against the master state (which possibly includes an existing FrameworkInfo)
- Authorizing the (re-)subscribing framework
- Applying the update
A partitioned or buggy framework (or one that is both) can trigger a race by sending two SUBSCRIBE calls with differing FrameworkInfos on master failover.
One possible sequence of events:
1. FrameworkInfo A is validated by the master (which has no data about this framework)
2. A conflicting FrameworkInfo B is validated by the master (which still stores no data about this framework, as Scheduler A is not even authorized yet)
3. Scheduler A is authorized
4. Scheduler B is authorized
5. FrameworkInfo A is applied
6. The master attempts to apply FrameworkInfo B, which is no longer valid after the previous step.
One simple example is an attempt to re-subscribe with two different principals: currently, Scheduler B's principal is silently ignored at step 6 (instead of a validation error being sent to B).
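The interleaving above can be reproduced in a toy model. The sketch below is not the Mesos master code: `ToyMaster`, its three methods, the `FrameworkInfo` dataclass, and the error string are all hypothetical names chosen for illustration, authorization is assumed to always succeed, and the "silently ignore" behavior in `apply` is modeled only on the description in this ticket.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FrameworkInfo:
    # Hypothetical, minimal stand-in for the real FrameworkInfo message.
    framework_id: str
    principal: str


class ToyMaster:
    """Illustrative model of the three non-atomic subscription steps."""

    def __init__(self) -> None:
        # Empty right after failover: nothing is known about any framework.
        self.frameworks: Dict[str, FrameworkInfo] = {}

    def validate(self, info: FrameworkInfo) -> Optional[str]:
        # Step 1: check the supplied info against the current master state.
        existing = self.frameworks.get(info.framework_id)
        if existing is not None and existing.principal != info.principal:
            return "principal cannot be changed"
        return None

    def authorize(self, info: FrameworkInfo) -> bool:
        # Step 2: asynchronous in a real system; assumed to succeed here.
        return True

    def apply(self, info: FrameworkInfo) -> None:
        # Step 3: store the info. A conflicting principal on an already
        # known framework is dropped without any error being reported.
        if info.framework_id not in self.frameworks:
            self.frameworks[info.framework_id] = info


def run_race() -> Tuple[List[Optional[str]], str, Optional[str]]:
    master = ToyMaster()
    a = FrameworkInfo("fw-1", "principal-a")
    b = FrameworkInfo("fw-1", "principal-b")

    # Steps 1-2: both infos validate against an empty master state.
    early_errors = [master.validate(a), master.validate(b)]
    # Steps 3-4: both schedulers are authorized.
    assert master.authorize(a) and master.authorize(b)
    # Steps 5-6: A is applied, then B is applied although it is now invalid.
    master.apply(a)
    master.apply(b)

    # Validating B again at this point would have caught the conflict.
    late_error = master.validate(b)
    return early_errors, master.frameworks["fw-1"].principal, late_error


if __name__ == "__main__":
    print(run_race())
```

In this model both validations pass because they run before either apply, so B never receives an error, while the same validation run after step 5 rejects B.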
At the time of writing, I am not sure whether this race causes other problems.