The major version should change when an older version of the software should not try to use the state store. If we only bump the minor version then the old software will happily use the state store because all schemas with the same major version are "compatible."
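To make the compatibility rule concrete, here is a minimal sketch of the check as I understand it. `SchemaVersion` and `isCompatible` are hypothetical names made up for illustration, not the actual state store classes.

```java
// Hypothetical illustration only: not the real state store version class.
final class SchemaVersion {
    final int major;
    final int minor;

    SchemaVersion(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    // Software built for `supported` will use a store written at `stored`
    // whenever the major versions match; the minor version is ignored.
    static boolean isCompatible(SchemaVersion supported, SchemaVersion stored) {
        return supported.major == stored.major;
    }
}
```

With this rule, old software built for 1.0 would accept a store stamped 1.1 but refuse one stamped 2.0, which is why a minor bump alone does not keep old software away from the new keys.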
So we need to think about two scenarios:
- What happens if we upgrade to a newer version of software that sees the old schema without these keys?
- What happens if we downgrade from a newer version of software with these keys to an older one that doesn't know about them?
For #1 (upgrade) I think it's easy. The old software doesn't support queued containers, so those keys won't be in the state store. No queued containers means nothing to restore for that subsystem, so recovery should be fine.
For #2 (downgrade) it's more complicated. If we have queued containers and then do a rolling downgrade, we could end up losing those containers because the old software doesn't know about them. Therefore I think we can't support rolling downgrades once queued containers are in use.
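As a rough sketch of why the upgrade case is benign, assume queued containers live under their own key prefix. The `queued/` prefix, the map-backed store, and `recoverQueued` are all assumptions for illustration, not the real key layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a map stands in for the state store.
final class QueuedContainerRecovery {
    // Assumed key prefix for queued containers (not the real layout).
    static final String QUEUED_PREFIX = "queued/";

    // Collects the IDs of all queued containers found in the store.
    // A store written by old software has no such keys, so the result
    // is simply empty and there is nothing to restore.
    static List<String> recoverQueued(Map<String, String> store) {
        List<String> ids = new ArrayList<>();
        for (String key : store.keySet()) {
            if (key.startsWith(QUEUED_PREFIX)) {
                ids.add(key.substring(QUEUED_PREFIX.length()));
            }
        }
        return ids;
    }
}
```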
So it looks like the proper way forward is to bump the major version because of the lack of rolling downgrade support. IMHO the version number should be updated "lazily," meaning if we're currently on schema version 1 but never use queued containers then it stays at version 1. If we're on version 1 when a queued container needs to be saved in the state store then we update the major version at that time. This has two important benefits for the end user:
- No need for a "migration script" that has to be run manually.
- Users don't lose the ability to do a rolling downgrade until they actually use the functionality that breaks downgrades.
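The lazy bump could look roughly like the sketch below. Everything here is hypothetical (the class name, the key names, the in-memory map standing in for the state store), so treat it as an illustration of the idea rather than the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: an in-memory map stands in for the state store.
final class LazyVersionedStore {
    static final String VERSION_KEY = "schema-version"; // assumed key name
    static final String QUEUED_PREFIX = "queued/";      // assumed key prefix

    private final Map<String, String> db = new HashMap<>();

    LazyVersionedStore() {
        // A fresh store starts at schema version 1.0.
        db.put(VERSION_KEY, "1.0");
    }

    int majorVersion() {
        return Integer.parseInt(db.get(VERSION_KEY).split("\\.")[0]);
    }

    // Ordinary containers never touch the schema version.
    void storeContainer(String id) {
        db.put("container/" + id, "RUNNING");
    }

    // The first queued container bumps the major version, and the bump is
    // written before the new key so that the key never exists under a
    // version 1 stamp.
    void storeQueuedContainer(String id) {
        if (majorVersion() < 2) {
            db.put(VERSION_KEY, "2.0");
        }
        db.put(QUEUED_PREFIX + id, "QUEUED");
    }
}
```

In a real store the version bump and the first queued-container write would presumably need to be atomic (for example, in the same write batch); the sketch only shows the ordering.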
This matches the precedent set by the container ID epoch change for RM work-preserving restart in 2.6. 2.5 apps were supported on 2.6 until the user did a work-preserving RM restart, since that's what caused the epoch ID to be added to the container ID, breaking any 2.5 app that tried to parse a container ID.