Thanks for the patch!
What I meant about the leak is a scenario like this:
- NM is running version V which introduced a new key K that is associated with containers.
- A container is running which causes K to be written to the state store
- User does a rolling downgrade to V-1. The code ignores unrecognized key K.
- The container completes and the container is removed from the state store. This only removes the container keys version V knows about, and K is not one of those keys.
- At this point K has been leaked in the state store.
- That leak will be permanent until a rolling upgrade to >= V. Even then K might not be cleaned up since all the other container state has been removed, probably interfering with the typical recovery flow for that key type.
There are a couple of risks when cleaning up unrecognized keys. The old version may be removing the key too early in the lifecycle of that state such that if we do a rolling upgrade back to the version that works with those keys we've incorrectly destroyed the state. We probably need to think more about the ramifications of cleaning unrecognized keys and when we should or shouldn't do so. Appreciate any thoughts on this.
The other risk is that doing this cleaning will add a place where the NM will read the state store as it scans for keys to remove, and previously it only ever wrote to the store after the initial recover on startup. Writes to leveldb are typically very fast, whereas reads could be much slower depending upon how much the database needs to be compacted and how many blocks are involved in the scan. This is likely a minor concern especially with the recent periodic full compaction to the store, but it will impact state store performance to some degree.
As for the patch the changes will make the NM more tolerant of new container keys, but there are other places where unexpected keys will break the state store recovery. loadResourceTrackerState and loadUserLocalizedResources are some other places that should be updated and there are similar questions there as to what should be done about cleanup of unexpected keys.