What if a VERSION file already exists in the directory for some reason? Should we at least print a WARN for further investigation?
The equivalent code for the non-HA case (saveNamespace) also unconditionally overwrites an existing VERSION file. The reasoning is that, regardless of the previous state, the directory now has the up-to-date checkpoint, so it should have an accompanying VERSION file. Overwriting an existing VERSION is therefore expected, and I don't think we need to do anything here.
On the retention manager, is it the right behavior to skip purging old image files if VERSION is missing? Should we do a follow-on fix to handle the case where the VERSION file is lost for some other reason (misoperation, etc.)?
At minimum, it already logs a WARN. What do you think should be done? Report a storage error by calling reportErrorsOnDirectory()? This would put the storage dir in the "failed" list, which is recovered later online. The recovery check would then also need to verify that VERSION exists.
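
To make the last suggestion concrete, here is a rough sketch of what a VERSION existence check could look like. It is not the actual HDFS code: the class `StorageDirCheck` and its methods are hypothetical names, it uses only `java.nio.file`, and it assumes the VERSION file sits at `<storageDir>/current/VERSION`. Wiring the result into reportErrorsOnDirectory() and the online recovery path is left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of HDFS.
final class StorageDirCheck {

    // Assumed layout: <storageDir>/current/VERSION
    static boolean hasVersionFile(Path storageDir) {
        return Files.isRegularFile(storageDir.resolve("current").resolve("VERSION"));
    }

    // Returns the storage dirs that lack a VERSION file. A caller could
    // report these as failed, and the recovery check could call
    // hasVersionFile() again before putting a dir back into service.
    static List<Path> findDirsMissingVersion(List<Path> storageDirs) {
        List<Path> missing = new ArrayList<>();
        for (Path dir : storageDirs) {
            if (!hasVersionFile(dir)) {
                missing.add(dir);
            }
        }
        return missing;
    }
}
```

With something like this, the retention manager would not need to decide on its own whether to purge: a dir without VERSION would simply be treated as failed until recovery restores it.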