Affects Version/s: 1.5.3, 1.6.0, 1.7.0
Fix Version/s: None
Component/s: Runtime / Coordination
While going over the ZooKeeper based stores (ZooKeeperSubmittedJobGraphStore, ZooKeeperMesosWorkerStore, ZooKeeperCompletedCheckpointStore) and the underlying ZooKeeperStateHandleStore I noticed several inconsistencies which were introduced with past incremental changes.
- Depending whether ZooKeeperStateHandleStore#getAllSortedByNameAndLock or ZooKeeperStateHandleStore#getAllAndLock is called, deserialization problems will either lead to removing the Znode or not
- ZooKeeperStateHandleStore leaves inconsistent state in case of exceptions (e.g. #getAllAndLock won't release the acquired locks in case of a failure)
- ZooKeeperStateHandleStore has too many responsibilities. It would be better to move RetrievableStateStorageHelper out of it for a better separation of concerns
- ZooKeeperSubmittedJobGraphStore overwrites a stored JobGraph even if it is locked. This should not happen since it could leave another system in an inconsistent state (imagine a changed JobGraph which restores from an old checkpoint)
- Redundant but also somewhat inconsistent put logic in the different stores
- Shadowing of ZooKeeper specific exceptions in ZooKeeperStateHandleStore which were expected to be caught in ZooKeeperSubmittedJobGraphStore
- Getting rid of the SubmittedJobGraphListener would be helpful
These problems made me think how reliable these components actually work. Since these components are very important, I propose to refactor them.