Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
1.5.0, 1.6.0, 1.7.0, 1.8.0
-
Mesosphere RI-6 Sprint 2018-30, Mesosphere RI-6 Sprint 2018-31
-
5
Description
SLRP checkpoints a pending operation before issuing the corresponding CSI call through slave::state::checkpoint, which writes the new checkpoint to a temporary file and then renames it. However, because we never fsync, the rename is not atomic with respect to a system crash. As a result, if the system crashes while the operation is being processed, it is possible that the CSI call has been executed but the SLRP reads back a stale checkpoint after reboot and has no record of the operation.
To address this problem, we need to ensure the following before issuing the CSI call:
1. The temp file is synced to the disk.
2. The rename is committed to the disk.
A possible solution is to fsync the temp file after writing it, and to fsync the checkpoint directory after the rename.
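The write, fsync, rename, fsync-directory sequence could look roughly like the sketch below. This is only an illustration using plain POSIX calls: `durableCheckpoint`, its parameters, and the `.tmp` suffix are hypothetical names chosen for this example and are not the actual slave::state::checkpoint implementation.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper (not the Mesos API): durably replaces `dir/name`
// with `data`. Returns 0 on success or an errno value on failure.
int durableCheckpoint(
    const std::string& dir,
    const std::string& name,
    const std::string& data)
{
  const std::string target = dir + "/" + name;
  const std::string temp = target + ".tmp"; // Assumed temp file naming.

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return errno;
  }

  // Write the full payload, retrying on short writes and EINTR.
  size_t written = 0;
  while (written < data.size()) {
    ssize_t n = ::write(fd, data.data() + written, data.size() - written);
    if (n < 0) {
      if (errno == EINTR) {
        continue;
      }
      int error = errno;
      ::close(fd);
      ::unlink(temp.c_str());
      return error;
    }
    written += static_cast<size_t>(n);
  }

  // Step 1: make sure the temp file contents reach the disk.
  if (::fsync(fd) != 0) {
    int error = errno;
    ::close(fd);
    ::unlink(temp.c_str());
    return error;
  }

  if (::close(fd) != 0) {
    int error = errno;
    ::unlink(temp.c_str());
    return error;
  }

  // Replace the old checkpoint with the new one.
  if (std::rename(temp.c_str(), target.c_str()) != 0) {
    int error = errno;
    ::unlink(temp.c_str());
    return error;
  }

  // Step 2: fsync the containing directory so the rename itself is
  // committed to the disk.
  int dirfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
  if (dirfd < 0) {
    return errno;
  }

  if (::fsync(dirfd) != 0) {
    int error = errno;
    ::close(dirfd);
    return error;
  }

  ::close(dirfd);
  return 0;
}
```

With this ordering, the CSI call would only be issued after the function returns 0, so a crash at any earlier point leaves either the old checkpoint or the complete new one on disk.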
Issue Links
- is related to: MESOS-9282 StatusUpdateManager does not sync checkpointed data to disk. (Accepted)