Details
- Bug
- Status: Accepted
- Major
- Resolution: Unresolved
- 1.5.1, 1.6.1, 1.7.0
- None
Description
This is related to MESOS-9281, which we observed in a testing environment.
The status update manager used to open the checkpoint file with O_SYNC, which guarantees that each write is persisted to disk before it returns (similar to calling fsync() after each write()).
This was removed because of a performance issue:
https://reviews.apache.org/r/50635/
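However, the assumption in the patch no longer holds now that we allow the same agent ID to be reused after a machine reboot. Without O_SYNC, a checkpointed update may not have reached the disk when the machine crashes, so this will likely cause issues such as the stale checkpoint seen in MESOS-9281.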
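For reference, the two durability approaches mentioned above can be sketched with plain POSIX calls. This is a minimal illustration, not the status update manager's actual code: the helper names `appendWithOSync` and `appendWithFsync`, the file mode, and the append-only open flags are all assumptions made for the example.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <string>

// Hypothetical helper: opens the file with O_SYNC, so each write()
// returns only after the data has been handed to stable storage.
bool appendWithOSync(const std::string& path, const std::string& record)
{
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
  if (fd < 0) {
    return false;
  }

  ssize_t written = ::write(fd, record.data(), record.size());
  bool ok = written == static_cast<ssize_t>(record.size());

  return (::close(fd) == 0) && ok;
}

// Hypothetical helper: opens the file without O_SYNC and calls fsync()
// explicitly after the write, which gives a similar guarantee.
bool appendWithFsync(const std::string& path, const std::string& record)
{
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) {
    return false;
  }

  ssize_t written = ::write(fd, record.data(), record.size());
  bool ok = written == static_cast<ssize_t>(record.size());

  // Without this call the record may stay in the page cache and be
  // lost if the machine crashes before the kernel flushes it.
  ok = (::fsync(fd) == 0) && ok;

  return (::close(fd) == 0) && ok;
}
```

Dropping both the flag and the explicit fsync() leaves the write in the page cache until the kernel decides to flush it, which is the window in which a crash could leave a stale checkpoint on disk.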
Issue Links
- relates to
  - MESOS-9281 SLRP gets a stale checkpoint after system crash. (Resolved)
  - MESOS-5944 Remove `O_SYNC` from StatusUpdateManager logs (Resolved)