[MESOS-9281] SLRP gets a stale checkpoint after system crash. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Fix Version/s: 1.7.1, 1.8.0
Component/s: storage
Labels: mesosphere, storage
Target Version/s: 1.7.1, 1.8.0
Epic Link: Resource Provider and CSI Tech Debt
Sprint: Mesosphere RI-6 Sprint 2018-30, Mesosphere RI-6 Sprint 2018-31
Story Points: 5

Description

SLRP checkpoints a pending operation before issuing the corresponding CSI call through slave::state::checkpoint, which writes the new checkpoint to a temporary file and then renames it over the old one. However, because we don't do any fsync, the rename is not atomic w.r.t. a system crash: neither the temp file's contents nor the rename itself is guaranteed to have reached the disk. As a result, if the system crashes while the operation is being processed, it is possible that the CSI call has been executed, but the SLRP gets back a stale checkpoint after reboot and has no record of the operation.

To address this problem, we need to ensure the following before issuing the CSI call:
1. The temp file is synced to the disk.
2. The rename is committed to the disk.

A possible solution is to fsync the temp file after writing it, and then fsync the checkpoint directory after the rename.
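
The two-fsync sequence proposed above can be sketched as follows. This is a minimal illustration in Python for POSIX systems, not the actual Mesos C++ change; the function name `durable_checkpoint` and the `.checkpoint-` temp-file prefix are hypothetical and chosen only for this example.

```python
import os
import tempfile


def durable_checkpoint(path: str, data: bytes) -> None:
    """Write `data` to `path` so the new contents survive a system crash.

    Hypothetical helper for illustration; assumes a POSIX filesystem where
    fsync on a directory persists its entries.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Create the temp file in the same directory as the checkpoint so the
    # rename never crosses a filesystem boundary.
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "wb") as temp_file:
            temp_file.write(data)
            temp_file.flush()
            # Step 1: sync the temp file's contents to the disk.
            os.fsync(temp_file.fileno())

        # Replace the old checkpoint with the new one.
        os.replace(temp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise

    # Step 2: sync the directory so the rename itself is committed to disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Only after `durable_checkpoint` returns would the caller go on to issue the CSI call, so that a crash at any later point still leaves the pending operation recorded on disk.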

Issue Links

is related to: MESOS-9282 StatusUpdateManager does not sync checkpointed data to disk. (Accepted)

People

Assignee: Chun-Hung Hsiao
Reporter: Chun-Hung Hsiao
Shepherd: Jie Yu
Votes: 0
Watchers: 2

Dates

Created: 01/Oct/18 22:05
Updated: 31/Oct/18 18:31
Resolved: 31/Oct/18 18:31