  1. Mesos
  2. MESOS-9281

SLRP gets a stale checkpoint after system crash.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.6.0, 1.7.0, 1.8.0
    • Fix Version/s: 1.7.1, 1.8.0
    • Component/s: storage

      Description

      Before issuing a CSI call, SLRP checkpoints the corresponding pending operation through slave::state::checkpoint, which writes the new checkpoint to a temporary file and then renames it over the old one. However, because we never fsync, the rename is not atomic w.r.t. a system crash. As a result, if the system crashes while the operation is being processed, it is possible that the CSI call has been executed but the SLRP reads back a stale checkpoint after reboot and has no record of the operation.

      To address this problem, we need to ensure the following before issuing the CSI call:
      1. The temp file is synced to the disk.
      2. The rename is committed to the disk.

      A possible solution is to do an fsync after writing the temp file, and do another fsync on the checkpoint dir after the rename.
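
The proposed sequence (write temp file, fsync it, rename, fsync the directory) can be sketched with plain POSIX calls as below. This is a minimal illustration, not the actual Mesos fix: the function name checkpointDurably, the helpers writeAll and fsyncPath, and the ".tmp" suffix for the temporary file are all hypothetical, and it assumes a POSIX system where a directory can be opened read-only and passed to fsync (as on Linux).

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper: writes all of `data` to `fd`, retrying on short
// writes and on EINTR.
static bool writeAll(int fd, const std::string& data)
{
  size_t offset = 0;
  while (offset < data.size()) {
    ssize_t n = ::write(fd, data.data() + offset, data.size() - offset);
    if (n < 0) {
      if (errno == EINTR) {
        continue;
      }
      return false;
    }
    offset += static_cast<size_t>(n);
  }
  return true;
}

// Hypothetical helper: opens `path` read-only and fsyncs it. Used here to
// flush a directory entry change to disk.
static bool fsyncPath(const std::string& path)
{
  int fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0) {
    return false;
  }
  const bool ok = ::fsync(fd) == 0;
  ::close(fd);
  return ok;
}

// Hypothetical sketch of a crash-safe checkpoint: returns true only once
// both the file contents and the rename have been synced.
bool checkpointDurably(const std::string& path, const std::string& data)
{
  const std::string temp = path + ".tmp"; // Assumed temp file naming.

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return false;
  }

  // Step 1: sync the temp file's contents to disk before the rename.
  bool ok = writeAll(fd, data) && ::fsync(fd) == 0;
  ok = (::close(fd) == 0) && ok;
  if (!ok) {
    ::unlink(temp.c_str());
    return false;
  }

  if (std::rename(temp.c_str(), path.c_str()) != 0) {
    ::unlink(temp.c_str());
    return false;
  }

  // Step 2: sync the parent directory so the rename itself is committed.
  const size_t slash = path.find_last_of('/');
  const std::string dir =
    slash == std::string::npos ? "." : (slash == 0 ? "/" : path.substr(0, slash));

  return fsyncPath(dir);
}
```

A caller would issue the CSI call only after checkpointDurably returns true. If it returns false, the caller cannot assume the new checkpoint would survive a crash.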

              People

              • Assignee:
                chhsia0 Chun-Hung Hsiao
                Reporter:
                chhsia0 Chun-Hung Hsiao
                Shepherd:
                Jie Yu
              • Votes:
                0
                Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved: