Mesos / MESOS-1554

Persistent resources support for storage-like services


Details

    • Epic
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • None
    • fetcher
    • Persistence

    Description

      This question came up on the dev mailing list.
      It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage their instances. But right now, if we'd like to restart an instance (e.g. to spin up a new version), all of the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector.

      At the moment filesystem resources can be managed out of band, i.e. instances can save their data in some database-specific place that various instances can share (e.g. /var/lib/cassandra).

      Benjamin Hindman suggested an idea on the mailing list (though it still needs some fleshing out):

      The idea originally came about because, even today, if we allocate some
      file system space to a task/executor, and then that task/executor
      terminates, we haven't officially "freed" those file system resources until
      after we garbage collect the task/executor sandbox! (We keep the sandbox
      around so a user/operator can get the stdout/stderr or anything else left
      around from their task/executor.)

      To solve this problem we wanted to be able to let a task/executor terminate
      but not give up all of its resources, hence: persistent resources.

      Pushing this concept even further you could imagine always reallocating
      resources to a framework that had already been allocated those resources
      for a previous task/executor. Looked at from another perspective, these are
      "late-binding", or "lazy", resource reservations.

      At one point in time we had considered just doing 'right-of-first-refusal'
      for allocations after a task/executor terminates. But this is really
      insufficient for supporting storage-like frameworks well (and likely even
      harder to reliably implement than 'persistent resources' IMHO).

      There are a ton of things that need to get worked out in this model,
      including (but not limited to), how should a file system (or disk) be
      exposed in order to be made persistent? How should persistent resources be
      returned to a master? How many persistent resources can a framework get
      allocated?
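
      To make the "persistent resources" idea above concrete, here is a minimal toy sketch in Python. It is not Mesos code and does not reflect any Mesos API: the ToyAgent class, its fields, and the launch/terminate methods are all hypothetical names invented for illustration. It only models the behaviour described in the email: disk held by a terminated task stays set aside for the same framework and is not handed to others.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical model of one agent's disk; not a Mesos API."""
    total_disk_mb: int
    # framework id -> disk (MB) held by that framework's running tasks
    in_use: dict = field(default_factory=dict)
    # framework id -> disk (MB) kept for the framework after termination
    persistent: dict = field(default_factory=dict)

    def free_disk_mb(self) -> int:
        # Disk that is neither in use nor kept as persistent for anyone.
        used = sum(self.in_use.values()) + sum(self.persistent.values())
        return self.total_disk_mb - used

    def launch(self, framework: str, disk_mb: int) -> bool:
        # A framework may draw on its own persistent disk plus free disk.
        reusable = self.persistent.get(framework, 0)
        if disk_mb > reusable + self.free_disk_mb():
            return False
        # Consume the framework's persistent disk first, then free disk.
        self.persistent[framework] = max(0, reusable - disk_mb)
        self.in_use[framework] = self.in_use.get(framework, 0) + disk_mb
        return True

    def terminate(self, framework: str, keep: bool) -> None:
        # With keep=True the disk stays set aside for the same framework;
        # with keep=False it returns to the free pool.
        disk = self.in_use.pop(framework, 0)
        if keep:
            self.persistent[framework] = self.persistent.get(framework, 0) + disk


agent = ToyAgent(total_disk_mb=1000)
agent.launch("cassandra", 800)
agent.terminate("cassandra", keep=True)   # e.g. restarting for an upgrade
other_ok = agent.launch("other", 500)      # only 200 MB is actually free
again_ok = agent.launch("cassandra", 800)  # gets its previous disk back
```

      In this sketch the second framework is refused because the 800 MB are still held for the first one, which is the difference from a plain 'right-of-first-refusal' scheme. The open questions listed above (how resources are returned, how many a framework may hold) are deliberately left out.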

          People

            mcypark Michael Park
            nekto0n Nikita Vetoshkin
            Votes: 37
            Watchers: 83