Mesos / MESOS-1554

Persistent resources support for storage-like services

    Details

    • Type: Epic
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fetcher
    • Epic Name:
      Persistence

      Description

      This question came up in dev mailing list.
      It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage their instances. But right now, if we'd like to restart an instance (e.g. to spin up a new version), all of the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector.

      At the moment, filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. /var/lib/cassandra).

      Benjamin Hindman suggested an idea on the mailing list (though it still needs some fleshing out):

      The idea originally came about because, even today, if we allocate some
      file system space to a task/executor, and then that task/executor
      terminates, we haven't officially "freed" those file system resources until
      after we garbage collect the task/executor sandbox! (We keep the sandbox
      around so a user/operator can get the stdout/stderr or anything else left
      around from their task/executor.)

      To solve this problem we wanted to be able to let a task/executor terminate
      but not give up all of its resources, hence: persistent resources.

      Pushing this concept even further you could imagine always reallocating
      resources to a framework that had already been allocated those resources
      for a previous task/executor. Looked at from another perspective, these are
      "late-binding", or "lazy", resource reservations.

      At one point in time we had considered just doing 'right of first refusal'
      for allocations after a task/executor terminates. But this is really
      insufficient for supporting storage-like frameworks well (and likely even
      harder to reliably implement than 'persistent resources' IMHO).

      There are a ton of things that need to get worked out in this model,
      including (but not limited to), how should a file system (or disk) be
      exposed in order to be made persistent? How should persistent resources be
      returned to a master? How many persistent resources can a framework get
      allocated?
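      To make the "persistent resources" idea concrete, here is a minimal sketch of how a persistent disk resource might be described, written as a plain Python data structure. This is purely illustrative: the field names loosely mirror the shape a disk resource with a persistence ID and a container-side mount path could take, and are assumptions for this sketch, not the actual Mesos protobuf schema.

      ```python
      # Illustrative sketch only: a disk resource that a framework could keep
      # across task/executor restarts. Field names are hypothetical, not the
      # real Mesos API.

      def make_persistent_disk(role, persistence_id, container_path, megabytes):
          """Describe a disk resource that survives task/executor termination."""
          return {
              "name": "disk",
              "type": "SCALAR",
              "scalar": {"value": megabytes},
              "role": role,  # reserved for this framework's role
              "disk": {
                  # The persistence ID lets the master re-offer the same
                  # volume back to the framework after the task terminates.
                  "persistence": {"id": persistence_id},
                  "volume": {
                      "container_path": container_path,  # where the task sees it
                      "mode": "RW",
                  },
              },
          }

      # e.g. a Cassandra instance keeping 2 GB of data across restarts:
      volume = make_persistent_disk("cassandra", "vol-1", "data", 2048)
      ```

      The key design point the description argues for is visible in the sketch: the volume's identity (the persistence ID) belongs to the resource, not to the task, so terminating the task does not free the disk.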

    Issue Links

    Issues in Epic

    Activity

            wolframarnold Wolfram Arnold added a comment -

            I'd love to see this feature for running Elasticsearch instances on mesos.

            stevenschlansker Steven Schlansker added a comment -

            It would be nice to be able to manage e.g. Amazon EBS (or generic SAN) volumes in this way. That would be very powerful indeed.

            dmontauk Dobromir Montauk added a comment -

            Separating "resources" from the "running job" makes a lot of sense. That's how Borg at Google works.

            They have a separate concept, "allocation", that you can use (but don't have to). You define an allocation just like a task/job - how much CPU, RAM, etc it gets. Then you can put your tasks "into" the allocation. They have their own CPU, RAM, etc requirements and obviously have to fit.

            Borg then has separate commands for allocations and jobs. If you just touch the job (up/down/restart/etc), then the allocation sticks around and can be reused. All disk resources, CPU reservations, etc. are still there. Note that allocations must support more than just "persistent disk" - otherwise, there's a chance that the job won't schedule because CPU/RAM is used by someone else, and you've just lost all your "persistence" benefits! To wipe away the job entirely, you have to remove the allocation itself (which, being very dangerous, was usually secured with a different permission set than the job).

            It looks like the design right now is mostly around "persistent disk" but I'm not sure that's really going to work longer-term. We should make "allocations" first-class objects that, like tasks, can reserve anything, and have jobs just running inside an alloc.
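            The allocation-vs-job split described above can be sketched as follows. This is conceptual pseudocode for the idea in the comment, not a Borg or Mesos API; the class and method names are invented for illustration.

            ```python
            # Conceptual sketch of the Borg-style "allocation" idea: an
            # allocation reserves CPU/RAM/disk; tasks run inside it and must
            # fit. Stopping a task frees its share, but the allocation (and
            # its disk) sticks around for reuse.

            class Allocation:
                def __init__(self, cpus, ram_mb, disk_mb):
                    self.cpus, self.ram_mb, self.disk_mb = cpus, ram_mb, disk_mb
                    self.tasks = {}

                def start_task(self, name, cpus, ram_mb):
                    # A task must fit inside what the allocation has left.
                    used_cpu = sum(t["cpus"] for t in self.tasks.values())
                    used_ram = sum(t["ram_mb"] for t in self.tasks.values())
                    if used_cpu + cpus > self.cpus or used_ram + ram_mb > self.ram_mb:
                        raise ValueError("task does not fit in allocation")
                    self.tasks[name] = {"cpus": cpus, "ram_mb": ram_mb}

                def stop_task(self, name):
                    # The allocation keeps its reservation after the task exits,
                    # so a restarted task cannot lose its slot to someone else.
                    self.tasks.pop(name)

            alloc = Allocation(cpus=4, ram_mb=8192, disk_mb=100_000)
            alloc.start_task("cassandra", cpus=2, ram_mb=4096)
            alloc.stop_task("cassandra")              # restart: allocation persists
            alloc.start_task("cassandra-v2", cpus=2, ram_mb=4096)
            ```

            The sketch also shows why the comment insists allocations must cover more than disk: if only the disk persisted, the restarted task could fail to schedule for lack of CPU/RAM.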

            dmontauk Dobromir Montauk added a comment -

            Just got pointed to https://issues.apache.org/jira/browse/MESOS-2018 which is what I was looking for. Exciting!

            vaibhavkhanduja Vaibhav Khanduja added a comment -

            Managing SAN, iSCSI, or even Amazon EBS volumes is something that should be worked on. There are a number of scenarios that would require interacting with backend storage, from initial provisioning to expansion of space. A framework that connects to such backend services could be built with callbacks, hooks, or extensions in the executors. The garbage collection (releasing) of resources could be asynchronous or a timed, scheduled activity, similar to the Java JVM. Such scheduled GC would also enable the use of extended data services on the backend data.

            adam-mesos Adam B added a comment -

            This Epic/feature is critical for stateful frameworks in Mesos 0.23 and beyond. Upgraded Priority to Critical.

            adam-mesos Adam B added a comment -

            Michael Park, Jie Yu, what's left before we can say that "Persistent Volumes" has shipped?
            Can we move the unresolved tasks from this JIRA into a Persistent Volumes v2 Epic, so we can close this one out?


              People

              • Assignee:
                mcypark Michael Park
              • Reporter:
                nekto0n Nikita Vetoshkin
              • Votes:
                37
              • Watchers:
                110
