Mesos / MESOS-9542

Hierarchical allocator check failure when an operation on a shutdown framework finishes

Details

    Description

      When a non-speculated operation such as CREATE_DISK becomes terminal after its originating framework has been torn down, the allocator hits an assertion failure.

      I0129 11:55:35.764394 57857 master.cpp:11373] Updating the state of operation 'operation' (uuid: 10a782bd-9e60-42da-90d6-c00997a25645) for framework a4d0499b-c0d3-4abf-8458-73e595d061ce-0000 (latest state: OPERATION_PENDING, status update state: OPERATION_FINISHED)
      F0129 11:55:35.764744 57925 hierarchical.cpp:834] Check failed: frameworks.contains(frameworkId)

      Non-speculated operations such as CREATE_DISK made it possible for operations to outlive their originating framework. This was not possible with speculated operations such as RESERVE, which the master always applied immediately.

      The master does not take this into account and instead unconditionally calls Allocator::updateAllocation, which asserts that the framework is still known to the allocator.
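
The failing precondition can be illustrated with a toy model. This is a minimal sketch, not Mesos code: `ToyAllocator` and `onTerminalOperationUpdate` are hypothetical names, and the toy `updateAllocation` throws where the real allocator aborts the process, so that the failure can be observed. The guard only shows the check the master currently does not perform; a real fix would also have to account for the resources the operation changed.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for the allocator (hypothetical, not the Mesos class).
class ToyAllocator {
public:
  void addFramework(const std::string& frameworkId) {
    frameworks.insert(frameworkId);
  }

  void removeFramework(const std::string& frameworkId) {
    frameworks.erase(frameworkId);
  }

  bool knows(const std::string& frameworkId) const {
    return frameworks.count(frameworkId) > 0;
  }

  // Mirrors the precondition from the log: the framework must be known.
  // Throws instead of aborting so callers can observe the failure.
  void updateAllocation(const std::string& frameworkId) {
    if (!knows(frameworkId)) {
      throw std::logic_error("Check failed: frameworks.contains(frameworkId)");
    }
    ++updates;
  }

  int updates = 0;  // Number of successful allocation updates.

private:
  std::set<std::string> frameworks;
};

// Hypothetical master-side handler for a terminal operation status update.
// Returns true if the allocator was updated, false if the update was skipped
// because the framework had already been removed.
bool onTerminalOperationUpdate(
    ToyAllocator& allocator, const std::string& frameworkId) {
  if (!allocator.knows(frameworkId)) {
    return false;
  }
  allocator.updateAllocation(frameworkId);
  return true;
}
```

Calling `updateAllocation` directly after `removeFramework` reproduces the failed check in this model, while the guarded handler returns `false` and leaves the allocator untouched.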

      Reproducer:

      • register a framework with the master.
      • add an agent with a resource provider.
      • let the framework trigger a non-speculated operation like CREATE_DISK.
      • tear down the framework before a terminal operation status update reaches the master; this causes the master to, among other things, remove the framework from the allocator.
      • let a terminal, successful operation status update reach the master.
      • 💥 
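
The order of events in the reproducer can be sketched as a small state model. Everything here is an assumption made for illustration: `ToyMaster`, `ToyOperation`, and their members are hypothetical and do not correspond to Mesos classes, and the step that adds the resource provider is left out because it does not affect the ordering problem.

```cpp
#include <map>
#include <set>
#include <string>

enum class OperationState { PENDING, FINISHED };

// A pending operation remembers which framework started it.
struct ToyOperation {
  std::string frameworkId;
  OperationState state;
};

// Toy master (hypothetical): tracks the frameworks the allocator knows about
// and the operations that are still in flight, keyed by operation UUID.
struct ToyMaster {
  std::set<std::string> allocatorFrameworks;
  std::map<std::string, ToyOperation> operations;

  void registerFramework(const std::string& frameworkId) {
    allocatorFrameworks.insert(frameworkId);
  }

  // A non-speculated operation is only recorded as pending here; it is
  // applied once a terminal status update arrives.
  void launchOperation(
      const std::string& uuid, const std::string& frameworkId) {
    operations[uuid] = ToyOperation{frameworkId, OperationState::PENDING};
  }

  // Teardown removes the framework from the allocator, but the pending
  // operation stays behind.
  void teardownFramework(const std::string& frameworkId) {
    allocatorFrameworks.erase(frameworkId);
  }

  // Marks the operation finished and reports whether its framework is still
  // known to the allocator. A result of false is the state in which an
  // unconditional allocator update would fail its check.
  bool receiveTerminalUpdate(const std::string& uuid) {
    ToyOperation& operation = operations.at(uuid);
    operation.state = OperationState::FINISHED;
    return allocatorFrameworks.count(operation.frameworkId) > 0;
  }
};
```

Running the steps in order (register, launch, tear down, receive the terminal update) leaves the operation in `operations` while `receiveTerminalUpdate` returns `false`, which is the situation the reproducer provokes.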

      To solve this we should clean up the lifetimes of operations. Since operations can outlive their framework (unlike, for example, tasks), we probably need a different approach here.

            People

              Joseph Wu (kaysoky)
              Benjamin Bannier (bbannier)
              Greg Mann