Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.7.0, 1.7.1, 1.8.0
- Mesos Foundations RI10 Sp 39, Mesos Foundations RI11 Sp 40
- 5
Description
When a non-speculated operation such as CREATE_DISK becomes terminal after the originating framework has been torn down, we run into an assertion failure in the allocator.
I0129 11:55:35.764394 57857 master.cpp:11373] Updating the state of operation 'operation' (uuid: 10a782bd-9e60-42da-90d6-c00997a25645) for framework a4d0499b-c0d3-4abf-8458-73e595d061ce-0000 (latest state: OPERATION_PENDING, status update state: OPERATION_FINISHED)
F0129 11:55:35.764744 57925 hierarchical.cpp:834] Check failed: frameworks.contains(frameworkId)
With non-speculated operations such as CREATE_DISK, operations can now outlive their originating framework. This was not possible with speculated operations such as RESERVE, which were always applied immediately by the master.
The master does not take this into account, but instead unconditionally calls Allocator::updateAllocation which asserts that the framework is still known to the allocator.
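The failure mode can be illustrated with a small, self-contained model. This is only a sketch: `ToyAllocator` and `applyTerminalOperation` are hypothetical names invented for illustration and are not the Mesos classes or the fix that was eventually committed. It assumes the master could ask whether a framework is still known before forwarding a terminal operation update.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for the allocator (hypothetical, not the Mesos class).
class ToyAllocator {
public:
  void addFramework(const std::string& id) { frameworks.insert(id); }
  void removeFramework(const std::string& id) { frameworks.erase(id); }

  bool contains(const std::string& id) const {
    return frameworks.count(id) > 0;
  }

  // Models the precondition behind the failed check in the log above:
  // the framework must still be known when the allocation is updated.
  void updateAllocation(const std::string& id) {
    assert(contains(id));
    ++updates;
  }

  int updates = 0;  // Number of allocation updates applied so far.

private:
  std::set<std::string> frameworks;
};

// Hypothetical guarded handler: forwards the update only if the
// framework is still known, and reports whether it was forwarded.
bool applyTerminalOperation(
    ToyAllocator& allocator, const std::string& frameworkId) {
  if (!allocator.contains(frameworkId)) {
    return false;  // Operation outlived its framework; skip the allocator.
  }

  allocator.updateAllocation(frameworkId);
  return true;
}
```

In this model, calling `updateAllocation` directly after `removeFramework` trips the assertion, which mirrors the crash; going through `applyTerminalOperation` returns `false` instead. Such a guard only avoids the crash and does not by itself settle how the resources of an orphaned operation should be accounted for.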
Reproducer:
- register a framework with the master.
- add an agent with a resource provider.
- let the framework trigger a non-speculated operation such as CREATE_DISK.
- tear down the framework before a terminal operation status update reaches the master; this causes the master to, among other things, remove the framework from the allocator.
- let a terminal, successful operation status update reach the master.
- 💥
To solve this, we should clean up the lifetimes of operations. Since operations can outlive their framework (unlike, e.g., tasks), we probably need a different approach here.
Issue Links
- is related to: MESOS-9635 OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky again (3x) due to orphan operations (Resolved)