[MESOS-8850] Race between master and allocator when destroying shared volume could lead to sorter check failure. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: allocation, master
Labels: None

Description

When destroying a shared volume, the master first rescinds offers that contain the shared volume and then applies the destroy operation. This process involves interaction between the master and allocator actors. The following race could arise:

1. Framework1 and framework2 are each offered a shared disk;
2. Framework2 asks the master to destroy the shared disk;
3. Master rescinds framework1's offer that contains the shared disk;
4. `allocator->recoverResources` is called to recover framework1’s offered resources in the allocator;
5. [Race] Shortly afterwards, the allocator allocates resources to framework1. The allocation contains the shared disk that was just recovered and has not yet been destroyed. The allocator invokes `offerCallback`, which dispatches to the master;
6. Master continues the destroy operation and calls `allocator->updateAllocation` to notify the allocator to transform the shared disk into a regular reserved disk;
7. Master processes the `offerCallback` dispatched in step 5 and offers the shared disk to framework1.

At this point, the same disk resource appears in two different places: one shared, offered to framework1; one not shared, currently held by framework2 (soon to be recovered).
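The interleaving above can be replayed with a small single-threaded Python model. This is only an illustrative sketch, not Mesos code: `Disk`, `Allocator`, `simulate_race`, and the `mailbox` queue are hypothetical names, and the queue is an assumed stand-in for the asynchronous dispatch from the allocator to the master. The method names only mirror `recoverResources`, `updateAllocation`, and `offerCallback` from the description.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    """Hypothetical stand-in for a persistent volume resource."""
    volume_id: str
    shared: bool


class Allocator:
    """Toy allocator (assumption): tracks what it believes each framework holds."""

    def __init__(self, master_mailbox):
        self.master_mailbox = master_mailbox  # queued offerCallback dispatches
        self.allocated = {}                   # framework -> set of Disk
        self.recovered = []                   # (framework, Disk) ready to re-offer

    def recover_resources(self, framework, disk):
        self.allocated[framework].discard(disk)
        self.recovered.append((framework, disk))

    def allocate(self):
        # Re-allocates whatever was recovered. The master only sees the
        # result later, when it drains its mailbox.
        while self.recovered:
            framework, disk = self.recovered.pop()
            self.allocated[framework].add(disk)
            self.master_mailbox.append((framework, disk))

    def update_allocation(self, framework, old, new):
        self.allocated[framework].discard(old)
        self.allocated[framework].add(new)


def simulate_race():
    shared = Disk("vol1", shared=True)
    regular = Disk("vol1", shared=False)
    mailbox = deque()
    allocator = Allocator(mailbox)

    # Step 1: both frameworks are offered the shared disk.
    master_view = {"framework1": {shared}, "framework2": {shared}}
    allocator.allocated = {"framework1": {shared}, "framework2": {shared}}

    # Steps 2-4: framework2 requests the destroy; the master rescinds
    # framework1's offer and recovers its resources in the allocator.
    master_view["framework1"].discard(shared)
    allocator.recover_resources("framework1", shared)

    # Step 5 (race): the allocator allocates before the destroy is applied.
    allocator.allocate()

    # Step 6: the master applies the destroy; the disk is no longer shared.
    allocator.update_allocation("framework2", shared, regular)
    master_view["framework2"] = {regular}

    # Step 7: the master processes the stale offerCallback from step 5.
    while mailbox:
        framework, disk = mailbox.popleft()
        master_view[framework].add(disk)

    return master_view


if __name__ == "__main__":
    view = simulate_race()
    # The same volume now shows up both as shared and as not shared.
    print({fw: sorted(d.shared for d in disks) for fw, disks in view.items()})
```

In this model, framework1 ends up with the shared form of `vol1` while framework2 holds the non-shared form, which is the inconsistent state described above.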

Issue Links

causes

MESOS-8778 Fatal error in `DRFSorter::unallocated()` in `SharedPersistentVolumeRescindOnDestroy` test.

Open

is related to

MESOS-4553 Manage offers in allocator.

Accepted

People

Assignee: Unassigned

Reporter: Meng Zhu

Votes: 0

Watchers: 3

Dates

Created: 26/Apr/18 19:10

Updated: 22/Jan/19 18:48