Details
Description
On our internal cluster with a custom authorizer module we observed the following problem:
I1212 14:16:58.424782 12492 master.cpp:4489] Processing ACCEPT call for offers: [ 98c96586-1980-4007-9651-18dd837f529a-O16233 ] on agent 98c96586-1980-4007-9651-18dd837f529a-S1 at slave(1)@172.31.15.78:5051 (172.31.15.78) for framework 98c96586-1980-4007-9651-18dd837f529a-0009 (storage) I1212 14:16:58.424831 12492 master.cpp:3949] Authorizing principal 'storage-principal' to destroy disk 'disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_d17d081f-c138-472d-9e0b-aa6b21e43c7c,profile-b)]:100' I1212 14:16:58.424913 12492 master.cpp:3949] Authorizing principal 'storage-principal' to destroy disk 'disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_c62a4e89-f902-4968-96d8-69b9efa5f35a,profile-b)]:100' I1212 14:16:58.424979 12492 master.cpp:3949] Authorizing principal 'storage-principal' to destroy disk 'disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_1f490a57-9e41-41a0-b7be-408147281113,profile-b)]:100' I1212 14:16:58.425055 12492 master.cpp:3653] Authorizing principal 'storage-principal' to reserve resources 'disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_c60ec637-3f67-4d41-bbc8-448c022c05fb,profile-b)]:100' I1212 14:16:58.425118 12492 master.cpp:3949] Authorizing principal 'storage-principal' to destroy disk 'disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_5ce79163-aef2-4fd4-8073-f310dec239bf,profile-b)]:100' I1212 14:16:58.425499 12488 authorizer.cpp:957] dstport=5050 dstip=172.31.5.116 result=deny object="" action="DESTROY_MOUNT_DISK" uid="storage-principal" reason="" authorizer="mesos-master" timestamp=2018-12-12 14:16:58.425453056+00:00 type=audit I1212 14:16:58.426434 12483 master.cpp:5769] Processing DESTROY_DISK operation for volume disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_d17d081f-c138-472d-9e0b-aa6b21e43c7c,profile-b)]:100 from framework 98c96586-1980-4007-9651-18dd837f529a-0 .78:5051 (172.31.15.78) I1212 14:16:58.426746 12483 master.cpp:5769] Processing DESTROY_DISK operation for volume disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_c62a4e89-f902-4968-96d8-69b9efa5f35a,profile-b)]:100 from framework 98c96586-1980-4007-9651-18dd837f529a-0 .78:5051 (172.31.15.78) I1212 14:16:58.427112 12483 master.cpp:5769] Processing DESTROY_DISK operation for volume disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_1f490a57-9e41-41a0-b7be-408147281113,profile-b)]:100 from framework 98c96586-1980-4007-9651-18dd837f529a-0 .78:5051 (172.31.15.78) W1212 14:16:58.427274 12483 master.cpp:2275] Dropping RESERVE operation from framework 98c96586-1980-4007-9651-18dd837f529a-0009 (storage): Not authorized to reserve resources as 'storage-principal' I1212 14:16:58.427366 12483 master.cpp:5769] Processing DESTROY_DISK operation for volume disk(allocated: dcos-storage)(reservations: [(DYNAMIC,dcos-storage,storage-principal)])[MOUNT(g4FmoSCHyiqTZWw1nkOd0v9_5ce79163-aef2-4fd4-8073-f310dec239bf,profile-b)]:100 from framework 98c96586-1980-4007-9651-18dd837f529a-0009 (storage) to agent 98c96586-1980-4007-9651-18dd837f529a-S1 at slave(1)@172.31.15.78:5051 (172.31.15.78)
The authorizer module caches authorization results, and that's why there was only one logged authorization requset. The problem is that, the logged request was for CREATE_MOUNT_DISK, and the result was deny, but despite the denial, all CREATE_DISK operations were processed, however another RESERVE operation was dropped because of this denial.
The bug is that the master pushed a authorization future in the futures vector in Master::accept for each DESTROY_DISK:
https://github.com/apache/mesos/blob/18356bf3f4ac730b4a798261aad042555c4a4834/src/master/master.cpp#L4599-L4601
However, the master never popped and checked the future in Master::_accept, but go ahead to process the operation:
https://github.com/apache/mesos/blob/18356bf3f4ac730b4a798261aad042555c4a4834/src/master/master.cpp#L5706
The future ended up mismatched with the RESERVE operation, causing it to be dropped.
Updated: a similar bug would happen if we have a validation error for LAUNCH_GROUP. See MESOS-9480.