Details
Description
The test is flaky because:
- It assumes the mock RP never reregisters, which might not be true.
- It does not wait for the task and executor to be reaped, which would lead to a race between containerizer destroy and test teardown and cause cgroups cleanup to fail.
- It fast-forwards the clock, which might lead to containerizer destroy failures.
- It assumes that the framework only receives two status updates, which might not be true.
Example failure log:
E0410 00:18:23.526867 1251 slave.cpp:3118] Failed to update resources for container f941cb68-9f13-418c-be1b-702e5927b1eb of executor 'default' of framework ca96f624-9590-4776-9e83-39714cebd25f-0000, destroying container: Collect failed: Failed to publish resources 'disk(allocated: foo)[RAW]:200' for container f941cb68-9f13-418c-be1b-702e5927b1eb: Resource provider 616834b9-4dbb-45a7-b762-831ce5e8534a is not subscribed I0410 00:18:23.526957 1251 containerizer.cpp:2576] Destroying container f941cb68-9f13-418c-be1b-702e5927b1eb in RUNNING state I0410 00:18:23.526979 1251 containerizer.cpp:3278] Transitioning the state of container f941cb68-9f13-418c-be1b-702e5927b1eb from RUNNING to DESTROYING I0410 00:18:23.526989 1251 containerizer.cpp:2576] Destroying container f941cb68-9f13-418c-be1b-702e5927b1eb.523acde5-8c21-4f3f-af71-7cb84b54803e in RUNNING state I0410 00:18:23.526996 1251 containerizer.cpp:3278] Transitioning the state of container f941cb68-9f13-418c-be1b-702e5927b1eb.523acde5-8c21-4f3f-af71-7cb84b54803e from RUNNING to DESTROYING I0410 00:18:23.527102 1251 linux_launcher.cpp:576] Asked to destroy container f941cb68-9f13-418c-be1b-702e5927b1eb.523acde5-8c21-4f3f-af71-7cb84b54803e ... E0410 00:18:23.535424 1246 slave.cpp:6572] Termination of executor 'default' of framework ca96f624-9590-4776-9e83-39714cebd25f-0000 failed: Failed to destroy nested containers: Failed to kill all processes in the container: Timed out after 1mins ... I0410 00:18:23.535817 1252 master.cpp:8983] Executor 'default' of framework ca96f624-9590-4776-9e83-39714cebd25f-0000 on agent ca96f624-9590-4776-9e83-39714cebd25f-S0 at slave(699)@172.16.10.211:33823 (ip-172-16-10-211.ec2.internal): wait status -1 ... ../../src/tests/mesos.cpp:926: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_b1965800-016c-494b-8d6d-c70437c9405f/f941cb68-9f13-418c-be1b-702e5927b1eb': Device or resource busy