[MESOS-6274] Agent should not allow HTTP executors to re-subscribe before containerizer recovery is done. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.0.0, 1.0.1
Fix Version/s: 1.0.2, 1.1.0
Component/s: None
Labels: mesosphere
Target Version/s: 1.0.2
Sprint: Mesosphere Sprint 44
Story Points: 3

Description

In the old API, the agent sends a reconnect request to the executor, and the executor then registers with the agent.

In the new API, however, the agent allows an executor to re-subscribe before containerizer recovery is done. This is problematic because the containerizer does not know about the containers yet, so calling containerizer->update fails and the agent destroys the container. In the log below, the Subscribe request arrives while the containerizer is still recovering, and the resource update fails with "Unknown container":

[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693418 22646 containerizer.cpp:580] Recovering containerizer
[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693444 22646 containerizer.cpp:636] Recovering container 568968cc-f41c-475a-bb2b-45d8babd853d for executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693445 22645 http.cpp:273] HTTP POST for /agent/api/v1/executor from 172.30.2.198:42683
[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693567 22645 slave.cpp:3017] Received Subscribe request for HTTP executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP)
[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693613 22645 slave.cpp:3080] Creating a marker file for HTTP based executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP) at path '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_ROOT_CGROUPS_ReconnectDefaultExecutor_XpQvvJ/meta/slaves/7e4c8518-cb45-4b09-9fa8-c029d56289e2-S0/frameworks/7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000/executors/default/runs/568968cc-f41c-475a-bb2b-45d8babd853d/http.marker'
[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693733 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: 6cc3f9a7-d020-46f0-82c1-39fbb9d43786) for task db1f9b1b-75d2-4d96-831f-48d6f28301e8 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
[04:04:11]W:	 [Step 10/10] I0929 04:04:11.693801 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: f80d217b-7844-4134-8cc8-db6998ac437e) for task 3a583cbb-8ea9-440a-864d-e68a23472368 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
[04:04:11]W:	 [Step 10/10] E0929 04:04:11.694232 22648 slave.cpp:2055] Failed to update resources for container 568968cc-f41c-475a-bb2b-45d8babd853d of executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000, destroying container: Collect failed: Unknown container
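
The race in the log above can be modelled with a small, self-contained sketch. This is not Mesos code: ToyAgent, ToyContainerizer, Response, and the status codes are hypothetical names and values chosen for illustration, and the gate shown here (rejecting the subscribe while the agent is still recovering so that the executor retries later) is only one plausible way to enforce the ordering that the title asks for.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical, simplified model of the race; none of these are Mesos APIs.
enum class AgentState { RECOVERING, RUNNING };

struct Response {
  int code;            // HTTP-like status code (illustrative values)
  std::string reason;  // Human-readable explanation
};

class ToyContainerizer {
public:
  // Recovery is what makes the containerizer aware of a container.
  void recover(const std::string& containerId) { known.insert(containerId); }

  // Stands in for the "Unknown container" failure: an update only
  // succeeds for containers that have already been recovered.
  bool update(const std::string& containerId) const {
    return known.count(containerId) > 0;
  }

private:
  std::unordered_set<std::string> known;
};

class ToyAgent {
public:
  // gateOnRecovery == false reproduces the reported behaviour;
  // gateOnRecovery == true rejects early subscribes instead.
  explicit ToyAgent(bool gateOnRecovery) : gate(gateOnRecovery) {}

  // Marks containerizer recovery as finished for the given container.
  void finishRecovery(const std::string& containerId) {
    containerizer.recover(containerId);
    state = AgentState::RUNNING;
  }

  // Handles an executor's (re-)subscribe request.
  Response subscribe(const std::string& containerId) {
    if (gate && state == AgentState::RECOVERING) {
      // Refuse for now; the executor is expected to retry later.
      return {503, "Agent has not finished recovery"};
    }

    if (!containerizer.update(containerId)) {
      // Without the gate, an early subscribe ends up here and the
      // container is destroyed.
      destroyed = true;
      return {500, "Unknown container"};
    }

    return {200, "Subscribed"};
  }

  bool containerDestroyed() const { return destroyed; }

private:
  bool gate;
  AgentState state = AgentState::RECOVERING;
  bool destroyed = false;
  ToyContainerizer containerizer;
};
```

With gateOnRecovery set to false, a subscribe that arrives before finishRecovery() reaches the failing update and the container is marked destroyed. With the gate enabled, the same early subscribe is refused without touching the containerizer, and a retry after finishRecovery() succeeds.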

Issue Links

is related to: MESOS-6273 SlaveRecoveryTest/0.KillTaskWithHTTPExecutor is flaky (Resolved)

relates to: MESOS-9667 Check failure when executor for task using resource provider resources subscribes before agent is registered (Resolved)

People

Assignee: Anand Mazumdar
Reporter: Jie Yu
Shepherd: Vinod Kone
Votes: 0
Watchers: 3

Dates

Created: 29/Sep/16 04:57
Updated: 21/Mar/19 18:06
Resolved: 04/Oct/16 02:46