Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.4.0, 1.5.0
- None
- Mesosphere Sprint 75, Mesosphere Sprint 76, Mesosphere RI-6 Sprint 2018-30
- 3
Description
Two nested containers were launched and transitioned to FETCHING at nearly the same time, and both tried to fetch the same artifacts. The first one failed to fetch some artifacts and transitioned to DESTROYING. However, the second nested container got stuck in FETCHING and the LAUNCH_NESTED_CONTAINER call never returned.
I0226 06:27:15.000000 9494 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER call for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4'
...
I0226 06:27:15.000000 9499 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER call for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb'
...
I0226 06:27:15.000000 9493 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from ISOLATING to FETCHING
I0226 06:27:15.000000 9500 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb from ISOLATING to FETCHING
...
E0226 06:29:45.000000 9496 fetcher.cpp:568] Failed to run mesos-fetcher: Failed to fetch all URIs for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': exited with status 1
W0226 06:29:45.000000 9497 http.cpp:2758] Failed to launch container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4: Failed to fetch all URIs for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': exited with status 1
I0226 06:29:45.000000 9497 containerizer.cpp:2354] Destroying container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 in FETCHING state
I0226 06:29:45.000000 9497 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from FETCHING to DESTROYING
A closer look at the sandbox logs shows that these two containers bypassed the fetcher cache when downloading artifacts, so they should not have interfered with each other. However, the log of the stuck container stopped at "Downloading resource from ...". This might indicate that the server (here, Amazon S3) accepted the connection but never finished sending an HTTP response. To avoid containers getting stuck in FETCHING, we should add a download timeout that aborts the fetch when the download speed is too low.
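To make the proposal concrete, here is a minimal sketch of the "abort when the download speed is too low" check. It is not the Mesos fetcher code and not necessarily how this issue was fixed: the `StallDetector` class, its method names, and its parameters are all hypothetical. The semantics loosely mirror libcurl's `CURLOPT_LOW_SPEED_LIMIT` / `CURLOPT_LOW_SPEED_TIME` pair, in which a transfer is aborted once its speed has stayed below a byte-per-second limit for a given number of seconds.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of Mesos): reports a stall once the
// download speed has stayed below `minBytesPerSecond` for at least
// `stallTimeout`.
class StallDetector
{
public:
  using Clock = std::chrono::steady_clock;

  StallDetector(
      double minBytesPerSecond,
      std::chrono::seconds stallTimeout,
      Clock::time_point start)
    : minBytesPerSecond_(minBytesPerSecond),
      stallTimeout_(stallTimeout),
      lastUpdate_(start),
      lastFast_(start),
      lastBytes_(0) {}

  // Call periodically with the total number of bytes received so far
  // (assumed to be non-decreasing). Returns true if the caller should
  // abort the download.
  bool update(uint64_t totalBytes, Clock::time_point now)
  {
    const std::chrono::duration<double> elapsed = now - lastUpdate_;

    if (elapsed.count() > 0.0) {
      // Average speed since the previous update.
      const double speed =
        static_cast<double>(totalBytes - lastBytes_) / elapsed.count();

      // Any sufficiently fast interval resets the stall timer.
      if (speed >= minBytesPerSecond_) {
        lastFast_ = now;
      }

      lastBytes_ = totalBytes;
      lastUpdate_ = now;
    }

    return now - lastFast_ >= stallTimeout_;
  }

private:
  double minBytesPerSecond_;
  std::chrono::seconds stallTimeout_;
  Clock::time_point lastUpdate_;
  Clock::time_point lastFast_;
  uint64_t lastBytes_;
};
```

A caller would invoke `update()` from a progress callback or a periodic timer and kill the fetch when it returns true. Because the check is based on speed rather than total elapsed time, a large but steadily progressing download is not aborted, while a connection that was accepted but never delivers a response is.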
Issue Links
- relates to MESOS-9221 If some image layers are large, the image pulling may stuck due to the authorized token expired. (Resolved)