Mesos / MESOS-8620

Containers stuck in FETCHING possibly due to unresponsive server.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0
    • Fix Version/s: 1.4.3, 1.5.2, 1.6.0
    • None
    • Sprint: Mesosphere Sprint 75, Mesosphere Sprint 76, Mesosphere RI-6 Sprint 2018-30
    • 3

    Description

      Two nested containers were launched and transitioned to FETCHING at nearly the same time, and both tried to fetch the same artifacts. The first one failed to fetch some artifacts and transitioned to DESTROYING. However, the second nested container got stuck in FETCHING, and the LAUNCH_NESTED_CONTAINER call never returned.

      I0226 06:27:15.000000  9494 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER call for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4'
      ...
      I0226 06:27:15.000000  9499 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER call for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb'
      ...
      I0226 06:27:15.000000  9493 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from ISOLATING to FETCHING
      I0226 06:27:15.000000  9500 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb from ISOLATING to FETCHING
      ...
      E0226 06:29:45.000000  9496 fetcher.cpp:568] Failed to run mesos-fetcher: Failed to fetch all URIs for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': exited with status 1
      W0226 06:29:45.000000  9497 http.cpp:2758] Failed to launch container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4: Failed to fetch all URIs for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': exited with status 1
      I0226 06:29:45.000000  9497 containerizer.cpp:2354] Destroying container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 in FETCHING state
      I0226 06:29:45.000000  9497 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from FETCHING to DESTROYING
      

      A closer look at the sandbox logs showed that both containers bypassed the fetcher cache when downloading artifacts, so they should not have interfered with each other. However, the log of the stuck container stopped at "Downloading resource from ...". This might indicate that the server (Amazon S3 in this case) accepted the connection but never finished sending an HTTP response. To avoid containers getting stuck in FETCHING, we should add a download timeout that aborts the fetch when the download speed is too low.
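
      The proposed low-speed timeout can be sketched roughly as follows. This is only an illustration of the idea, not the Mesos fetcher's code: the names `StallDetector`, `DownloadStalled`, and `drain`, and the parameters `min_bytes_per_sec` and `stall_seconds`, are hypothetical. The semantics are similar in spirit to curl's `--speed-limit` and `--speed-time` options: abort if the transfer stays below a minimum rate for a whole time window.

```python
class DownloadStalled(Exception):
    """Raised when a transfer stays below the minimum rate for a full window."""


class StallDetector:
    """Hypothetical helper: flags a download whose average rate over a
    window of `stall_seconds` is below `min_bytes_per_sec`."""

    def __init__(self, min_bytes_per_sec, stall_seconds, now=0.0):
        self.min_bytes_per_sec = min_bytes_per_sec
        self.stall_seconds = stall_seconds
        self.window_start = now
        self.window_bytes = 0

    def update(self, nbytes, now):
        """Record `nbytes` received at time `now`; return True if stalled."""
        self.window_bytes += nbytes
        elapsed = now - self.window_start
        if elapsed < self.stall_seconds:
            # The window is not complete yet, so no verdict.
            return False
        stalled = self.window_bytes < self.min_bytes_per_sec * elapsed
        # Start a fresh window so an earlier burst cannot mask a later stall.
        self.window_start = now
        self.window_bytes = 0
        return stalled


def drain(events, detector):
    """Consume (timestamp, nbytes) ticks and return the total bytes received.

    A tick with nbytes == 0 models a read that timed out without data,
    e.g. a server that accepted the connection but sent nothing.
    """
    total = 0
    for now, nbytes in events:
        total += nbytes
        if detector.update(nbytes, now):
            raise DownloadStalled("download too slow at t=%s" % now)
    return total
```

      In a real fetcher the ticks would presumably come from socket reads with a short poll timeout and a monotonic clock. They are passed in explicitly here so the behavior can be checked without a network.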

            People

              Assignee: Chun-Hung Hsiao (chhsia0)
              Reporter: Chun-Hung Hsiao (chhsia0)
              Shepherd: Gilbert Song