  1. Mesos
  2. MESOS-1758

Freezer failure leads to lost task during container destruction.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: containerization
    • Labels:
      None
    • Target Version/s:
    • Sprint:
      Mesos Q3 Sprint 5
    • Story Points:
      2

      Description

      In the past we've seen numerous issues around the freezer. Lately, on the 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:

      (1) An OOM occurs.
      (2) There is no indication of the OOM in the kernel logs.
      (3) The slave is unable to freeze the cgroup.
      (4) The task is marked as lost.

      I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB
      
      MEMORY STATISTICS:
      cache 7958691840
      rss 8281653248
      mapped_file 9474048
      pgpgin 4487861
      pgpgout 522933
      pgfault 2533780
      pgmajfault 11
      inactive_anon 0
      active_anon 8281653248
      inactive_file 7631708160
      active_file 326852608
      unevictable 0
      hierarchical_memory_limit 16240345088
      total_cache 7958691840
      total_rss 8281653248
      total_mapped_file 9474048
      total_pgpgin 4487861
      total_pgpgout 522933
      total_pgfault 2533780
      total_pgmajfault 11
      total_inactive_anon 0
      total_active_anon 8281653248
      total_inactive_file 7631728640
      total_active_file 326852608
      total_unevictable 0
      I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated
      I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3'
      I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms
      I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms
      I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms
      I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms
      I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed age: 5.630238827780799days
      I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms
      I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json'
      E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-0000002563-0000' failed: Failed to destroy container: discarded future
      I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000 from @0.0.0.0:0
      I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.406991 25476 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.410475 25476 status_update_manager.cpp:373] Forwarding status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000 to master@<scrubbed_ip>:5050
      I0903 16:47:43.439923 25480 status_update_manager.cpp:398] Received status update acknowledgement (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.440115 25480 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.443595 25480 slave.cpp:2709] Cleaning up executor 'E' of framework 201104070004-0000002563-0000
      

      We should consider avoiding the freezer entirely in favor of a kill(2) loop. We don't have to wait for pid namespaces to remove the freezer dependency.
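To make the proposal concrete, here is a minimal sketch of a kill(2) loop that does not use the freezer. It is not the Mesos implementation: the function names, the `max_rounds` and `pause` parameters, and the injectable `kill` callable are all assumptions made for illustration. The sketch re-reads the cgroup's `cgroup.procs` file after every pass, because without the freezer a process can fork between the read and the signal, so one pass is not guaranteed to catch everything.

```python
import os
import signal
import time


def read_pids(procs_path):
    """Return the PIDs currently listed in a cgroup.procs-style file."""
    with open(procs_path) as f:
        return [int(line) for line in f if line.strip()]


def kill_loop(procs_path, kill=os.kill, max_rounds=100, pause=0.01):
    """Send SIGKILL to every listed PID until the file lists no PIDs.

    Returns True if the cgroup emptied within max_rounds, False otherwise.
    `kill` defaults to os.kill and is a parameter only so the loop can be
    exercised without signalling real processes (an illustrative choice).
    """
    for _ in range(max_rounds):
        pids = read_pids(procs_path)
        if not pids:
            return True
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # The process exited between the read and the kill.
        time.sleep(pause)  # Let the kernel reap before re-reading.
    return not read_pids(procs_path)
```

A caller would pass the path to the container's `cgroup.procs` file and treat a `False` return as a destroy failure, rather than leaving the outcome to a discarded future.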

      At the very least, when the freezer fails, we should proceed with a kill(2) loop to ensure that we destroy the cgroup.
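The fallback described here could look roughly like the sketch below. It is an assumption-laden illustration, not the fix that shipped: `cgroup` is a hypothetical object exposing `freeze()`, `is_frozen()`, `thaw()`, `kill_all()` and `is_empty()`, and the timeout and round limits are made-up defaults. The point is only the control flow: one bounded freeze attempt, then a kill loop whether or not the freeze succeeded, so that a stuck freezer no longer ends in TASK_LOST.

```python
import time


def destroy(cgroup, freeze_timeout=10.0, poll=0.1, max_kill_rounds=100):
    """Destroy a cgroup's tasks, falling back to a kill loop if freezing fails.

    Returns "freezer" if the freeze succeeded and "kill-loop" if it timed out.
    `cgroup` is a hypothetical interface; see the surrounding text.
    """
    cgroup.freeze()
    deadline = time.monotonic() + freeze_timeout
    while not cgroup.is_frozen() and time.monotonic() < deadline:
        time.sleep(poll)
    froze = cgroup.is_frozen()

    if froze:
        # No task can fork while frozen, so one pass reaches all of them.
        cgroup.kill_all()
    # Thaw either way: signalled tasks need to run to exit, and a
    # partially applied freeze must be undone before the fallback.
    cgroup.thaw()

    # Kill loop: repeat until the cgroup is empty or we give up.
    for _ in range(max_kill_rounds):
        if cgroup.is_empty():
            return "freezer" if froze else "kill-loop"
        cgroup.kill_all()
        time.sleep(poll)
    raise RuntimeError("cgroup still has tasks after the kill loop")
```

Raising an explicit error when the loop gives up keeps the failure visible to the caller.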

              People

              • Assignee:
                Vinod Kone (vinodkone)
              • Reporter:
                Benjamin Mahler (bmahler)
              • Shepherd:
                Ian Downes
