  1. Mesos
  2. MESOS-1758

Freezer failure leads to lost task during container destruction.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: containerization
    • Labels:
      None
    • Target Version/s:
    • Sprint:
      Mesos Q3 Sprint 5
    • Story Points:
      2

      Description

      In the past we've seen numerous issues around the freezer. Lately, on the 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:

      (1) An OOM occurs.
      (2) The kernel logs show no indication of the OOM.
      (3) The slave is unable to freeze the cgroup.
      (4) The task is marked as lost.

      I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB
      
      MEMORY STATISTICS:
      cache 7958691840
      rss 8281653248
      mapped_file 9474048
      pgpgin 4487861
      pgpgout 522933
      pgfault 2533780
      pgmajfault 11
      inactive_anon 0
      active_anon 8281653248
      inactive_file 7631708160
      active_file 326852608
      unevictable 0
      hierarchical_memory_limit 16240345088
      total_cache 7958691840
      total_rss 8281653248
      total_mapped_file 9474048
      total_pgpgin 4487861
      total_pgpgout 522933
      total_pgfault 2533780
      total_pgmajfault 11
      total_inactive_anon 0
      total_active_anon 8281653248
      total_inactive_file 7631728640
      total_active_file 326852608
      total_unevictable 0
      I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated
      I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3'
      I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms
      I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms
      I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms
      I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms
      I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed age: 5.630238827780799days
      I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms
      I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json'
      E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-0000002563-0000' failed: Failed to destroy container: discarded future
      I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000 from @0.0.0.0:0
      I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
      I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.406991 25476 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.410475 25476 status_update_manager.cpp:373] Forwarding status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000 to master@<scrubbed_ip>:5050
      I0903 16:47:43.439923 25480 status_update_manager.cpp:398] Received status update acknowledgement (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.440115 25480 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
      I0903 16:47:43.443595 25480 slave.cpp:2709] Cleaning up executor 'E' of framework 201104070004-0000002563-0000
      

      We should consider avoiding the freezer entirely in favor of a kill(2) loop. We don't have to wait for pid namespaces to remove the freezer dependency.

      At the very least, when the freezer fails, we should proceed with a kill(2) loop to ensure that we destroy the cgroup.
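
      A minimal sketch of the kill(2) loop proposed above, written in Python for illustration only; it is not the Mesos implementation. The names read_pids, kill_loop and procs_path are hypothetical, and the sketch assumes the cgroup's member PIDs can be read one per line from its cgroup.procs file. The loop re-reads the list on every pass so that processes forked between passes are also killed, and it gives up after a timeout rather than blocking forever.

```python
import os
import signal
import time


def read_pids(procs_path):
    """Return the PIDs listed in a cgroup.procs-style file (one per line)."""
    with open(procs_path) as f:
        return [int(line) for line in f if line.strip()]


def kill_loop(procs_path, timeout=10.0, interval=0.1, kill=os.kill):
    """Send SIGKILL to every listed PID until the list is empty.

    Returns True once no PIDs remain, False if the timeout elapses first.
    The kill parameter exists only so the loop can be exercised without
    signalling real processes (an assumption of this sketch).
    """
    deadline = time.monotonic() + timeout
    while True:
        pids = read_pids(procs_path)
        if not pids:
            return True
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # The process exited between the read and the kill.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

      Under these assumptions, a caller would fall back to something like kill_loop("/sys/fs/cgroup/freezer/mesos/<container-id>/cgroup.procs") once the freeze attempt times out, and remove the cgroup only if the call returns True.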

          Activity

          yasumoto Joe Smith added a comment -

          Can we make sure this gets into 0.21.0? This is continuing to hit us with LOST tasks, so just want to make sure it gets included.

          Thanks!

          jieyu Jie Yu added a comment -

          Instead of investing more time in fixing the cgroups freezer, I am in favor of implementing PID namespace support, as that will be our ultimate solution.

          vinodkone Vinod Kone added a comment -

          Short-term fix until we get PID namespace support: https://reviews.apache.org/r/25457/

          vinodkone Vinod Kone added a comment -

          commit 63ed98634444f927beb2cf074aacc838fb601329
          Author: Vinod Kone <vinodkone@gmail.com>
          Date: Mon Sep 8 15:40:54 2014 -0700

          Added kill() to freezerTimedOut() in cgroups.cpp.
          This is a short-term fix for MESOS-1758.

          Review: https://reviews.apache.org/r/25457


            People

            • Assignee:
              vinodkone Vinod Kone
              Reporter:
              bmahler Benjamin Mahler
              Shepherd:
              Ian Downes