Mesos / MESOS-9951

A likely stop-the-world (STW) pause in the master registry's GC routine

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I'm running a Mesos 1.7.3 master, which recently appeared to stall for about half a minute (roughly 38 seconds, judging by the log below).

      I0820 20:53:56.705075 4185864 registrar.cpp:487] Applied 1 operations in 1.163968ms; attempting to update the registry
      I0820 20:53:56.705541 4185861 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 353
      I0820 20:53:56.705739 4185875 replica.cpp:541] Replica received write request for position 353 from __req_res__(568)@10.10.23.74:5050
      I0820 20:53:56.721997 4185859 master.cpp:8753] Executor 'mt:l00000000004115106217:1' of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24): exited with status 0
      I0820 20:53:56.722085 4185859 master.cpp:11215] Removing executor 'mt:l00000000004115106217:1' with resources [] of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24)
      I0820 20:53:56.742550 4185877 replica.cpp:695] Replica received learned notice for position 353 from log-network(1)@10.10.23.74:5050
      I0820 20:53:56.784256 4185881 registrar.cpp:544] Successfully updated the registry in 79.105792ms
      I0820 20:53:56.784489 4185857 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 354
      I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request for position 354 from __req_res__(571)@10.10.23.74:5050
      I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354 from log-network(1)@10.10.23.74:5050
      I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents from the registry
      I0820 20:54:34.798610 4185864 master.cpp:8510] Status update TASK_FINISHED (Status UUID: 6304aa62-2854-4d46-ad09-ffbf3347f24b) for task mt:l00000000004115107127:1 of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 from agent bd5550a6-4089-482d-aa96-3389bae5b0de-S138 at slave(1)@10.17.44.133:5051 (10.17.44.133)
      

      Note that no log lines were produced between 20:53:56 and 20:54:34.
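A gap like this can be located mechanically in a longer master log. The following is a minimal sketch, assuming the standard glog line prefix seen in the excerpt above (severity letter, MMDD, then HH:MM:SS.microseconds). The helper names `parse_timestamp` and `find_gaps`, the 5-second threshold, and the placeholder year are illustrative assumptions, not part of Mesos; glog does not record the year, so the caller has to supply one.

```python
import re
from datetime import datetime, timedelta

# glog line prefix: severity letter, MMDD, HH:MM:SS.microseconds
GLOG_PREFIX = re.compile(
    r"^\s*[IWEF](\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6}) "
)


def parse_timestamp(line, year=2000):
    """Return the timestamp of a glog line, or None if there is no glog prefix.

    The default year is only a placeholder: glog omits the year.
    """
    match = GLOG_PREFIX.match(line)
    if match is None:
        return None
    month, day, hour, minute, second, micros = (int(g) for g in match.groups())
    return datetime(year, month, day, hour, minute, second, micros)


def find_gaps(lines, threshold=timedelta(seconds=5), year=2000):
    """Yield (previous_ts, next_ts, gap) for consecutive log lines
    whose timestamps are further apart than `threshold`."""
    previous = None
    for line in lines:
        ts = parse_timestamp(line, year)
        if ts is None:
            continue  # skip continuation lines and anything that is not glog output
        if previous is not None and ts - previous > threshold:
            yield previous, ts, ts - previous
        previous = ts


# The two lines surrounding the silent period in the excerpt above (messages shortened).
excerpt = [
    "I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354",
    "I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents",
]

for before, after, gap in find_gaps(excerpt):
    print(f"{gap.total_seconds():.3f}s gap between {before.time()} and {after.time()}")
# prints: 37.973s gap between 20:53:56.825901 and 20:54:34.798512
```

Run against the full log file (for example `find_gaps(open(path))`, where `path` is wherever the master log lives), this lists every silent period longer than the threshold, which helps show whether the stall recurs around each registry GC.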

      atop shows that one core (the one used by the master) was fully utilized during the STW period.

            People

            • Assignee: Unassigned
            • Reporter: carlone longfei