  1. Mesos
  2. MESOS-9951

Likely stop-the-world (STW) stall in the master registry's GC routine


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      I'm running a Mesos 1.7.3 master, which recently appeared to stall for about half a minute. The master log around the stall:

      I0820 20:53:56.705075 4185864 registrar.cpp:487] Applied 1 operations in 1.163968ms; attempting to update the registry
      I0820 20:53:56.705541 4185861 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 353
      I0820 20:53:56.705739 4185875 replica.cpp:541] Replica received write request for position 353 from __req_res__(568)@10.10.23.74:5050
      I0820 20:53:56.721997 4185859 master.cpp:8753] Executor 'mt:l00000000004115106217:1' of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24): exited with status 0
      I0820 20:53:56.722085 4185859 master.cpp:11215] Removing executor 'mt:l00000000004115106217:1' with resources [] of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24)
      I0820 20:53:56.742550 4185877 replica.cpp:695] Replica received learned notice for position 353 from log-network(1)@10.10.23.74:5050
      I0820 20:53:56.784256 4185881 registrar.cpp:544] Successfully updated the registry in 79.105792ms
      I0820 20:53:56.784489 4185857 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 354
      I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request for position 354 from __req_res__(571)@10.10.23.74:5050
      I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354 from log-network(1)@10.10.23.74:5050
      I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents from the registry
      I0820 20:54:34.798610 4185864 master.cpp:8510] Status update TASK_FINISHED (Status UUID: 6304aa62-2854-4d46-ad09-ffbf3347f24b) for task mt:l00000000004115107127:1 of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 from agent bd5550a6-4089-482d-aa96-3389bae5b0de-S138 at slave(1)@10.17.44.133:5051 (10.17.44.133)
      

      Note that no log lines are produced between 20:53:56 and 20:54:34.

      atop shows that one core (the one used by the master) is fully utilized during the stop-the-world (STW) period.
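The silent interval in the log above can be measured from the glog timestamps instead of by eye. The following is a minimal sketch, not part of Mesos: the names `parse_timestamp` and `find_gaps`, the 5-second default threshold, and the placeholder year are all assumptions made for illustration (glog lines carry no year). The sample messages are shortened copies of the last line before the stall and the first line after it.

```python
import re
from datetime import datetime

# glog prefix: severity letter, mmdd, then hh:mm:ss.microseconds.
# An optional leading "// " is tolerated because pasted logs sometimes carry it.
GLOG_PREFIX = re.compile(
    r"^\s*(?://\s*)?[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) "
)


def parse_timestamp(line, year=2000):
    """Return the timestamp of a glog line, or None if the line does not match.

    The year is a placeholder (assumption): glog does not record it.
    """
    match = GLOG_PREFIX.match(line)
    if match is None:
        return None
    return datetime.strptime(
        f"{year}{match.group(1)} {match.group(2)}", "%Y%m%d %H:%M:%S.%f"
    )


def find_gaps(lines, threshold_seconds=5.0):
    """Return (previous, current, seconds) for consecutive log lines whose
    timestamps are at least threshold_seconds apart."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # skip blank lines and anything that is not a glog line
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps


if __name__ == "__main__":
    # Messages shortened from the log in this report.
    sample = [
        "I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354",
        "I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents from the registry",
    ]
    for start, end, delta in find_gaps(sample):
        print(f"{start.time()} -> {end.time()}: {delta:.6f}s with no log output")
```

Run against the full master log, a check like this would also show whether similar stalls precede other "Garbage collected ... agents from the registry" lines, which would help confirm or rule out the GC routine as the cause.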

          People

            Assignee: Unassigned
            Reporter: carlone longfei