MESOS-7711

Master updates registry for reregistering agents even when they haven't been unreachable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: master
    • Labels:
      None

      Description

      During a master failover we observed many registry updates, on average one for every two agents, as indicated by log lines such as:

      I0609 04:46:25.220196 48864 registrar.cpp:550] Successfully updated the registry in 42.904064ms
      


      In this case few agents had ever been unreachable, so most of these registry updates were redundant. Each registry update also incurs the time spent applying the operations:

      I0609 04:46:26.475761 48897 registrar.cpp:493] Applied 1 operations in 11.673082ms; attempting to update the registry
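
To quantify how much time such updates take, the durations can be pulled out of the master log. The following is a minimal sketch, not part of Mesos: the helper name `parseRegistryUpdateMillis` is hypothetical, and it assumes the duration is always printed in milliseconds with an `ms` suffix, as in the lines quoted above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not a Mesos API): extracts the duration from a
// "Successfully updated the registry in <N>ms" log line.
// Assumption: the duration is printed in milliseconds, as in the quoted logs.
// Returns true and stores the value in `*millis` on a match, false otherwise.
bool parseRegistryUpdateMillis(const std::string& line, double* millis)
{
  static const std::regex pattern(
      R"(Successfully updated the registry in ([0-9]+(?:\.[0-9]+)?)ms)");

  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return false;
  }

  *millis = std::stod(match[1].str());
  return true;
}
```

Feeding each log line through this helper and summing the results gives a rough total of the time spent on registry writes during the failover. Lines such as "Applied 1 operations in ..." are deliberately not matched and would need their own pattern.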
      


      Although these operations do not consume the Master actor's time, every agent reregistration is gated on and delayed by them. This could easily be avoided by checking the slaves.recovered field in Master.
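
The proposed check can be sketched roughly as follows. This is a simplified illustration, not the actual Mesos code: the `Slaves` struct, the `Action` enum, and `onReregister` are hypothetical stand-ins, and only the field names `registered` and `recovered` echo the `slaves.registered` and `slaves.recovered` fields named in this issue.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the master's agent bookkeeping.
struct Slaves
{
  // Agents currently registered with this master.
  std::set<std::string> registered;

  // Agents recovered from the registry after a failover that have not
  // reregistered yet. They were never marked unreachable.
  std::set<std::string> recovered;
};

// Hypothetical outcome of handling a reregistration request.
enum class Action
{
  REREGISTER_KNOWN,      // Already registered: no registry involvement.
  REREGISTER_RECOVERED,  // Recovered after failover: skip the registry.
  CONSULT_REGISTRY       // Neither: the registry has to be consulted.
};

// Decides whether a reregistering agent needs a registry operation.
Action onReregister(const Slaves& slaves, const std::string& agentId)
{
  if (slaves.registered.count(agentId) > 0) {
    return Action::REREGISTER_KNOWN;
  }

  if (slaves.recovered.count(agentId) > 0) {
    return Action::REREGISTER_RECOVERED;
  }

  return Action::CONSULT_REGISTRY;
}
```

The design point is that only agents absent from both sets (for example, ones previously marked unreachable) still go through the registrar, so a failover with few unreachable agents triggers few registry writes.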

        Activity

        xujyan Yan Xu added a comment - https://reviews.apache.org/r/60854/ https://reviews.apache.org/r/60400/ https://reviews.apache.org/r/60898/
        xujyan Yan Xu added a comment -
        commit 95fe8b367a94da6da0a580026519bf07a4f65ec7
        Author: Jiang Yan Xu <xujyan@apple.com>
        Date:   Sun Jul 16 11:10:00 2017 -0700
        
            Added more tests for agent reregistration.
            
            These new tests are specifically for verifying whether the registrar
            is involved when an agent reregisters, depending on whether it has
            been marked unreachable.
            
            Review: https://reviews.apache.org/r/60898
        
        commit ef66225896be26fd4e7b0bb914e2820366613470
        Author: Jiang Yan Xu <xujyan@apple.com>
        Date:   Thu Jun 22 14:01:27 2017 -0700
        
            Skipped consulting registry if the agent is in `slaves.recovered`.
            
            Agents in `slaves.recovered` haven't been marked unreachable and
            would have been in `slaves.registered` if the master has not failed
            over. So this is consistent with how the master in steady state handles
            reregistering agents by checking against `slaves.registered`.
            
            Review: https://reviews.apache.org/r/60400
        
        commit b43ceb8b97e4eca507a699113adcd311a071936f
        Author: Jiang Yan Xu <xujyan@apple.com>
        Date:   Thu Jul 13 15:49:59 2017 -0700
        
            Changed the way tests capture agent state transitioning.
            
            The existing way captures the registrar operation which may not
            occur after MESOS-7711 but regardless of that, capturing the agent
            authorization is equivalent and arguably more straightforward.
            
            Review: https://reviews.apache.org/r/60854
        

          People

          • Assignee:
            xujyan Yan Xu
            Reporter:
            xujyan Yan Xu
            Shepherd:
            James Peach
          • Votes:
            0
            Watchers:
            2
