Slider / SLIDER-1189

Agent never connects to new AM if AM restart takes too long

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Slider 0.92
    • Component/s: agent
    • Labels: None

    Description

      While testing RM and AM failure scenarios, I killed my RM, killed the AM, waited for a bit, then restarted the RM. The AM was restarted, but the running agents never connected to the new AM. When the heartbeat retry threshold is reached, the agent re-reads the AM data from the ZK registry once and then tries to re-register with the AM. However, if the AM data is still stale at that point, the agent never re-reads it from the ZK registry again and keeps retrying registration with the nonexistent AM forever (until the new AM times it out due to heartbeat loss and kills it).

      Note that this happens when the AM restart is delayed by more than about a minute, which can occur if the RM is down, or if the RM is up but busy.
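
The failure mode described above can be illustrated with a small, self-contained sketch. This is not the code from the attached patches or from the Slider agent; the function names (`register_with_single_read`, `register_with_refresh`), the callables passed in, and the addresses are all hypothetical stand-ins. It only contrasts reading the AM address once before retrying with re-reading it before every registration attempt, assuming the registry starts returning the new AM's address after a few lookups.

```python
def register_with_single_read(lookup_am_address, try_register, max_attempts=10):
    """Buggy pattern: the AM address is read once and reused for every retry."""
    address = lookup_am_address()  # may already be stale
    for attempt in range(1, max_attempts + 1):
        if try_register(address):
            return address, attempt
    return None, max_attempts


def register_with_refresh(lookup_am_address, try_register, max_attempts=10):
    """Safer pattern: the AM address is re-read before every attempt."""
    for attempt in range(1, max_attempts + 1):
        address = lookup_am_address()  # picks up the new AM once it is published
        if try_register(address):
            return address, attempt
    return None, max_attempts


def simulate(register):
    """Run one strategy against a fake registry that is stale for three lookups."""
    live_am = "new-am:2048"  # hypothetical address of the restarted AM
    answers = iter(["old-am:1024"] * 3 + [live_am] * 7)
    return register(
        lambda: next(answers),              # stand-in for a ZK registry read
        lambda address: address == live_am  # stand-in for a registration call
    )


if __name__ == "__main__":
    print(simulate(register_with_single_read))
    print(simulate(register_with_refresh))
```

In this simulation the single-read strategy exhausts its attempts against the old address, while the refreshing strategy succeeds on the first attempt after the registry entry is updated. A real agent would also need a delay between attempts, which is omitted here for brevity.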

      Attachments

        1. SLIDER-1189.1.patch
          3 kB
          Billie Rinaldi
        2. SLIDER-1189.2.patch
          4 kB
          Billie Rinaldi
        3. SLIDER-1189.3.patch
          4 kB
          Billie Rinaldi

          People

            Assignee: Billie Rinaldi
            Reporter: Billie Rinaldi
            Votes: 0
            Watchers: 3
