SLIDER-1189: Agent never connects to new AM if AM restart takes too long

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Slider 0.92
    • Component/s: agent
    • Labels: None

      Description

      In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited for a bit, then restarted the RM. The AM was restarted, but the running agents never connected to the new AM. When the heartbeat retry threshold is reached, the agent re-reads the AM data from the ZK registry once and then tries to re-register with the AM. However, if the AM data is still stale at that point, the agent never re-reads it from the ZK registry again, and it retries registering with the nonexistent AM forever (or rather, until the new AM times it out due to heartbeat loss and kills it).

      Note that this happens when the AM restart is delayed by more than about a minute, which can occur if the RM is down, or if the RM is up but busy.
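For illustration only, the behavior described above suggests a retry loop that looks up the AM address again before every registration attempt, instead of reading it once and reusing a possibly stale value. The sketch below is not taken from the attached patches or from the Slider agent code. The names `register_with_refresh`, `lookup_am`, and `try_register` are hypothetical stand-ins for "read the AM data from the ZK registry" and "attempt to register with that AM".

```python
import time
from typing import Callable, Optional


def register_with_refresh(
    lookup_am: Callable[[], Optional[str]],
    try_register: Callable[[str], bool],
    max_attempts: int = 10,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the AM address that accepted registration, or None.

    Hypothetical helper: lookup_am and try_register are assumed callbacks,
    not real Slider agent APIs.
    """
    for attempt in range(max_attempts):
        # Re-read the AM address on every attempt so that a stale entry
        # is replaced as soon as the new AM publishes its data.
        am_address = lookup_am()
        if am_address is not None and try_register(am_address):
            return am_address
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(retry_delay)
    return None
```

With this shape, an agent that first sees the old AM address keeps failing only until the registry holds the new address, at which point the next attempt succeeds. A bounded `max_attempts` is one possible design choice here; an unbounded loop with the same refresh-per-attempt structure would also avoid the stale-address problem.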

        Attachments

        1. SLIDER-1189.1.patch
          3 kB
          Billie Rinaldi
        2. SLIDER-1189.2.patch
          4 kB
          Billie Rinaldi
        3. SLIDER-1189.3.patch
          4 kB
          Billie Rinaldi

            People

            • Assignee: Billie Rinaldi
            • Reporter: Billie Rinaldi
            • Votes: 0
            • Watchers: 3
