Accumulo / ACCUMULO-2112

master does not balance after intermittent communication failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.5.0, 1.5.1
    • Fix Version/s: 1.4.5, 1.5.1, 1.6.0
    • Component/s: master
    • Labels:
      None

      Description

      The master had a momentary connection timeout error while collecting stats from a single tablet server. Because the connection was re-established on the next attempt, the master did not remove that server from the bad servers list. Because the bad servers list was not cleared, the master did not rebalance.
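
      The behavior described above can be sketched with a small model. This is only an illustration under stated assumptions, not Accumulo's actual code: the class `BadServerTracker`, its methods, and the `clearOnSuccess` flag are hypothetical names invented here. It assumes the master skips balancing while any server is on the bad servers list, and it contrasts never clearing a server after a later success with clearing any server that is communicating again.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the master's bookkeeping; these are not Accumulo classes.
class BadServerTracker {
    private final Set<String> badServers = new HashSet<>();
    // false models the reported behavior, true models the idea behind the fix.
    private final boolean clearOnSuccess;

    BadServerTracker(boolean clearOnSuccess) {
        this.clearOnSuccess = clearOnSuccess;
    }

    // Record the outcome of one status-collection attempt for a server.
    void recordStatusAttempt(String server, boolean succeeded) {
        if (!succeeded) {
            badServers.add(server);
        } else if (clearOnSuccess) {
            // A server that is communicating again is no longer treated as bad.
            badServers.remove(server);
        }
    }

    // Assumption: balancing is skipped while any server is considered bad.
    boolean canBalance() {
        return badServers.isEmpty();
    }
}
```

      In this model, one failed attempt followed by a successful one leaves `canBalance()` false when `clearOnSuccess` is false, and true when it is set.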

          Activity

          ASF subversion and git services added a comment -

          Commit f56ae10b3e72e6d03fa6324afcd23619ea94b7b9 in branch refs/heads/1.4.5-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=f56ae10 ]

          ACCUMULO-2112 clear the bad server list of any server that is communicating

          ASF subversion and git services added a comment -

          Commit f56ae10b3e72e6d03fa6324afcd23619ea94b7b9 in branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=f56ae10 ]

          ACCUMULO-2112 clear the bad server list of any server that is communicating

          ASF subversion and git services added a comment -

          Commit f56ae10b3e72e6d03fa6324afcd23619ea94b7b9 in branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=f56ae10 ]

          ACCUMULO-2112 clear the bad server list of any server that is communicating

          ASF subversion and git services added a comment -

          Commit f56ae10b3e72e6d03fa6324afcd23619ea94b7b9 in branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=f56ae10 ]

          ACCUMULO-2112 clear the bad server list of any server that is communicating

          Michael Wall added a comment -

          This issue showed up in the master logs as "ERROR: unable to get tablet server status". The tservers appeared to lose connection for a brief time, less than 30 seconds, but then started communicating again. The server would then show up by IP in the list of Unresponsive servers and by hostname in the Tablet Servers list when looking at the Tablet Server page of the monitor.

          I can verify that applying this one-line fix to the 1.4.4 tag removes the server from the list of unresponsive servers, and that balancing begins again once there are no unresponsive servers.

          The "unable to get server status" message should still show up in the master logs. Maybe it is actually meaningful.


            People

            • Assignee: Eric Newton
            • Reporter: Eric Newton
            • Votes: 0
            • Watchers: 3
