Accumulo / ACCUMULO-4353

Stabilize tablet assignment during transient failure

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels: None

      Description

      When a tablet server dies, Accumulo attempts to reassign the tablets it was hosting as quickly as possible to maintain availability. If multiple tablet servers die in quick succession, such as from a rolling restart of the Accumulo cluster or a network partition, this behavior can cause a storm of reassignment and rebalancing, placing significant load on the master.

      To avert such load, Accumulo should be capable of maintaining a steady tablet assignment state in the face of transient tablet server loss. Instead of reassigning tablets as quickly as possible, Accumulo should await the return of a temporarily downed tablet server (for some configurable duration) before assigning its tablets to other tablet servers.
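
      To make the intent concrete, below is a minimal sketch of the proposed master-side logic. It is an illustration only, not the actual Master code: the Tablet class, sweep() method, and SUSPEND_MILLIS constant are all hypothetical names.

          import java.util.Iterator;
          import java.util.List;
          import java.util.Set;

          // Minimal sketch of the proposed "suspend before reassigning" logic.
          // NOT the actual Master code; all names here are hypothetical.
          class SuspensionSketch {

              static final long SUSPEND_MILLIS = 3 * 60 * 1000L; // configurable window

              static class Tablet {
                  final String extent;      // tablet identifier
                  final String lastServer;  // server hosting it when it died
                  final long suspendedAt;   // when the master noticed the loss

                  Tablet(String extent, String lastServer, long suspendedAt) {
                      this.extent = extent;
                      this.lastServer = lastServer;
                      this.suspendedAt = suspendedAt;
                  }
              }

              /** Periodic sweep over suspended tablets: restore or reassign. */
              static void sweep(List<Tablet> suspended, Set<String> liveServers, long now) {
                  for (Iterator<Tablet> it = suspended.iterator(); it.hasNext();) {
                      Tablet t = it.next();
                      if (liveServers.contains(t.lastServer)) {
                          // Server returned within the window: restore the old
                          // assignment and skip rebalancing entirely.
                          assign(t, t.lastServer);
                          it.remove();
                      } else if (now - t.suspendedAt > SUSPEND_MILLIS) {
                          // Window expired: fall back to normal reassignment.
                          assign(t, pickServer(liveServers));
                          it.remove();
                      }
                      // Otherwise keep waiting; the tablet stays offline for now.
                  }
              }

              static void assign(Tablet t, String server) { /* send a load request */ }

              static String pickServer(Set<String> liveServers) {
                  return liveServers.iterator().next(); // stand-in for the balancer
              }
          }

      The tradeoff is simply a bounded window of unavailability in exchange for avoiding two rounds of migration (away from, and back to, the revived server).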

          Activity

          elserj Josh Elser added a comment -

          If multiple tablet servers die in quick succession, such as from a rolling restart of the Accumulo cluster or a network partition, this behavior can cause a storm of reassignment and rebalancing, placing significant load on the master.

          In the case of a rolling-restart, isn't that more of an operational issue to solve (give Accumulo time to process the reassignments before restarting more nodes)?

          Can you expand a bit more on the specifics behind a network partition that you're trying to solve with this? Say, you lose a node? A rack? Half of your racks? Are we talking about a 5second interruption/delay? 30s? Minutes?

          To avert such load, Accumulo should be capable of maintaining a steady tablet assignment state in the face of transient tablet server loss. Instead of reassigning tablets as quickly as possible, Accumulo should await the return of a temporarily downed tablet server (for some configurable duration) before assigning its tablets to other tablet servers.

          I'm a little worried about this as a configuration knob – I feel like it kind of goes against the highly-available distributed database which we expect Accumulo to be. When we don't reassign tablets fast, that is a direct lack of availability for clients to read data.

          placing significant load on the master

          Can you expand on this some more? Given that assignment is arguably the most important thing for the Master to do, why are we concerned about letting the master do that as fast as it can (for the aforementioned reason)? Do we need to come up with a more efficient way for the master to handle the reassignment of many tablets? For example, I know that HBase has special logic to batch assignments together in one RPC call to regionservers in order to bring things online more quickly (I'm not sure if we have logic like that – instead we just spam requests to a tabletserver to load a tablet until it does so).
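
          As a rough illustration of that batching idea (every name below is hypothetical; none of this is a real Accumulo or HBase API):

              import java.util.ArrayList;
              import java.util.HashMap;
              import java.util.List;
              import java.util.Map;

              // Rough illustration of the batching idea: group every tablet
              // destined for the same tserver into one RPC instead of one
              // request per tablet. Nothing here is a real Accumulo API.
              class BatchedAssignmentSketch {

                  interface TServerRpc {
                      void loadTablets(String server, List<String> tablets); // hypothetical
                  }

                  static void assignAll(Map<String, String> tabletToServer, TServerRpc rpc) {
                      Map<String, List<String>> byServer = new HashMap<>();
                      for (Map.Entry<String, String> e : tabletToServer.entrySet()) {
                          byServer.computeIfAbsent(e.getValue(), s -> new ArrayList<>())
                                  .add(e.getKey());
                      }
                      // One round trip per server rather than one per tablet.
                      for (Map.Entry<String, List<String>> e : byServer.entrySet()) {
                          rpc.loadTablets(e.getKey(), e.getValue());
                      }
                  }
              }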

          While I see a pull request accompanying the reported issue, it seems a bit premature to me to see code without some discussion of what the problems are and how best to solve them.

          phrocker marco polo added a comment - edited

          Are you attempting to design a mechanism that could be used to avoid re-balancing and have the master keep assignments where they were previously, knowing that servers will come back into operation?

          I only ask because I question why the load on the master is a problem. You will cause load since clients will persist for that length of time. Won't you increase the number of Thrift connections waiting, since you may not rebalance for some time?

          dlmarion Dave Marion added a comment -

          Can you expand on this some more? Given that assignment is arguably the most important thing for the Master to do, why are we concerned about letting the master do that as fast as it can (for the aforementioned reason)? Do we need to come up with a more efficient way for the master to handle the reassignment of many tablets?

          Reading through this, and bringing some first-hand experience, I don't think the issue is the Master assigning tablets. It's the issue of tablet servers that are down for a short period of time. When a tserver goes down, the Master re-assigns the tablets. When the tserver comes back up, it goes through several rounds of balancing, which could take a long time and cause a lot of churn.

          I'm a little worried about this as a configuration knob – I feel like it kind of goes against the highly-available distributed database which we expect Accumulo to be. When we don't reassign tablets fast, that is a direct lack of availability for clients to read data.

          I don't see any harm done here as long as the default behavior is what happens today. Allowing an administrator to choose to delay tablet reassignment may not fit most use cases, but it could fit some.

          My 2 cents.

          elserj Josh Elser added a comment -

          I don't see any harm done here as long as the default behavior is what happens today. Allowing an administrator to choose to delay tablet reassignment may not fit most use cases, but it could fit some.

          You're right that there isn't any harm to existing users, but there's always a concern about technical debt (in terms of complexity) and architectural goals. I'm more worried about where we're going than about this change causing harm to an existing user.

          If this is really about trying to make rolling-restarts better, I'd encourage a look at ACCUMULO-1454. Keith Turner and I (and others) had a very lengthy discussion on the subject, but never sat down to work on an implementation.

          dlmarion Dave Marion added a comment -

          there's always a concern about technical debt (in terms of complexity)

          Fair enough.

          If this is really about trying to make rolling-restarts better...

          Not sure, I think it has to do with quick unplanned restarts, but it would be good to clear that up. I see a rolling restart being an intentional, planned activity. I think (based on "failure" in the title) this is for the unintentional, unplanned short duration outage (e.g. losing connectivity to a rack for a short time) where the administrator wants to bring the failed tablet servers up as soon as possible.

          ShawnWalker Shawn Walker added a comment - edited

          I did have rolling restarts as a primary motivation for doing this work, though a few other scenarios did come to mind as potential applications:

          • tserver loses lock (possibly due to load), dies, and is restarted quickly via some external infrastructure, e.g. Puppet
          • temporary network connectivity loss

          I was thinking a `table.suspend.duration` on the order of 2-3 minutes might make sense for general purposes in a large cluster: long enough to catch most truly transient problems, yet short enough that many applications wouldn't be unduly impacted, particularly since any application already has to deal with a ~30 second wait before the master even notices that a tablet server is gone. After all, Accumulo is ultimately a consistent+partition-tolerant database, not an available+partition-tolerant database. If availability is a user's top priority, other databases (e.g. Apache Cassandra) offer tradeoffs in that direction.
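
          For illustration, such a property could be set per table through the Java client API (a sketch assuming the 1.8-era Connector API; the table name and the 3-minute value are examples, not recommendations):

              import org.apache.accumulo.core.client.AccumuloException;
              import org.apache.accumulo.core.client.AccumuloSecurityException;
              import org.apache.accumulo.core.client.Connector;

              // Sketch only: sets the suspension window on one table.
              public class EnableSuspension {
                  public static void enable(Connector conn)
                          throws AccumuloException, AccumuloSecurityException {
                      // With this set, the master would hold the table's tablets
                      // unassigned for up to 3 minutes after their server dies,
                      // instead of reassigning them immediately.
                      conn.tableOperations().setProperty("mytable", "table.suspend.duration", "3m");
                  }
              }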

          I hadn't seen ACCUMULO-1454; I'll take a closer look in the morning. One concern that I had with some rolling-restart ideas was a matter of ops complexity. In my (admittedly limited) experience, orchestrating a rolling restart that needs to do much more than "kill daemon, restart daemon" over a large cluster can be a huge headache.

          ShawnWalker Shawn Walker added a comment -

          It's not so much the cost of performing assignments that worries me as the cost of rebalancing when a tserver comes back. But these concerns aren't independent: by blocking premature reassignment (at some cost to availability), the cost of rebalancing is saved, particularly in something like a rolling restart, where we can expect that a tserver's death will be short-lived and that the tserver will not die again after it revives.

          ShawnWalker Shawn Walker added a comment - edited

          Are you attempting to design a mechanism that could be used to avoid re-balancing and have the master keep assignments where they were previously, knowing that servers will come back into operation?

          That is the idea, yes. There is definitely a tradeoff here.

          If this is really about trying to make rolling-restarts better, I'd encourage a look at ACCUMULO-1454.

          As I mentioned before, I hadn't seen ACCUMULO-1454 before starting this. I've now looked at the discussion on that ticket. What I implemented was approximately what Christopher Tubbs and David Medinets were suggesting. I also read through Keith Turner's design proposal summary. I have some reservations about it:

          • It requires that each planned restart involve tablet servers changing ports. While the recent changes to Accumulo to support a narrow port range during port search would make this more plausible, it might still prove difficult to establish firewall rules for Accumulo. (Sean Busbey raises this issue in the discussion.)
          • What happens if a tablet is split after migration starts? It seems to me there might be a race condition here which would lead to incomplete migration between sibling tablet servers. Do we block assignment during the rolling restart, too? That seems like a cure worse than the problem.
          • Even barring those two concerns, I again raise the spectre of ops complexity. To transition a single server, I need to know (a) which port the "old" tserver was running on, and (b) which port the "new" tserver is running on. If I'm using some sort of dynamic port assignment (which I would need to unless I pointed the "new" tserver at an entirely different configuration), it could be non-trivial to gather these pieces of information. While the burden on the operator of a cluster of 5 tservers might not be significant, the burden on the operator of a cluster of 200 tservers might make this approach infeasible. And the non-triviality of determining the correct port migration mapping would also make the process difficult to robustly automate.

          While I see a pull request accompanying the reported issue, it seems a bit premature to me to see code without some discussion of what the problems are and how best to solve them.

          Ahh, my mistake then. As a new contributor to Accumulo, I still don't have a full grasp of the rules, either written or unwritten. My feeling from watching the list was that the primary modus operandi was to present a (fully implemented) solution along with a proposed problem, and then to discuss the merits of the solution.

          elserj Josh Elser added a comment -

          Ahh, my mistake then. As a new contributor to Accumulo, I still don't have a full grasp of the rules, either written or unwritten. My feeling from watching the list was that the primary modus operandi was to present a (fully implemented) solution along with a proposed problem, and then to discuss the merits of the solution.

          In general at Apache, it's best to discuss a large/intrusive change to the codebase before you spend time writing code (in case there is consensus on an approach different than what you thought best). It goes back to the community-first aspect. It's fine that you provided code right away (I don't mean to be scolding); I just don't want to see you waste a week writing code only for the community to reach consensus on a different direction than the one you took.

          kturner Keith Turner added a comment -

          I am in favor of what Shawn has done in his patch over what was proposed in ACCUMULO-1454. I think what's in this patch is much easier to use. One thing I was trying to accomplish in the proposal for ACCUMULO-1454 was to minimize downtime by allowing a new tserver to be started on a node and everything migrated before killing the old one. That may minimize downtime, but I think it would be very complex to actually do in practice, as Shawn pointed out.

          kturner Keith Turner added a comment -

          This is good advice. I wouldn't limit it to the case where something may already have been agreed upon; it's just good in general to leverage the community to create better solutions. However, it can sometimes be tricky to know when and how often to engage. Given how busy everyone is, ideally one would minimize the amount of the community's time used while maximizing the benefit the community brings to reaching a solution. Of course, it's impossible to optimize this perfectly.

          mjwall Michael Wall added a comment -

          The PR is closed and the code is merged in. I am closing this ticket. Thanks, Shawn Walker.


    People

    • Assignee: ShawnWalker Shawn Walker
    • Reporter: ShawnWalker Shawn Walker
    • Votes: 0
    • Watchers: 7


    Time Tracking

    • Estimated: Not Specified
    • Remaining: 0h
    • Logged: 7h 50m
