HBASE-2238

Review all transitions -- compactions, splits, region opens, log roll/splitting -- for crash-proofness and atomicity

    Details

    • Type: Umbrella
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This issue is about reviewing state transitions in HBase to ensure we're sufficiently hardened against crashes. I see this as an umbrella issue under which we'd look at compactions, splits, log splits, region opens – what else is there? We'd look at each in turn to see how we survive a crash at any time during the transition. For example, we think compactions are idempotent, but we need to prove it. Splits for sure are not, at least not at the moment (witness disabled parents with daughters missing, or only one of them available).

      Part of this issue would be writing tests that aim to break transitions.

      In light of the above, here is a recent off-list note from Todd Lipcon (and "another"):

      I thought a bit more last night about the discussion we were having
      regarding various HBase components doing operations on the HDFS data,
      and ensuring that in various racy scenarios we don't have two
      region servers or masters overlapping.
      
      I came to the conclusion that ZK data can't be used to actually have
      effective locks on HDFS directories, since we can never know that we
      still have a ZK lock when we do an operation. Thus the operations
      themselves have to be idempotent, or recoverable in the case of
      multiple nodes trying to do the same thing. Or, we have to use HDFS
      itself as a locking mechanism - this is what we discussed using write
      leases essentially as locks.
      
      Since I didn't really trust myself, I ran my thoughts by "Another"
      and he concurs (see
      below). Figured this is food for thought for designing HBase data
      management to be completely safe/correct.
      
      ...
      
      ---------- Forwarded message ----------
      From: Another <another@XXXXXX.com>
      Date: Wed, Feb 17, 2010 at 10:50 AM
      Subject: locks
      To: Todd Lipcon <todd@XXXXXXX.com>
      
      
      Short answer is no, you're right.
      Because HDFS and ZK are partitioned (in the sense that there's no
      communication between them) and there may be an unknown delay between
      acquiring the lock and performing the operation on HDFS, you have no
      way of knowing that you still own the lock, like you say.
      If the lock cannot be revoked while you have it (no timeouts) then you
      can atomically check that you still have the lock and do the operation
      on HDFS, because checking is a no-op. Designing a system with no lock
      revocation in the face of failures is an exercise for the reader :)
      The right way is for HDFS and ZK to communicate to construct an atomic
      operation. ZK could give a token to the client which it also gives to
      HDFS, and HDFS uses that token to do admission control. There's
      probably some neat theorem about causality and the impossibility of
      doing distributed locking without a sufficiently strong atomic
      primitive here.
      
      Another
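
      A minimal in-process sketch of the token idea above, assuming
      hypothetical TokenIssuer and FencedStore classes standing in for ZK
      and HDFS (neither actually exposes such an API today):

          import java.util.concurrent.atomic.AtomicLong;

          // Stands in for ZK: granting the "lock" also mints a
          // monotonically increasing token (an epoch number).
          class TokenIssuer {
              private final AtomicLong epoch = new AtomicLong();
              long acquireLock() {
                  return epoch.incrementAndGet();
              }
          }

          // Stands in for HDFS: admission control by token. Once a newer
          // holder has written, any client still carrying an older token
          // is fenced out, even if it believes it still holds the lock.
          class FencedStore {
              private long highestSeen = Long.MIN_VALUE;
              synchronized void write(long token, byte[] data) {
                  if (token < highestSeen) {
                      throw new IllegalStateException("stale token " + token);
                  }
                  highestSeen = token;
                  // ... apply the mutation ...
              }
          }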
      


          Activity

          Todd Lipcon added a comment -

          I think the general pattern that all transitions need to follow is:

          1) HLog that the RS intends to do some operation
          2) Perform the operation in a way that is still undoable (e.g. create the compacted HFile but don't yet remove the old ones)
          3) HLog that the RS has finished the action
          4) Clean up from part 2 (e.g. remove the pre-compaction HFiles)

          We assume:

          • Whenever a RS has failed, the master will open its HLog for append.
          • This steals the write lease and increases the generation stamp on its last block.
          • Thus the next time the RS attempts to hflush(), it will receive an IOException (I think a LeaseExpiredException to be specific?)
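
          For illustration, a hedged sketch of that fencing step: on Hadoop
          versions that expose DistributedFileSystem.recoverLease(Path), the
          master could steal the lease explicitly (retry policy and error
          handling elided; fenceLogOfDeadRS is a made-up name, not an HBase
          method):

              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.hdfs.DistributedFileSystem;

              class LogFencer {
                  // Forcibly recover the lease on the dead RS's HLog.
                  // recoverLease() is asynchronous, so poll until it reports
                  // success; from then on the old writer's generation stamp
                  // is stale and its next hflush() fails with an IOException.
                  static void fenceLogOfDeadRS(DistributedFileSystem dfs,
                                               Path hlog) throws Exception {
                      while (!dfs.recoverLease(hlog)) {
                          Thread.sleep(1000);
                      }
                  }
              }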

          Failure cases at each step:

          Fail before 1) no problem, data isn't touched
          Fail after 1 but before 3) the transition is in an indeterminate state. When the master recovers, it can roll back to the pre-transition state
          Fail after 3) when the master recovers, it can complete the "cleanup" transition for the regionserver (even if the regionserver got halfway through cleanup)

          This pattern relies on cleanup being idempotent, and state transitions being undoable.

          The above examples are for the compaction case, but I think the same general ideas apply elsewhere.
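
          A minimal sketch of the four steps applied to the compaction
          example; TransactionLog and the method names below are
          illustrative stand-ins, not HBase APIs:

              import java.io.IOException;
              import java.nio.file.Files;
              import java.nio.file.Path;
              import java.util.List;

              interface TransactionLog {
                  // assumed durable (hflush()ed) before returning
                  void append(String event, Object detail) throws IOException;
              }

              class CompactionTransition {
                  void run(TransactionLog hlog, List<Path> oldFiles)
                          throws IOException {
                      hlog.append("COMPACTION_START", oldFiles);   // 1) log intent
                      Path merged = writeCompactedFile(oldFiles);  // 2) undoable: old
                                                                   //    files untouched
                      hlog.append("COMPACTION_COMMIT", merged);    // 3) commit point
                      for (Path p : oldFiles) {                    // 4) idempotent cleanup;
                          Files.deleteIfExists(p);                 //    safe to redo on crash
                      }
                  }

                  private Path writeCompactedFile(List<Path> inputs)
                          throws IOException {
                      // placeholder: merge inputs into a new file without
                      // modifying them
                      return Files.createTempFile("compacted", ".hfile");
                  }
              }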

          Todd Lipcon added a comment -

          Lastly, as an optimization, we can add a step 5 on the regionserver which is "log that the state transition is entirely complete". Thus the master knows it doesn't have to do anything with regard to this transition.

          For discussion, it may be worth giving some terminology to the phases. It seems to me we have prepare (enters the "will roll back" state), then commit (enters the "will roll forward" state), then complete (ends the state machine, no action necessary).
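
          A sketch of that terminology as a recovery state machine; the
          names are illustrative, not existing HBase types:

              enum TransitionPhase {
                  PREPARED,   // intent logged: on crash, roll back
                  COMMITTED,  // completion logged: on crash, roll forward
                  COMPLETE    // step 5 logged: master need do nothing
              }

              class TransitionRecovery {
                  void recover(TransitionPhase lastLogged) {
                      switch (lastLogged) {
                          case PREPARED:  rollBack();    break;
                          case COMMITTED: rollForward(); break;
                          case COMPLETE:                 break;
                      }
                  }
                  private void rollBack()    { /* undo the partial operation */ }
                  private void rollForward() { /* redo the idempotent cleanup */ }
              }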

          Jean-Daniel Cryans added a comment -

          I ran into a weird situation today running TestReplication (from HBASE-2223's latest patch up on rb): the test kills a region server by expiring its session, and then the following happened (almost all at the same time):

          1. Master lists all hlogs to split (total of 12)
          2. RS does a log roll
          3. RS tries to register the new log in ZK for replication and fails because the session was expired, but the log is already rolled
          4. RS takes 3 more edits into the new log
          5. RS cleans 6 of the 13 logs
          6. Master fails at splitting the 3rd log it listed, which delays the log splitting process
          7. Master tries again to split logs, lists 7 of them and is successful

          In the end, the master wasn't missing any edits (because log splitting failed and picked up the new log the second time), but the slave cluster was missing 3. This makes me think that the region server should do a better job of handling KeeperException.SessionExpiredException, because we currently don't handle it at all.
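
          One possible shape for that handling, sketched with the standard
          ZooKeeper Watcher API (abortRegionServer() is a hypothetical hook,
          not an existing method):

              import org.apache.zookeeper.WatchedEvent;
              import org.apache.zookeeper.Watcher;

              class SessionExpiryWatcher implements Watcher {
                  @Override
                  public void process(WatchedEvent event) {
                      if (event.getState() == Event.KeeperState.Expired) {
                          // The master has already begun splitting our logs;
                          // rolling or taking further edits risks writes the
                          // master will never see.
                          abortRegionServer("ZooKeeper session expired");
                      }
                  }
                  private void abortRegionServer(String reason) {
                      // hypothetical: stop taking edits, close the WAL, exit
                  }
              }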

          Jean-Daniel Cryans added a comment -

          I'm upgrading this to blocker for 0.21: any GC pause that gets a RS's session expired, after which the RS wakes up, rolls its log, and still takes some edits, can result in data loss.

          Jonathan Gray added a comment -

          Over in HBASE-2696 we now have one place where we handle ZK Expired and Disconnected events. We need to figure out how to handle a Disconnect, especially on the RS side, which relates to much of the discussion here.

          stack added a comment -

          Made this an umbrella issue, critical rather than blocker, and moved it out of 0.90.

          The reason to take down the priority is that perhaps half of the issues raised in this umbrella have been addressed elsewhere (review of splits and log splitting). The other items have also had some work done (region open), but there is more to do, so I will leave the issue open while moving it out of 0.90, since we think we can release without completing it (I'm in a room w/ J-D and Ryan going over 0.90 issues and this is what we think).


            People

            • Assignee: Unassigned
            • Reporter: stack
            • Votes: 0
            • Watchers: 10
