HBase
  1. HBase
  2. HBASE-4335

Splits can create temporary holes in .META. that confuse clients and regionservers

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.92.0
    • Component/s: regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When a SplitTransaction is performed, three updates are done to .META.:
      1. The parent region is marked as splitting (and hence offline)
      2. The first daughter region is added (same start key as parent)
      3. The second daughter region is added (split key is start key)
      (later, the original parent region is deleted, but that's not important to this discussion)

      Steps 2 and 3 are actually done concurrently by SplitTransaction.DaughterOpener threads. While the master is notified when a split is complete, the only visibility that clients have is whether the daughter regions have appeared in .META.

      If the second daughter is added to .META. first, then .META. will contain the (offline) parent region followed by the second daughter region. If the client looks up a key that is greater than (or equal to) the split, the client will find the second daughter region and use it. If the key is less than the split key, the client will find the parent region and see that it is offline, triggering a retry.

      If the first daughter is added to .META. before the second daughter, there is a window during which .META. has a hole: the first daughter effectively hides the parent region (same start key), but there is no entry for the second daughter. A region lookup will find the first daughter for all keys in the parent's range, but the first daughter does not include keys at or beyond the split key.

      See HBASE-4333 and HBASE-4334 for details on how this causes problems and suggestions for mitigating this in the client and regionserver.

      1. 4335.txt
        4 kB
        Lars Hofhansl
      2. 4335-v2.txt
        4 kB
        Lars Hofhansl
      3. 4335-v3.txt
        15 kB
        Lars Hofhansl
      4. 4335-v4.txt
        14 kB
        Lars Hofhansl
      5. 4335-v5.txt
        15 kB
        Lars Hofhansl

        Issue Links

          Activity

          Hide
          Joe Pallas added a comment -

          Replacing getClosestRowBefore would have implications on how to correctly handle temporary holes created during splitting.

          Show
          Joe Pallas added a comment - Replacing getClosestRowBefore would have implications on how to correctly handle temporary holes created during splitting.
          Hide
          Lars Hofhansl added a comment -

          To restate the problem:
          If the first daughter is added to .META. first, any key lookup for a key >= splitKey would incorrectly return the first daughter region.
          This seems like a legitimate, if rare, problem.

          Checking the end key would work for all point operation (put,get,delete,icv,cap,etc), and most already do that (except GET as you state in HBASE-4334). I don't think scans do that either, and not sure how it would work for scans. Hmm... Seems like it could work, and scans are serial and start the next region with the last value from the previous region, so if the startKey was checked we would catch this and could do a retry.

          I think it is better to avoid holes, though, overlap between active and offline-split regions seem fine.

          So what about splitting up the splitting process? The daughters are added to .META. in postOpenDeployTasks. That does not contain any long running operations.
          What if we removed that from the DaughterOpener threads and call it synchronous in the right order? (And also need to add the special stopped || stopping case).

          Show
          Lars Hofhansl added a comment - To restate the problem: If the first daughter is added to .META. first, any key lookup for a key >= splitKey would incorrectly return the first daughter region. This seems like a legitimate, if rare, problem. Checking the end key would work for all point operation (put,get,delete,icv,cap,etc), and most already do that (except GET as you state in HBASE-4334 ). I don't think scans do that either, and not sure how it would work for scans. Hmm... Seems like it could work, and scans are serial and start the next region with the last value from the previous region, so if the startKey was checked we would catch this and could do a retry. I think it is better to avoid holes, though, overlap between active and offline-split regions seem fine. So what about splitting up the splitting process? The daughters are added to .META. in postOpenDeployTasks. That does not contain any long running operations. What if we removed that from the DaughterOpener threads and call it synchronous in the right order? (And also need to add the special stopped || stopping case).
          Hide
          Lars Hofhansl added a comment -

          Something like this.

          Comments?

          Show
          Lars Hofhansl added a comment - Something like this. Comments?
          Hide
          Lars Hofhansl added a comment -

          Note: Obviously the change of order in which the threads are started/joined makes no difference. Just thought that might help with readability.

          Show
          Lars Hofhansl added a comment - Note: Obviously the change of order in which the threads are started/joined makes no difference. Just thought that might help with readability.
          Hide
          Joe Pallas added a comment -

          That looks like the sort of solution I envisioned. I am not sure that changing the order of starting/joining the threads actually helps; I think it may just be a distraction for the reader.

          Show
          Joe Pallas added a comment - That looks like the sort of solution I envisioned. I am not sure that changing the order of starting/joining the threads actually helps; I think it may just be a distraction for the reader.
          Hide
          Lars Hofhansl added a comment -

          Slightly different patch. Restored order to thread execution, one more comment

          TestSplitTransaction still passes.

          If something can go wrong in a distributed system it will, so we should fix this.

          Show
          Lars Hofhansl added a comment - Slightly different patch. Restored order to thread execution, one more comment TestSplitTransaction still passes. If something can go wrong in a distributed system it will, so we should fix this.
          Hide
          stack added a comment -

          Patch looks good to me. Shall I try make a test?

          This is not you:

          +    if (stopped || stopping) {
          +      // add 2nd daughter first (see HBASE-4335)
          +      MetaEditor.addDaughter(server.getCatalogTracker(),
          +          b.getRegionInfo(), null);
          

          ... but I'm trying to recall why this a good idea. We may be stopping because cluster is going down and .META. usually is last to go down but thats not guaranteed.

          We could put in this patch as is and address this other item in another issue.

          Show
          stack added a comment - Patch looks good to me. Shall I try make a test? This is not you: + if (stopped || stopping) { + // add 2nd daughter first (see HBASE-4335) + MetaEditor.addDaughter(server.getCatalogTracker(), + b.getRegionInfo(), null ); ... but I'm trying to recall why this a good idea. We may be stopping because cluster is going down and .META. usually is last to go down but thats not guaranteed. We could put in this patch as is and address this other item in another issue.
          Hide
          Ted Yu added a comment -

          Do all unit tests pass ?

          Show
          Ted Yu added a comment - Do all unit tests pass ?
          Hide
          Lars Hofhansl added a comment -

          @Stack... I was wondering about that too. It is checked in the beginning of execute and then was rechecked inside DautherOpener, but without any additional locks held (so not sure what additional guarantees we get.)

          Does it have to do with the order w.r.t. making zookeeper changes? When DaughterOpener is run we already made the zk changes, and that might be reason that now we have to go ahead and also write to .META. even if the RegionServer was stopped or is stopping... not sure...

          To be safe I moved it out and kept it.

          @Ted... Let me find out

          Show
          Lars Hofhansl added a comment - @Stack... I was wondering about that too. It is checked in the beginning of execute and then was rechecked inside DautherOpener, but without any additional locks held (so not sure what additional guarantees we get.) Does it have to do with the order w.r.t. making zookeeper changes? When DaughterOpener is run we already made the zk changes, and that might be reason that now we have to go ahead and also write to .META. even if the RegionServer was stopped or is stopping... not sure... To be safe I moved it out and kept it. @Ted... Let me find out
          Hide
          Lars Hofhansl added a comment -

          @Stack... Not sure how you can test the actual problem unless you either (1) instrument the code to introduce an artificial delay triggered by test code or (2) artificially create holes in meta. Neither are good options IMHO.

          Maybe a probabilistic test that just splits a table and calls get simultaneously.

          Show
          Lars Hofhansl added a comment - @Stack... Not sure how you can test the actual problem unless you either (1) instrument the code to introduce an artificial delay triggered by test code or (2) artificially create holes in meta. Neither are good options IMHO. Maybe a probabilistic test that just splits a table and calls get simultaneously.
          Hide
          Joe Pallas added a comment -

          @Lars: My experience with testing without an artificial delay suggests it is not a good option if you want to see the test actually fail without the fix. It just doesn’t happen frequently, although it does happen.

          If I were working on this code, I would consider improving unit testability by making MetaEditor into an interface that can be mocked. I realize that “If I were working on this code …” is not very helpful, but I don’t have permission from my employer to make any code contributions at the moment .

          Show
          Joe Pallas added a comment - @Lars: My experience with testing without an artificial delay suggests it is not a good option if you want to see the test actually fail without the fix. It just doesn’t happen frequently, although it does happen. If I were working on this code, I would consider improving unit testability by making MetaEditor into an interface that can be mocked. I realize that “If I were working on this code …” is not very helpful, but I don’t have permission from my employer to make any code contributions at the moment .
          Hide
          Lars Hofhansl added a comment -

          @Joe... Understood. And you are right about mocking. In this case it also makes sense to break execute into three parts: (1) the setup before opening the region (2) opening the regions (3) the post open stuff. A test can then call 1 and 3 and mock 2 (for example by using the DaughterOpeners serially.
          @Stack... Did you want to work on the test or should I?

          Show
          Lars Hofhansl added a comment - @Joe... Understood. And you are right about mocking. In this case it also makes sense to break execute into three parts: (1) the setup before opening the region (2) opening the regions (3) the post open stuff. A test can then call 1 and 3 and mock 2 (for example by using the DaughterOpeners serially. @Stack... Did you want to work on the test or should I?
          Hide
          Lars Hofhansl added a comment -

          Ran all test. There are three failures locally (TestHTablePool, TestDistributedLogSplitting, TestAdmin), and they fail with or without my change.

          Show
          Lars Hofhansl added a comment - Ran all test. There are three failures locally (TestHTablePool, TestDistributedLogSplitting, TestAdmin), and they fail with or without my change.
          Hide
          Lars Hofhansl added a comment -

          I think when this is fixed we can mark HBASE-4333 and HBASE-4334 as dups.
          @Joe, do you agree? Or do you think there is still merit in fixing the specific issues mentioned there?

          Show
          Lars Hofhansl added a comment - I think when this is fixed we can mark HBASE-4333 and HBASE-4334 as dups. @Joe, do you agree? Or do you think there is still merit in fixing the specific issues mentioned there?
          Hide
          Lars Hofhansl added a comment -

          I split up SplitTransaction.execute is three phases. I called them phaseI, phaseII, and phaseIII (please suggest better names). This is also nice because these can be separately documented now.

          All three are called from execute and the result is the same, but they can now be called separately (from classed in the same package, i.e. tests). This should allow me to add a test.

          Show
          Lars Hofhansl added a comment - I split up SplitTransaction.execute is three phases. I called them phaseI, phaseII, and phaseIII (please suggest better names). This is also nice because these can be separately documented now. All three are called from execute and the result is the same, but they can now be called separately (from classed in the same package, i.e. tests). This should allow me to add a test.
          Hide
          Jonathan Hsieh added a comment -

          HBASE-4377 is one potential way to fix meta tables that have holes after this bug occurs.

          Show
          Jonathan Hsieh added a comment - HBASE-4377 is one potential way to fix meta tables that have holes after this bug occurs.
          Hide
          Joe Pallas added a comment -

          @Lars, I think there is evidence that this is not the only way holes in meta might appear, so I would not mark HBASE-4333 and HBASE-4334 as duplicates. I think HBASE-4334 is the important one (the server should not assume the client is correct), and HBASE-4333 is more of an optimization (discover the problem at the client instead of the server).

          Show
          Joe Pallas added a comment - @Lars, I think there is evidence that this is not the only way holes in meta might appear, so I would not mark HBASE-4333 and HBASE-4334 as duplicates. I think HBASE-4334 is the important one (the server should not assume the client is correct), and HBASE-4333 is more of an optimization (discover the problem at the client instead of the server).
          Hide
          Lars Hofhansl added a comment -

          Agreed that HBASE-4334 is important. At least holes would not go silently. I'll work on that one too.

          Show
          Lars Hofhansl added a comment - Agreed that HBASE-4334 is important. At least holes would not go silently. I'll work on that one too.
          Hide
          Lars Hofhansl added a comment -

          New patch. Breaks SplitTransaction.execute into three parts.
          In part to make the phases clear, in part so that a test can test each of the phases independently.

          Also added a test. The test uses phaseI and phaseIII directly and mocks a bit with phaseII (that's the one that bring the daughters online and updates .META.)

          I could validate that if I change the order back to what is was before this patch the client would indeed reach the wrong region if querying past the split key and would (before HBASE-4334) silently return an empty result set.

          Let me know what you think about this change.

          TestSplitTransaction and the new TestEndToEndSplitTransaction pass.

          Show
          Lars Hofhansl added a comment - New patch. Breaks SplitTransaction.execute into three parts. In part to make the phases clear, in part so that a test can test each of the phases independently. Also added a test. The test uses phaseI and phaseIII directly and mocks a bit with phaseII (that's the one that bring the daughters online and updates .META.) I could validate that if I change the order back to what is was before this patch the client would indeed reach the wrong region if querying past the split key and would (before HBASE-4334 ) silently return an empty result set. Let me know what you think about this change. TestSplitTransaction and the new TestEndToEndSplitTransaction pass.
          Hide
          Jonathan Hsieh added a comment -

          @Lars some nits / suggestions on v3.

          TestEndToEndSplitTransaction needs license.

          Maybe a more descriptive function names for phaseI, phaseII, phaseIII?

          Any reason for the (overly?) general Class... instead of just taking a single Class and checking for null when no exceptions expected? Or maybe just make 'test' return boolean and assertTrue/assertFalse?

          Show
          Jonathan Hsieh added a comment - @Lars some nits / suggestions on v3. TestEndToEndSplitTransaction needs license. Maybe a more descriptive function names for phaseI, phaseII, phaseIII? Any reason for the (overly?) general Class... instead of just taking a single Class and checking for null when no exceptions expected? Or maybe just make 'test' return boolean and assertTrue/assertFalse?
          Hide
          Ted Yu added a comment -

          How about calling the first phase createDaughtersPhase, second openDaughtersPhase and the third phase transitionZKNodePhase ?

          Show
          Ted Yu added a comment - How about calling the first phase createDaughtersPhase, second openDaughtersPhase and the third phase transitionZKNodePhase ?
          Hide
          Lars Hofhansl added a comment -

          I like those names. Will do.
          @Jon. Initially i expected multiple different exceptions to thrown hence the general class approach. You're right here it does not make sense.

          Show
          Lars Hofhansl added a comment - I like those names. Will do. @Jon. Initially i expected multiple different exceptions to thrown hence the general class approach. You're right here it does not make sense.
          Hide
          Lars Hofhansl added a comment -

          Renamed phaseX methods addressed nits...

          Show
          Lars Hofhansl added a comment - Renamed phaseX methods addressed nits...
          Hide
          Lars Hofhansl added a comment -

          Added license notice to new test.
          Do I see some +1's?

          Show
          Lars Hofhansl added a comment - Added license notice to new test. Do I see some +1's?
          Hide
          Ted Yu added a comment -

          @Lars:
          You would get +1's after declaring the passing of test suite.

          Show
          Ted Yu added a comment - @Lars: You would get +1's after declaring the passing of test suite.
          Hide
          Lars Hofhansl added a comment -

          Results :

          Tests run: 1031, Failures: 0, Errors: 0, Skipped: 16

          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESSFUL
          [INFO] ------------------------------------------------------------------------

          Show
          Lars Hofhansl added a comment - Results : Tests run: 1031, Failures: 0, Errors: 0, Skipped: 16 [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESSFUL [INFO] ------------------------------------------------------------------------
          Hide
          Ted Yu added a comment -

          Pass.
          +1.

          Show
          Ted Yu added a comment - Pass. +1.
          Hide
          Jonathan Hsieh added a comment -

          lgtm

          Show
          Jonathan Hsieh added a comment - lgtm
          Hide
          ramkrishna.s.vasudevan added a comment -

          +1. Your test case comments looks good. May be you can also take a look at HBASE-4562 and HBASE-4563 as it is related to Splits.

          Show
          ramkrishna.s.vasudevan added a comment - +1. Your test case comments looks good. May be you can also take a look at HBASE-4562 and HBASE-4563 as it is related to Splits.
          Hide
          Lars Hofhansl added a comment -

          Integrated into 0.92 and trunk.

          Show
          Lars Hofhansl added a comment - Integrated into 0.92 and trunk.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #55 (See https://builds.apache.org/job/HBase-0.92/55/)
          Integrating HBASE-4335

          larsh :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestEndToEndSplitTransaction.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #55 (See https://builds.apache.org/job/HBase-0.92/55/ ) Integrating HBASE-4335 larsh : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestEndToEndSplitTransaction.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2318 (See https://builds.apache.org/job/HBase-TRUNK/2318/)
          Integrating HBASE-4335

          larsh :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestEndToEndSplitTransaction.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2318 (See https://builds.apache.org/job/HBase-TRUNK/2318/ ) Integrating HBASE-4335 larsh : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestEndToEndSplitTransaction.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java
          Hide
          Lars Hofhansl added a comment -

          Double checked that the failed test pass locally. Will wait for the next build to see.

          Show
          Lars Hofhansl added a comment - Double checked that the failed test pass locally. Will wait for the next build to see.
          Hide
          Jonathan Gray added a comment -

          @LarsH, in the future, please have your svn commit message be in the same format as the CHANGES.txt update (ie. HBASE-#### The title description (author [via committer])

          Show
          Jonathan Gray added a comment - @LarsH, in the future, please have your svn commit message be in the same format as the CHANGES.txt update (ie. HBASE-#### The title description (author [via committer] )
          Hide
          Lars Hofhansl added a comment -

          Ted already pointed that out to me

          Show
          Lars Hofhansl added a comment - Ted already pointed that out to me

            People

            • Assignee:
              Lars Hofhansl
              Reporter:
              Joe Pallas
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development