HBase
  1. HBase
  2. HBASE-4195

Possible inconsistency in a memstore read after a reseek, possible performance improvement

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.92.0
    • Component/s: regionserver
    • Labels:
      None
    • Environment:

      all

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      This fixes the regression (identified by a random failure of test TestHRegion#testWritesWhileGetting) introduced by HBASE-3855. In production, the failure could occur purely randomly depending on the size of the MemStore list, or during a parallel execution of a scanner#next and a memstore flush. The fix improves the scanner reseek performance as well. The parameter "hbase.hregion.memstore.reseek.maxkeys" is not used anymore.
      Show
      This fixes the regression (identified by a random failure of test TestHRegion#testWritesWhileGetting) introduced by HBASE-3855 . In production, the failure could occur purely randomly depending on the size of the MemStore list, or during a parallel execution of a scanner#next and a memstore flush. The fix improves the scanner reseek performance as well. The parameter "hbase.hregion.memstore.reseek.maxkeys" is not used anymore.

      Description

      This follows the dicussion around HBASE-3855, and the random errors (20% failure on trunk) on the unit test org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting

      I saw some points related to numIterReseek, used in the MemStoreScanner#getNext (line 690):

      679	    protected KeyValue getNext(Iterator it) {
      680	      KeyValue ret = null;
      681	      long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
      682	      //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + readPoint);
      683	 
      684	      while (ret == null && it.hasNext()) {
      685	        KeyValue v = it.next();
      686	        if (v.getMemstoreTS() <= readPoint) {
      687	          // keep it.
      688	          ret = v;
      689	        }
      690	        numIterReseek--;
      691	        if (numIterReseek == 0) {
      692	          break;
      693	         }
      694	      }
      695	      return ret;
      696	    }

      This function is called by seek, reseek, and next. The numIterReseek is only usefull for reseek.

      There are some issues, I am not totally sure it's the root cause of the test case error, but it could explain partly the randomness of the error, and one point is for sure a bug.

      1) In getNext, numIterReseek is decreased, then compared to zero. The seek function sets numIterReseek to zero before calling getNext. It means that the value will be actually negative, hence the test will always fail, and the loop will continue. It is the expected behaviour, but it's quite smart.

      2) In "reseek", numIterReseek is not set between the loops on the two iterators. If the numIterReseek is equals to zero after the loop on the first one, the loop on the second one will never call seek, as numIterReseek will be negative.

      3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && numIterReseek == 0). In other words, if kvsetNextRow is not null when numIterReseek equals zero, numIterReseek will start to be negative at the next iteration and seek will never be called.

      4) You can have side effects if reseek ends with a numIterReseek > 0: the following calls to the "next" function will decrease numIterReseek to zero, and getNext will break instead of continuing the loop. As a result, later calls to next() may return null or not depending on how is configured the default value for numIterReseek.

      To check if the issue comes from point 4, you can set the numIterReseek to zero before returning in reseek:

            numIterReseek = 0;
            return (kvsetNextRow != null || snapshotNextRow != null);
          }

      On my env, on trunk, it seems to work, but as it's random I am not really sure. I also had to modify the test (I added a loop) to make it fails more often, the original test was working quite well here.

      It has to be confirmed that this totally fix (it could be partial or unrelated) org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting before implementing a complete solution.

      1. 20110824_4195_MemStore.patch
        4 kB
        Nicolas Liochon
      2. 20110824_4195_TestHRegion.patch
        2 kB
        Nicolas Liochon
      3. 20110915_4195_4188_MemStore.patch
        9 kB
        Nicolas Liochon
      4. 4195-v2.txt
        9 kB
        Ted Yu
      5. 4195-v3.txt
        9 kB
        Ted Yu

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2222 (See https://builds.apache.org/job/HBase-TRUNK/2222/)
          HBASE-4195 Possible inconsistency in a memstore read after a reseek, possible performance improvement

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2222 (See https://builds.apache.org/job/HBase-TRUNK/2222/ ) HBASE-4195 Possible inconsistency in a memstore read after a reseek, possible performance improvement stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
          Hide
          Ted Yu added a comment -

          Patch v3 addresses the two places in comments.

          Show
          Ted Yu added a comment - Patch v3 addresses the two places in comments.
          Hide
          stack added a comment -

          Committed to TRUNK and fixed the Ted and Andrew comments on commit. Resolving against 0.92. I think we might want to commit this to 0.90 to get over hbase-3855. Waiting on Andrew feedback. Nice one N. Thanks.

          Show
          stack added a comment - Committed to TRUNK and fixed the Ted and Andrew comments on commit. Resolving against 0.92. I think we might want to commit this to 0.90 to get over hbase-3855. Waiting on Andrew feedback. Nice one N. Thanks.
          Hide
          Nicolas Liochon added a comment -

          Ok, thank you.
          The remaining points I saw in 4195-v2.txt are the one mentioned above by Ted at 16/Sep/11 04:11 and by Andrew at 16/Sep/11 01:19 (modifications in the comments not in the code).

          Show
          Nicolas Liochon added a comment - Ok, thank you. The remaining points I saw in 4195-v2.txt are the one mentioned above by Ted at 16/Sep/11 04:11 and by Andrew at 16/Sep/11 01:19 (modifications in the comments not in the code).
          Hide
          stack added a comment -

          I can do clean up on commit N. No need of new patch. Let me run some tests first.

          Show
          stack added a comment - I can do clean up on commit N. No need of new patch. Let me run some tests first.
          Hide
          Ted Yu added a comment -

          @N:
          4195-v2.txt has taken care of heap size issue.

          Show
          Ted Yu added a comment - @N: 4195-v2.txt has taken care of heap size issue.
          Hide
          Nicolas Liochon added a comment -

          Thank you. What is the next step? Should I write a new patch taking into account all the different comment?

          Show
          Nicolas Liochon added a comment - Thank you. What is the next step? Should I write a new patch taking into account all the different comment?
          Hide
          Andrew Purtell added a comment -

          Yes, it's because I removed a field in MemStore (the reseek counter). It seems that the fix is to change 12 to 11 in MemStore.java

          This is the correct fix for that.

          Show
          Andrew Purtell added a comment - Yes, it's because I removed a field in MemStore (the reseek counter). It seems that the fix is to change 12 to 11 in MemStore.java This is the correct fix for that.
          Hide
          Nicolas Liochon added a comment -

          Same here, I got one failure with TestGetRowVersions, but it seems unrelated (it worked once & failed once).
          Other unit tests in errors fail with and without the patch.

          Show
          Nicolas Liochon added a comment - Same here, I got one failure with TestGetRowVersions, but it seems unrelated (it worked once & failed once). Other unit tests in errors fail with and without the patch.
          Hide
          Ted Yu added a comment -

          Test suite passed.

          Show
          Ted Yu added a comment - Test suite passed.
          Hide
          Nicolas Liochon added a comment -

          Note that I am currently running the full unit test and checking the failures are not related to this fix.

          Show
          Nicolas Liochon added a comment - Note that I am currently running the full unit test and checking the failures are not related to this fix.
          Hide
          Ted Yu added a comment -

          @N:
          You're right.
          Patch v2 passes TestHeapSize.

          Let's see what Stack says about it.

          Show
          Ted Yu added a comment - @N: You're right. Patch v2 passes TestHeapSize. Let's see what Stack says about it.
          Hide
          Nicolas Liochon added a comment -

          Yes, it's because I removed a field in MemStore (the reseek counter).

          It seems that the fix is to change 12 to 11 in MemStore.java

           
            public final static long FIXED_OVERHEAD = ClassSize.align(
                ClassSize.OBJECT + (12 * ClassSize.REFERENCE));
          

          The unit test is happy after this change, do you confirm it's the right solution?

          Show
          Nicolas Liochon added a comment - Yes, it's because I removed a field in MemStore (the reseek counter). It seems that the fix is to change 12 to 11 in MemStore.java public final static long FIXED_OVERHEAD = ClassSize.align( ClassSize.OBJECT + (12 * ClassSize.REFERENCE)); The unit test is happy after this change, do you confirm it's the right solution?
          Hide
          Nicolas Liochon added a comment -

          I am going to check.

          Show
          Nicolas Liochon added a comment - I am going to check.
          Hide
          Ted Yu added a comment -

          I think the following test failure (w.r.t. MemStore.FIXED_OVERHEAD) is related to this JIRA:

          testSizes(org.apache.hadoop.hbase.io.TestHeapSize)  Time elapsed: 0.056 sec  <<< FAILURE!
          junit.framework.AssertionFailedError: expected:<104> but was:<112>
                  at junit.framework.Assert.fail(Assert.java:47)
                  at junit.framework.Assert.failNotEquals(Assert.java:283)
                  at junit.framework.Assert.assertEquals(Assert.java:64)
                  at junit.framework.Assert.assertEquals(Assert.java:130)
                  at junit.framework.Assert.assertEquals(Assert.java:136)
                  at org.apache.hadoop.hbase.io.TestHeapSize.testSizes(TestHeapSize.java:272)
          
          Show
          Ted Yu added a comment - I think the following test failure (w.r.t. MemStore.FIXED_OVERHEAD) is related to this JIRA: testSizes(org.apache.hadoop.hbase.io.TestHeapSize) Time elapsed: 0.056 sec <<< FAILURE! junit.framework.AssertionFailedError: expected:<104> but was:<112> at junit.framework.Assert.fail(Assert.java:47) at junit.framework.Assert.failNotEquals(Assert.java:283) at junit.framework.Assert.assertEquals(Assert.java:64) at junit.framework.Assert.assertEquals(Assert.java:130) at junit.framework.Assert.assertEquals(Assert.java:136) at org.apache.hadoop.hbase.io.TestHeapSize.testSizes(TestHeapSize.java:272)
          Hide
          Ted Yu added a comment -

          Minor comment:

          +      // kvset and snapshot will never be null.
          +      // if tailSet can't find anything, SS is empty (not null).
          

          I think SS above should be replaced with SortedSet.

          Show
          Ted Yu added a comment - Minor comment: + // kvset and snapshot will never be null . + // if tailSet can't find anything, SS is empty (not null ). I think SS above should be replaced with SortedSet.
          Hide
          Andrew Purtell added a comment -

          Looks good, especially the new comments, +1.

          There is a minor typo in the comments that can be cleaned up on commit:

          +      1) t's not possible to use the 'kvTail' and 'snapshot'
          

          should be

          +      1) It's not possible to use the 'kvTail' and 'snapshot'
          
          Show
          Andrew Purtell added a comment - Looks good, especially the new comments, +1. There is a minor typo in the comments that can be cleaned up on commit: + 1) t's not possible to use the 'kvTail' and 'snapshot' should be + 1) It's not possible to use the 'kvTail' and 'snapshot'
          Hide
          Ted Yu added a comment -

          +1 on latest patch.
          A few empty lines can be removed at time of commit.

          Ran TestHRegion#testWritesWhileGetting 101 times which all passed based on latest patch.

          Show
          Ted Yu added a comment - +1 on latest patch. A few empty lines can be removed at time of commit. Ran TestHRegion#testWritesWhileGetting 101 times which all passed based on latest patch.
          Hide
          Nicolas Liochon added a comment -

          Patch for 4195 and 4188, taking in account all the points mentionned above. The patch :
          for the test case itself is recommended as well, but not mandatory.

          I have as well done some very minor refactoring on some functions that I was touching for the patch:

          • @Override added
          • getLower renamed to getLowest as in MemStore
          • test and temp var var removed in next()
          Show
          Nicolas Liochon added a comment - Patch for 4195 and 4188, taking in account all the points mentionned above. The patch : for the test case itself is recommended as well, but not mandatory. I have as well done some very minor refactoring on some functions that I was touching for the patch: @Override added getLower renamed to getLowest as in MemStore test and temp var var removed in next()
          Hide
          stack added a comment -

          Sure on single patch. Thanks.

          Show
          stack added a comment - Sure on single patch. Thanks.
          Hide
          Nicolas Liochon added a comment -

          Ok, will do (likely this week end). Do you mind if I do a single patch for this jira and for HBASE-4188? They both touch MemStore seek and reseek.

          Show
          Nicolas Liochon added a comment - Ok, will do (likely this week end). Do you mind if I do a single patch for this jira and for HBASE-4188 ? They both touch MemStore seek and reseek.
          Hide
          stack added a comment -

          N

          Would you mind making a patch that puts common code into a single method and that heavily docs what you've found; i.e. repeat in code your expectations above so if we want to change this code subsequently or the scope of readpoint changes, the editor will get the benefit of your rumination?

          Thanks boss.

          Show
          stack added a comment - N Would you mind making a patch that puts common code into a single method and that heavily docs what you've found; i.e. repeat in code your expectations above so if we want to change this code subsequently or the scope of readpoint changes, the editor will get the benefit of your rumination? Thanks boss.
          Hide
          Nicolas Liochon added a comment -

          @stack: yes I am ok with all your points. Thanks! Some details below:

          Are seek and reseek the same now? Or it seems like they have a bunch of common code... can we factor it out to common method if so?

          The initialization of kvTail & snapshotTail differs, then it's the same code. There are only 6 lines of code, but I aggree, it would be cleaner if shared in a private method (this would simplify as well the improvement on peek)

          We're fixing a bug where we may miss a Put if a flush comes in in meantime because we won't have a running Iterator on new KVSet (but maybe this is not such a big deal - perhaps - because its unlikely the new Put will be within the purview of the current read point?

          That's what I expect. Note that between the 3 implementations:

          • the initial one: it was impossible because we were just using the iterator without going back to the list.
          • the one currently in the tunk: possible because we're restarting from the very beginning of the list.
          • the proposed one; in the middle: we're not restarting from the beginning from from an intermediate point of the list.

          So we're not in the same situation as we were 2 years ago, but I expect (without having done a full analysis) that the readpoint will hide this.

          The best of the best, in terms of performance and similarity to the initial implementation, would be to get the sub-skiplist implictly pointed by the iterator, but there is nothing in the Java API to do it today: it would require to implement a specific skip list.

          Show
          Nicolas Liochon added a comment - @stack: yes I am ok with all your points. Thanks! Some details below: Are seek and reseek the same now? Or it seems like they have a bunch of common code... can we factor it out to common method if so? The initialization of kvTail & snapshotTail differs, then it's the same code. There are only 6 lines of code, but I aggree, it would be cleaner if shared in a private method (this would simplify as well the improvement on peek) We're fixing a bug where we may miss a Put if a flush comes in in meantime because we won't have a running Iterator on new KVSet (but maybe this is not such a big deal - perhaps - because its unlikely the new Put will be within the purview of the current read point? That's what I expect. Note that between the 3 implementations: the initial one: it was impossible because we were just using the iterator without going back to the list. the one currently in the tunk: possible because we're restarting from the very beginning of the list. the proposed one; in the middle: we're not restarting from the beginning from from an intermediate point of the list. So we're not in the same situation as we were 2 years ago, but I expect (without having done a full analysis) that the readpoint will hide this. The best of the best, in terms of performance and similarity to the initial implementation, would be to get the sub-skiplist implictly pointed by the iterator, but there is nothing in the Java API to do it today: it would require to implement a specific skip list.
          Hide
          stack added a comment -

          N:

          We should remove:

          this.reseekNumKeys = conf.getInt(RESEEKMAX_KEY, RESEEKMAX_DEFAULT);
          

          and reseekNumKeys and the defines else someone will come along later and wonder what these were about?

          Are seek and reseek the same now? Or it seems like they have a bunch of common code... can we factor it out to common method if so?

          To add to your synthesis:

          + We're removing the numIterReseek facility because we get new tailSet every time we reseek.
          + It looks like there is nice perf. benefit if we go w/ this patch
          + We're fixing a bug where we may miss a Put if a flush comes in in meantime because we won't have a running Iterator on new KVSet (but maybe this is not such a big deal – perhaps – because its unlikely the new Put will be within the purview of the current read point?

          If you agree on the above changes and synthesis, no need of a new patch. I"ll do the clean up on commit. Thanks boss.

          Show
          stack added a comment - N: We should remove: this .reseekNumKeys = conf.getInt(RESEEKMAX_KEY, RESEEKMAX_DEFAULT); and reseekNumKeys and the defines else someone will come along later and wonder what these were about? Are seek and reseek the same now? Or it seems like they have a bunch of common code... can we factor it out to common method if so? To add to your synthesis: + We're removing the numIterReseek facility because we get new tailSet every time we reseek. + It looks like there is nice perf. benefit if we go w/ this patch + We're fixing a bug where we may miss a Put if a flush comes in in meantime because we won't have a running Iterator on new KVSet (but maybe this is not such a big deal – perhaps – because its unlikely the new Put will be within the purview of the current read point? If you agree on the above changes and synthesis, no need of a new patch. I"ll do the clean up on commit. Thanks boss.
          Hide
          Ted Yu added a comment -

          I already voted +1 on latest patch.

          Show
          Ted Yu added a comment - I already voted +1 on latest patch.
          Hide
          Nicolas Liochon added a comment -

          Synthesis:

          • the patch on the test case itself helps to understand where the error come from, but does not change anything else
          • the fix on the memstore fix the menstore part.
          • there is still another issue, not linked to the memstore but to the flush part (see message above from 23/Aug/11)
          • the test cas may fail for this reason. The probability of a failure is increased by increasing the test count and lowering the flush interval
          • Anyway, the flush issue is already handled in HBASE-2856.

          So I think we can consider the patch as ok for the scope of this bug? @Ted : @Stack: do you agree? Do you need more info?

          Show
          Nicolas Liochon added a comment - Synthesis: the patch on the test case itself helps to understand where the error come from, but does not change anything else the fix on the memstore fix the menstore part. there is still another issue, not linked to the memstore but to the flush part (see message above from 23/Aug/11) the test cas may fail for this reason. The probability of a failure is increased by increasing the test count and lowering the flush interval Anyway, the flush issue is already handled in HBASE-2856 . So I think we can consider the patch as ok for the scope of this bug? @Ted : @Stack: do you agree? Do you need more info?
          Hide
          Ted Yu added a comment -

          Clarification for my earlier comments:
          I performed the test (TestHRegion#testWritesWhileGetting) using the patch 20110824_4195_MemStore.patch.

          Show
          Ted Yu added a comment - Clarification for my earlier comments: I performed the test (TestHRegion#testWritesWhileGetting) using the patch 20110824_4195_MemStore.patch.
          Hide
          Nicolas Liochon added a comment -

          With the current implementation, it can fails in TestHRegion at assert "assertEquals("i=" + i, expectedCount, result.size());" when there is a mess up on the lists. It should not occur with the new implementation.

          The issue in the flush is shown by the assert just next to this one. It's (log added in the new patch)

                        LOG.warn("These two KV should have the same value." +
                          " Previous KV:" +
                          previousKV + "(memStoreTS:" + previousKV.getMemstoreTS() + ")" +
                          ", New KV: " +
                          kv + "(memStoreTS:" + kv.getMemstoreTS() + ")"
                        );
                        assertEquals(previousKV.getValue(), thisValue);
          

          With these modified values on testWritesWhileGetting:

              int testCount = 1000; // more iterations. Increase it more if necessary.
              // [...]
              int flushInterval = 2; // more flush
              int compactInterval = 1000 * flushInterval; // no compact (should have no impact, but...)
          

          I produced this error on the trunk + patch proposed:

          [...]
          put iteration = 1927
          put iteration = 2021
          These two KV should have the same value. Previous KV:row0/family4:qual99/1942/Put/vlen=4(memStoreTS:993), New KV: row0/family5:qual0/1944/Put/vlen=4(memStoreTS:0)
          E
          Time: 19.051
          There was 1 failure:
          1) testWritesWhileGetting(org.apache.hadoop.hbase.regionserver.TestHRegion)
          junit.framework.AssertionFailedError: expected:<1942> but was:<1944>
          	at org.apache.hadoop.hbase.HBaseTestCase.assertEquals(HBaseTestCase.java:704)
          	at org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting(TestHRegion.java:2781)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          [...]
          

          May be there are more memory allocations on 0.90 and as a consequence more random interruptions caused by the GC? Note that I am doing the tests outside of Maven, but with JUnit, without specific java parameters.

          If worth checking that on .90 it is as well an error when the family changes (qualifier equals qual0 or qual1). May be it's something different on .90.

          Unfortunatly (or not ) I am on vacations for a week, so I won't be able to give a hand the next days, but I will be back middle of next week.

          Show
          Nicolas Liochon added a comment - With the current implementation, it can fails in TestHRegion at assert "assertEquals("i=" + i, expectedCount, result.size());" when there is a mess up on the lists. It should not occur with the new implementation. The issue in the flush is shown by the assert just next to this one. It's (log added in the new patch) LOG.warn("These two KV should have the same value." + " Previous KV:" + previousKV + "(memStoreTS:" + previousKV.getMemstoreTS() + ")" + ", New KV: " + kv + "(memStoreTS:" + kv.getMemstoreTS() + ")" ); assertEquals(previousKV.getValue(), thisValue); With these modified values on testWritesWhileGetting: int testCount = 1000; // more iterations. Increase it more if necessary. // [...] int flushInterval = 2; // more flush int compactInterval = 1000 * flushInterval; // no compact (should have no impact, but...) I produced this error on the trunk + patch proposed: [...] put iteration = 1927 put iteration = 2021 These two KV should have the same value. Previous KV:row0/family4:qual99/1942/Put/vlen=4(memStoreTS:993), New KV: row0/family5:qual0/1944/Put/vlen=4(memStoreTS:0) E Time: 19.051 There was 1 failure: 1) testWritesWhileGetting(org.apache.hadoop.hbase.regionserver.TestHRegion) junit.framework.AssertionFailedError: expected:<1942> but was:<1944> at org.apache.hadoop.hbase.HBaseTestCase.assertEquals(HBaseTestCase.java:704) at org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting(TestHRegion.java:2781) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [...] May be there are more memory allocations on 0.90 and as a consequence more random interruptions caused by the GC? Note that I am doing the tests outside of Maven, but with JUnit, without specific java parameters. If worth checking that on .90 it is as well an error when the family changes (qualifier equals qual0 or qual1). May be it's something different on .90. Unfortunatly (or not ) I am on vacations for a week, so I won't be able to give a hand the next days, but I will be back middle of next week.
          Hide
          stack added a comment -

          But may be it's not an issue as these put could be skipped anyway by the readpoint criteria ? I just don't know. This state does not last long, as the Store will recreate the scanner when notified of the flush.

          It looks like region scanner takes out a read point on construction and holds to it as scanner runs so I guess neither put would be seen (since they came in after scanner started so will have write points in excess of the read point). I'm just a tourist in this code so I could be wrong.

          N, so the failing TestHRegion#testWritesWhileGetting happens on the 0.90 branch only. Do you want me to try this patch on 0.90? We don't fail on TRUNK (Ted show this the case above and I verified it a while back). How could this be the case when the MemStore is same in both code branches?

          Show
          stack added a comment - But may be it's not an issue as these put could be skipped anyway by the readpoint criteria ? I just don't know. This state does not last long, as the Store will recreate the scanner when notified of the flush. It looks like region scanner takes out a read point on construction and holds to it as scanner runs so I guess neither put would be seen (since they came in after scanner started so will have write points in excess of the read point). I'm just a tourist in this code so I could be wrong. N, so the failing TestHRegion#testWritesWhileGetting happens on the 0.90 branch only. Do you want me to try this patch on 0.90? We don't fail on TRUNK (Ted show this the case above and I verified it a while back). How could this be the case when the MemStore is same in both code branches?
          Hide
          Ted Yu added a comment -

          Consecutively run TestHRegion#testWritesWhileGetting 100 times.
          All of which passed.

          Show
          Ted Yu added a comment - Consecutively run TestHRegion#testWritesWhileGetting 100 times. All of which passed.
          Hide
          Ted Yu added a comment -

          +1 on patch (some extra empty lines should be removed).

          All tests pass.

          Show
          Ted Yu added a comment - +1 on patch (some extra empty lines should be removed). All tests pass.
          Hide
          Nicolas Liochon added a comment -

          TestHRegion: added log to help understand where the problem comes from + modification of the put thread to insert strings (easier to read).

          MemStore: implementation of "reseek" by using sublist set in "seek"

          Show
          Nicolas Liochon added a comment - TestHRegion: added log to help understand where the problem comes from + modification of the put thread to insert strings (easier to read). MemStore: implementation of "reseek" by using sublist set in "seek"
          Hide
          Nicolas Liochon added a comment -

          I tested 2 cases:
          1) The current implementation
          2) A new one (reusing the first one logic without calling seek).

          Times (for public void TestHRegion#testWritesWhileGetting()) are, on a virtualized environment:

          Current - 26s
          Proposed - 13s

          It's a little bit suprising to have a so good result, but it could be explained by the behaviour when the counter gets negative: we can have both a lot of iterations on the list + a call to seek. There is also a lot of gc related time, so it will depends on the memory available.

          But as a conclusion, it seems that it's interesting to use directly the search on the sublist. I will do a patch on the trunk.

          Show
          Nicolas Liochon added a comment - I tested 2 cases: 1) The current implementation 2) A new one (reusing the first one logic without calling seek). Times (for public void TestHRegion#testWritesWhileGetting()) are, on a virtualized environment: Current - 26s Proposed - 13s It's a little bit suprising to have a so good result, but it could be explained by the behaviour when the counter gets negative: we can have both a lot of iterations on the list + a call to seek. There is also a lot of gc related time, so it will depends on the memory available. But as a conclusion, it seems that it's interesting to use directly the search on the sublist. I will do a patch on the trunk.
          Hide
          Nicolas Liochon added a comment -

          The behavior we have with the previous (or with the one I propose)
          implementation is:

          • before the flush, the MemScanner iterator points on the KV Lists of the
            MemStore. So the "put" on the MemStore.kvset will be seen by the scanner.
          • The flush does:
            snapshot = kvset;
            kvset = new SkipList();
          • So after the flush, the "put" will be made on the new Memstore.kvset,
            hence not visible by the existing scanner. So the memScanner behaves
            differently before & after the flush.

          But may be it's not an issue as these put could be skipped anyway by the
          readpoint criteria ? I just don't know. This state does not last long, as
          the Store will recreate the scanner when notified of the flush.

          Show
          Nicolas Liochon added a comment - The behavior we have with the previous (or with the one I propose) implementation is: before the flush, the MemScanner iterator points on the KV Lists of the MemStore. So the "put" on the MemStore.kvset will be seen by the scanner. The flush does: snapshot = kvset; kvset = new SkipList(); So after the flush, the "put" will be made on the new Memstore.kvset, hence not visible by the existing scanner. So the memScanner behaves differently before & after the flush. But may be it's not an issue as these put could be skipped anyway by the readpoint criteria ? I just don't know. This state does not last long, as the Store will recreate the scanner when notified of the flush.
          Hide
          stack added a comment -

          Problem solved then

          Ehh... not exactly (smile)

          For #1, note that the previous implementation will completly ignore the modifications made during the flush: the flush creates a new KV List (kvset), and this list is not seen by the MemScanner, as it uses the previous iterator. I don't know if it's an issue or not.

          This is what we want, right? A consistent view across the flush; definitely not part of 'snapshot' only.

          Your suggested fix sounds good (we'll keep iterators on whatever the sets were when scan started in spite of a flush coming in midway through the scan?)

          Show
          stack added a comment - Problem solved then Ehh... not exactly (smile) For #1, note that the previous implementation will completly ignore the modifications made during the flush: the flush creates a new KV List (kvset), and this list is not seen by the MemScanner, as it uses the previous iterator. I don't know if it's an issue or not. This is what we want, right? A consistent view across the flush; definitely not part of 'snapshot' only. Your suggested fix sounds good (we'll keep iterators on whatever the sets were when scan started in spite of a flush coming in midway through the scan?)
          Hide
          Nicolas Liochon added a comment -

          For #2, yes, it seems that HBASE-2856 is addressing this type of issues. Problem solved then

          For #1, note that the previous implementation will completly ignore the modifications made during the flush: the flush creates a new KV List (kvset), and this list is not seen by the MemScanner, as it uses the previous iterator. I don't know if it's an issue or not.

          This said, I believe it's possible to have an implementation with the same properties as the previous one, with an optimized reseek time, by keeping the pointers to the sublists in the MemStoreScanner. The "reseek" implementation would then become very similar to the "seek" one. I tested this approach, it seems to work functionally as the previous one (i.e. I fail in the case #2 mentionned above). I have not tested the reality of performance improvement, but if there is an aggreement on this approach, I can do it.

          The implementation of reseek would be:

              public synchronized boolean reseek(KeyValue key) {
                  // kvset and snapshot will never be empty.
                  // if tailSet cant find anything, SS is empty (not null).
                  kvTail = kvTail.tailSet(key, true);
                  snapshotTail = snapshotTail.tailSet(key, true);
          
                  kvsetIt = kvTail.iterator();
                  snapshotIt = snapshotTail.iterator();
                  
                  kvsetNextRow = getNext(kvsetIt);
                  snapshotNextRow = getNext(snapshotIt);
          
                  KeyValue lowest = getLowest();
          
                  // has data := (lowest != null)
                  return lowest != null;
              }
          
          Show
          Nicolas Liochon added a comment - For #2, yes, it seems that HBASE-2856 is addressing this type of issues. Problem solved then For #1, note that the previous implementation will completly ignore the modifications made during the flush: the flush creates a new KV List (kvset), and this list is not seen by the MemScanner, as it uses the previous iterator. I don't know if it's an issue or not. This said, I believe it's possible to have an implementation with the same properties as the previous one, with an optimized reseek time, by keeping the pointers to the sublists in the MemStoreScanner. The "reseek" implementation would then become very similar to the "seek" one. I tested this approach, it seems to work functionally as the previous one (i.e. I fail in the case #2 mentionned above). I have not tested the reality of performance improvement, but if there is an aggreement on this approach, I can do it. The implementation of reseek would be: public synchronized boolean reseek(KeyValue key) { // kvset and snapshot will never be empty. // if tailSet cant find anything, SS is empty (not null). kvTail = kvTail.tailSet(key, true); snapshotTail = snapshotTail.tailSet(key, true); kvsetIt = kvTail.iterator(); snapshotIt = snapshotTail.iterator(); kvsetNextRow = getNext(kvsetIt); snapshotNextRow = getNext(snapshotIt); KeyValue lowest = getLowest(); // has data := (lowest != null) return lowest != null; }
          Hide
          stack added a comment -

          #2 above looks like HBASE-2856 (HBASE-2856 is a long issue but skim it and I think you'll find it what you describe here – Amit is working on this one; we chatted at the hackathon yesterday on the issue).

          #1 sounds bad; should we back out the reseek?

          Show
          stack added a comment - #2 above looks like HBASE-2856 ( HBASE-2856 is a long issue but skim it and I think you'll find it what you describe here – Amit is working on this one; we chatted at the hackathon yesterday on the issue). #1 sounds bad; should we back out the reseek?
          Hide
          Nicolas Liochon added a comment -

          Update: I have two other scenarios for failure, both linked to flush occuring during a get.

          1) With the current/optimized implementation of reseek, the set snapshot and kvset can be changed by the thread doing the flush right in the middle of the reseek. This will lead to an unconsistant state.

          A second effect of the flush process creating a new kvset is that the latter writes may not be seen by the MemStore scanner, as it will still be connected to the previous kvset.

          2) More important, and actually not linked to the reseek optimization itself, the following scenario will lead to see a write on multiple families as non atomic.

          t1 : put, it finishes at t2. Write a single row with multiple families.
          t3 : get starts, it finishes at t9
          t4 : the get continues, reads the value for the first family (scanner)
          t5 : put, it finishes at t6. Change the values of the row previously written.
          t7 : flush start, it finishes at t8.
          t9 : the get continues, reads the value for the second family (scanner) from the FileScanner
          t10: get finishes

          In this case, the get will have the values of the first write for the first families, and the value of the second write for the last families.

          This is due to the fact that the flush process create a file, and notifies the scanners. The scanners then refreshes their view. The notification is "java synchronized" with the next in the StoreFileScanner, so it does not happen during a scan within a family, but it occurs between the families within a read, as there is one scanner per family. If you add a read lock in the next() (on HRegion.updatelock), the problem does not occur, as the flush will not take place during a read.

          As it's a random bug, there can be other scenarios. In my environnement, when there is a failure:

          • it's always with a KV with a memstoreTS equals to 0
          • the column is always "qual1"

          As said, this second is actually not linked to the modifications on "MemStoreScanner#reseek", but is linked to the flush/get parallel execution. I would tend to think that the issue happens in production as well.

          Show
          Nicolas Liochon added a comment - Update: I have two other scenarios for failure, both linked to flush occuring during a get. 1) With the current/optimized implementation of reseek, the set snapshot and kvset can be changed by the thread doing the flush right in the middle of the reseek. This will lead to an unconsistant state. A second effect of the flush process creating a new kvset is that the latter writes may not be seen by the MemStore scanner, as it will still be connected to the previous kvset. 2) More important, and actually not linked to the reseek optimization itself, the following scenario will lead to see a write on multiple families as non atomic. t1 : put, it finishes at t2. Write a single row with multiple families. t3 : get starts, it finishes at t9 t4 : the get continues, reads the value for the first family (scanner) t5 : put, it finishes at t6. Change the values of the row previously written. t7 : flush start, it finishes at t8. t9 : the get continues, reads the value for the second family (scanner) from the FileScanner t10: get finishes In this case, the get will have the values of the first write for the first families, and the value of the second write for the last families. This is due to the fact that the flush process create a file, and notifies the scanners. The scanners then refreshes their view. The notification is "java synchronized" with the next in the StoreFileScanner, so it does not happen during a scan within a family, but it occurs between the families within a read, as there is one scanner per family. If you add a read lock in the next() (on HRegion.updatelock), the problem does not occur, as the flush will not take place during a read. As it's a random bug, there can be other scenarios. In my environnement, when there is a failure: it's always with a KV with a memstoreTS equals to 0 the column is always "qual1" As said, this second is actually not linked to the modifications on "MemStoreScanner#reseek", but is linked to the flush/get parallel execution. I would tend to think that the issue happens in production as well.
          Hide
          Nicolas Liochon added a comment -

          I can do a simple patch (removing all the code around numIterReseek). However, it would conflict with the patch for HBASE-4188/HBASE-1938. Is it possible for you to commit this one first?

          Note that I have been able to make this reseek implementation fails as well by adding a Thread.sleep between the search on the two iterators. In other words, there is a race condition somewhere. It could be a conflict with the "flush" process. I noticed that a flush cannot happen during a put (lock on hregion.update) or a seek (lock on store), but there is nothing to prevent a reseek to take place during the snapshot. But I don't how long it will take to find the real issue behind all this, so a partial fix lowering the probability of having an issue makes sense...

          Show
          Nicolas Liochon added a comment - I can do a simple patch (removing all the code around numIterReseek). However, it would conflict with the patch for HBASE-4188 / HBASE-1938 . Is it possible for you to commit this one first? Note that I have been able to make this reseek implementation fails as well by adding a Thread.sleep between the search on the two iterators. In other words, there is a race condition somewhere. It could be a conflict with the "flush" process. I noticed that a flush cannot happen during a put (lock on hregion.update) or a seek (lock on store), but there is nothing to prevent a reseek to take place during the snapshot. But I don't how long it will take to find the real issue behind all this, so a partial fix lowering the probability of having an issue makes sense...
          Show
          Ted Yu added a comment - TestHRegion.testWritesWhileGetting failed in build 266: https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/lastCompletedBuild/testReport/org.apache.hadoop.hbase.regionserver/TestHRegion/testWritesWhileGetting/
          Hide
          stack added a comment -

          Can someone confirm that we should not see partial writes in this case?

          If we are seeing partial writes then we are not paying attention to RWCC sequence numbers properly. The refactored reseek is probably how the code was before hbase-3855 (I didn't check) and that code was around when RWCC was being worked out is my guess... so it probably 'worked' (not seeing parital writes). The new patch adding seeks may not be watching RWCC properly.

          You've done some nice analysis above N. Any chance of your seeing it home? I will run my dumb TestHRegion test if you have any patch you'd like me to try.

          Show
          stack added a comment - Can someone confirm that we should not see partial writes in this case? If we are seeing partial writes then we are not paying attention to RWCC sequence numbers properly. The refactored reseek is probably how the code was before hbase-3855 (I didn't check) and that code was around when RWCC was being worked out is my guess... so it probably 'worked' (not seeing parital writes). The new patch adding seeks may not be watching RWCC properly. You've done some nice analysis above N. Any chance of your seeing it home? I will run my dumb TestHRegion test if you have any patch you'd like me to try.
          Hide
          Nicolas Liochon added a comment -

          Do you want me do write the patch?

          You confirm that we should not see a partial write?

          Show
          Nicolas Liochon added a comment - Do you want me do write the patch? You confirm that we should not see a partial write?
          Hide
          Ted Yu added a comment -

          I think that will do the trick.
          I propose setting RESEEKMAX_DEFAULT to -1.

          Show
          Ted Yu added a comment - I think that will do the trick. I propose setting RESEEKMAX_DEFAULT to -1.
          Hide
          Nicolas Liochon added a comment -

          With the current implementation, setting the config RESEEKMAX_KEY to -1 (read with conf.getInt(RESEEKMAX_KEY, RESEEKMAX_DEFAULT) will have this effect. disclaimer: i did not test it.

          Show
          Nicolas Liochon added a comment - With the current implementation, setting the config RESEEKMAX_KEY to -1 (read with conf.getInt(RESEEKMAX_KEY, RESEEKMAX_DEFAULT) will have this effect. disclaimer: i did not test it.
          Hide
          Ted Yu added a comment -

          I think the feature of reseek() selectively calling seek() should be governed by a config parameter until total understanding of the intricacies is reached.

          Show
          Ted Yu added a comment - I think the feature of reseek() selectively calling seek() should be governed by a config parameter until total understanding of the intricacies is reached.
          Hide
          Nicolas Liochon added a comment -

          The issue with the implementation calling only seek is that we can see "writes in progress". From my understanding, it should not be the case (and at least, if it's allowed, there is an issue in the test case itself).

          The error is this assert: Assert.assertEquals("i=" + i, expectedCount, result.size());, that's different from the one mentionned in HBASE-3855.

          If I change the reseek implementation to something that does no call seek at all, like:

              public boolean reseek(KeyValue key) {
                while (kvsetNextRow != null &&
                  comparator.compare(kvsetNextRow, key) < 0) {
                  kvsetNextRow = getNext(kvsetIt);
                }
          
                while (snapshotNextRow != null &&
                  comparator.compare(snapshotNextRow, key) < 0) {
                  snapshotNextRow = getNext(snapshotIt);
                }
          
                numIterReseek = 0;
                return (kvsetNextRow != null || snapshotNextRow != null);
              }

          The whole test works fine. So it seems the issue really comes from using seek. The current implementation should have the same issue I think. May be we don't see it often (or at all) because seek is not called that often because of the points mentionned in 2 & 3 in the analysis above.

          Can someone confirm that we should not see partial writes in this case?

          Show
          Nicolas Liochon added a comment - The issue with the implementation calling only seek is that we can see "writes in progress". From my understanding, it should not be the case (and at least, if it's allowed, there is an issue in the test case itself). The error is this assert: Assert.assertEquals("i=" + i, expectedCount, result.size());, that's different from the one mentionned in HBASE-3855 . If I change the reseek implementation to something that does no call seek at all, like: public boolean reseek(KeyValue key) { while (kvsetNextRow != null && comparator.compare(kvsetNextRow, key) < 0) { kvsetNextRow = getNext(kvsetIt); } while (snapshotNextRow != null && comparator.compare(snapshotNextRow, key) < 0) { snapshotNextRow = getNext(snapshotIt); } numIterReseek = 0; return (kvsetNextRow != null || snapshotNextRow != null); } The whole test works fine. So it seems the issue really comes from using seek. The current implementation should have the same issue I think. May be we don't see it often (or at all) because seek is not called that often because of the points mentionned in 2 & 3 in the analysis above. Can someone confirm that we should not see partial writes in this case?
          Hide
          Nicolas Liochon added a comment -

          FWIW, I also tried to change reseek to call "seek" all the time, but I have strange results (i.e. errors). I don't know if we can swap a loop of "next" with a single call to seek.Here is the implementation I tried;

              public boolean reseek(KeyValue key) {
                return seek(key);
              }
          

          But the test case fails (randomly). So there could be more than one issue.

          Show
          Nicolas Liochon added a comment - FWIW, I also tried to change reseek to call "seek" all the time, but I have strange results (i.e. errors). I don't know if we can swap a loop of "next" with a single call to seek.Here is the implementation I tried; public boolean reseek(KeyValue key) { return seek(key); } But the test case fails (randomly). So there could be more than one issue.

            People

            • Assignee:
              Nicolas Liochon
              Reporter:
              Nicolas Liochon
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development