Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-26438

Fix flaky test TestHStore.testCompactingMemStoreCellExceedInmemoryFlushSize

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.0.0-alpha-1, 2.4.8
    • 2.5.0, 3.0.0-alpha-2, 2.4.9
    • test
    • None
    • Reviewed

    Description

      TestHStore.testCompactingMemStoreCellExceedInmemoryFlushSize sometimes is failed and the error stack is :

      org.apache.hadoop.hbase.DroppedSnapshotException: Failed clearing memory after 6 attempts on region: table,,1636366699236.cc87b170ebf180005d5c4bf09d404b3d.
      	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1788)
      	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1595)
      	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1563)
      	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1528)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.tearDown(TestHStore.java:661)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
      	at org.junit.internal.runners.statements.RunAfters.invokeMethod(RunAfters.java:46)
      	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
      	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
      	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
      	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
      	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
      	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
      	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
      	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
      	at org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38)
      	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
      	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.lang.Thread.run(Thread.java:748)
      

      If I modify the test code to make the test method finished after waiting for in memory compacting completed, the test would fail all the time with the same error stack, and if the test method finished before the in memory compacting start, the test would always success.
      Why the test would fail after the in memory compacting completed is somewhat subtle:
      1. Because TestHStore only test HStore , so in following TestHStore.init method the new store is created separately and does not added to HRegion.stores

      private HStore init(String methodName, Configuration conf, TableDescriptorBuilder builder,
            ColumnFamilyDescriptor hcd, MyStoreHook hook, boolean switchToPread) throws IOException {
          initHRegion(methodName, conf, builder, hcd, hook, switchToPread);
          if (hook == null) {
            store = new HStore(region, hcd, conf, false);
          } else {
            store = new MyStore(region, hcd, conf, hook, switchToPread);
          }
          return store;
        }
      
      

      2.When cell is added to Store,HStore.add method would not increase HRegion.memStoreSizing, HRegion.memStoreSizing is always 0, but when in memory compaction, the cell is forced to copy to MSLAB by CellChunkImmutableSegment.copyCellIntoMSLAB in following line 218 in CellChunkImmutableSegment.reinitializeCellSet ,so the type of the cell is changed from KeyValue to NoTagByteBufferChunkKeyValue,which size increased 4 byte and HRegion.memStoreSizing is 4 :

       211  try {
       212    while ((curCell = segmentScanner.next()) != null) {
       213      assert(curCell instanceof ExtendedCell);
       214      if (((ExtendedCell)curCell).getChunkId() == ExtendedCell.CELL_NOT_BASED_ON_CHUNK) {
       215        // CellChunkMap assumes all cells are allocated on MSLAB.
       216         // Therefore, cells which are not allocated on MSLAB initially,
       217          // are copied into MSLAB here.
       218         curCell = copyCellIntoMSLAB(curCell, memstoreSizing);
       219      }
      

      3. When the region close, following HRegion.doClose finds that HRegion.memStoreSizing is not zero in line 1772 and tries to flush the region,but the HRegion.stores is empty so flushing takes no effect, when the failedfFlushCount bigger than 5, DroppedSnapshotException is thrown.

      1771       long remainingSize = this.memStoreSizing.getDataSize();
      1772        while (remainingSize > 0) {
      1773          try {
      1774            internalFlushcache(status);
      1775            if(flushCount >0) {
      1776              LOG.info("Running extra flush, " + flushCount +
      1777                  " (carrying snapshot?) " + this);
      1778            }
      1779            flushCount++;
      1780            tmp = this.memStoreSizing.getDataSize();
      1781            if (tmp >= remainingSize) {
      1782              failedfFlushCount++;
      1783            }
      1784            remainingSize = tmp;
      1785            if (failedfFlushCount > 5) {
      1786              // If we failed 5 times and are unable to clear memory, abort
      1787              // so we do not lose data
      1788              throw new DroppedSnapshotException("Failed clearing memory after " +
      1789                  flushCount + " attempts on region: " +
      1790                  Bytes.toStringBinary(getRegionInfo().getRegionName()));
      1791            }
      

      My fix is adding the new created HStore to HRegion.stores in TestHStore.init

      Attachments

        Issue Links

          Activity

            People

              comnetwork chenglei
              comnetwork chenglei
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: