Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.0-alpha-1, 2.4.8
-
None
-
Reviewed
Description
TestHStore.testCompactingMemStoreCellExceedInmemoryFlushSize sometimes is failed and the error stack is :
org.apache.hadoop.hbase.DroppedSnapshotException: Failed clearing memory after 6 attempts on region: table,,1636366699236.cc87b170ebf180005d5c4bf09d404b3d. at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1788) at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1595) at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1563) at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1528) at org.apache.hadoop.hbase.regionserver.TestHStore.tearDown(TestHStore.java:661) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.RunAfters.invokeMethod(RunAfters.java:46) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748)
If I modify the test code to make the test method finished after waiting for in memory compacting completed, the test would fail all the time with the same error stack, and if the test method finished before the in memory compacting start, the test would always success.
Why the test would fail after the in memory compacting completed is somewhat subtle:
1. Because TestHStore only test HStore , so in following TestHStore.init method the new store is created separately and does not added to HRegion.stores,
private HStore init(String methodName, Configuration conf, TableDescriptorBuilder builder, ColumnFamilyDescriptor hcd, MyStoreHook hook, boolean switchToPread) throws IOException { initHRegion(methodName, conf, builder, hcd, hook, switchToPread); if (hook == null) { store = new HStore(region, hcd, conf, false); } else { store = new MyStore(region, hcd, conf, hook, switchToPread); } return store; }
2.When cell is added to Store,HStore.add method would not increase HRegion.memStoreSizing, HRegion.memStoreSizing is always 0, but when in memory compaction, the cell is forced to copy to MSLAB by CellChunkImmutableSegment.copyCellIntoMSLAB in following line 218 in CellChunkImmutableSegment.reinitializeCellSet ,so the type of the cell is changed from KeyValue to NoTagByteBufferChunkKeyValue,which size increased 4 byte and HRegion.memStoreSizing is 4 :
211 try { 212 while ((curCell = segmentScanner.next()) != null) { 213 assert(curCell instanceof ExtendedCell); 214 if (((ExtendedCell)curCell).getChunkId() == ExtendedCell.CELL_NOT_BASED_ON_CHUNK) { 215 // CellChunkMap assumes all cells are allocated on MSLAB. 216 // Therefore, cells which are not allocated on MSLAB initially, 217 // are copied into MSLAB here. 218 curCell = copyCellIntoMSLAB(curCell, memstoreSizing); 219 }
3. When the region close, following HRegion.doClose finds that HRegion.memStoreSizing is not zero in line 1772 and tries to flush the region,but the HRegion.stores is empty so flushing takes no effect, when the failedfFlushCount bigger than 5, DroppedSnapshotException is thrown.
1771 long remainingSize = this.memStoreSizing.getDataSize(); 1772 while (remainingSize > 0) { 1773 try { 1774 internalFlushcache(status); 1775 if(flushCount >0) { 1776 LOG.info("Running extra flush, " + flushCount + 1777 " (carrying snapshot?) " + this); 1778 } 1779 flushCount++; 1780 tmp = this.memStoreSizing.getDataSize(); 1781 if (tmp >= remainingSize) { 1782 failedfFlushCount++; 1783 } 1784 remainingSize = tmp; 1785 if (failedfFlushCount > 5) { 1786 // If we failed 5 times and are unable to clear memory, abort 1787 // so we do not lose data 1788 throw new DroppedSnapshotException("Failed clearing memory after " + 1789 flushCount + " attempts on region: " + 1790 Bytes.toStringBinary(getRegionInfo().getRegionName())); 1791 }
My fix is adding the new created HStore to HRegion.stores in TestHStore.init
Attachments
Issue Links
- links to