Details
Type: Bug
Priority: Major
Status: Resolved
Resolution: Fixed
Hadoop Flags: Reviewed
Description
Copied the following from a comment since it gives a better description of the race condition.
The original description was merged into the beginning of my first comment below.
Observed the following in the region server log (running on a hadoop3 cluster):
2018-02-26 16:06:50,044 WARN [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=16020] regionserver.MemStoreFlusher: Memstore is above high water mark and block 135ms
2018-02-26 16:06:50,049 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: Cache flusher failed for entry org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2adfadd7
java.lang.IllegalStateException
  at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:441)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:174)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$600(MemStoreFlusher.java:69)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:237)
  at java.lang.Thread.run(Thread.java:748)
Here is the Precondition from MemStoreFlusher#flushOneForGlobalPressure():
Preconditions.checkState(
    (regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0);
When the Preconditions check fails, an IllegalStateException is raised.
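For reference, Guava's checkState(boolean) throws a message-less IllegalStateException when its argument is false, which matches the bare exception line in the stack trace above. A minimal standalone demonstration (using stock Guava rather than the org.apache.hbase.thirdparty relocation; class and variable names here are made up for illustration):

import com.google.common.base.Preconditions;

public class CheckStateDemo {
  public static void main(String[] args) {
    // Both clauses of the guard are false, mirroring the failure above.
    boolean regionHasFlushableData = false;  // stands in for regionToFlush != null && regionToFlushSize > 0
    boolean replicaHasFlushableData = false; // stands in for bestRegionReplicaSize > 0
    // Throws java.lang.IllegalStateException with no message.
    Preconditions.checkState(regionHasFlushableData || replicaHasFlushableData);
  }
}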
With more debug logging, we can see the scenario in which the exception was triggered:
2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., compaction_queue=(0:0), split_queue=1
2018-03-02 17:28:30,098 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
2018-03-02 17:28:30,296 INFO [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=381.5 M
2018-03-02 17:28:30,297 INFO [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARK Total Memstore Heap size=381.9 M Total Memstore Off-Heap size=0, Region memstore size=0
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Nothing to flush for atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Excluding unflushable region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. - trying to find a different region to flush.
Region 0453f29030757eedb6e6a1c57e88c085 was being split.
In HRegion#flushcache, the log from the else branch can be seen in 20090-server-61260-01-000007.log:
synchronized (writestate) {
  if (!writestate.flushing && writestate.writesEnabled) {
    this.writestate.flushing = true;
  } else {
    if (LOG.isDebugEnabled()) {
      LOG.debug("NOT flushing memstore for region " + this
          + ", flushing=" + writestate.flushing
          + ", writesEnabled=" + writestate.writesEnabled);
    }
Meaning, region 0453f29030757eedb6e6a1c57e88c085 couldn't be flushed, leaving memory pressure at a high level.
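The decline can be modeled in isolation (a simplified sketch, not actual HBase code, under the assumption that closing the parent region for the split clears writesEnabled; class and method names here are invented for illustration):

public class WriteStateDemo {
  // Simplified stand-in for HRegion's write state.
  static class WriteState {
    boolean flushing = false;
    boolean writesEnabled = true;
  }

  static boolean requestFlush(WriteState writestate) {
    synchronized (writestate) {
      if (!writestate.flushing && writestate.writesEnabled) {
        writestate.flushing = true; // flush proceeds
        return true;
      }
      return false; // the "NOT flushing memstore for region ..." path
    }
  }

  public static void main(String[] args) {
    WriteState writestate = new WriteState();
    writestate.writesEnabled = false; // region is closing for the split
    System.out.println(requestFlush(writestate)); // prints false: flush declined, pressure remains
  }
}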
When MemStoreFlusher reached the following call, the region was no longer a flush candidate:
HRegion bestFlushableRegion =
getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
So the other region, atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae., was examined next. Since that region was not receiving writes, the (current) Precondition check failed.
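Plugging the logged values into the guard makes the failure concrete (a reconstruction using the variable names from the snippet above, not a quote of the actual code):

// TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. (memstore size
// 400432696) is mid-split, so its flush is declined and it is excluded.
// atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. is the only
// remaining candidate:
//   regionToFlush         != null  -> true
//   regionToFlushSize     == 0     -> first clause is false
//   bestRegionReplicaSize == 0     -> second clause is false
// Hence:
Preconditions.checkState(
    (regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0);
// => java.lang.IllegalStateException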
The proposed fix is to convert the Precondition into a normal return.
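One way to express that (a sketch of the idea, not necessarily the exact committed patch) is to replace the checkState in flushOneForGlobalPressure() with an early return, so the flush handler skips this round instead of dying on an IllegalStateException:

// Sketch: give up quietly when no region qualifies for a flush.
if (!((regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0)) {
  // Every candidate is excluded or has an empty memstore, e.g. the
  // biggest region is mid-split with writes disabled.
  LOG.debug("Above memory mark but there are no flushable regions; skipping this round");
  return false;
}

Assuming flushOneForGlobalPressure()'s boolean return indicates whether a flush was queued, returning false simply lets the flusher retry on its next wakeup.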