HBase / HBASE-20090

Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
      Release Note:
      When there is a concurrent region split, MemStoreFlusher may not find a flushable region if the only candidate region left has not received any writes (resulting in a data size of 0).
      After this JIRA, such a scenario no longer triggers the Precondition assertion; it is replaced by an if statement that checks whether there is any flushable region.
      If there is no flushable region, a DEBUG message is written to the region server log, saying "Above memory mark but there is no flushable region".

    Description

      Copied the following from a comment since it describes the race condition better.
      The original description was merged into the beginning of my first comment below.

      Observed the following in the region server log (running on a hadoop3 cluster):

      2018-02-26 16:06:50,044 WARN  [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=16020] regionserver.MemStoreFlusher: Memstore is above high water mark and block 135ms
      
      2018-02-26 16:06:50,049 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: Cache flusher failed for entry org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2adfadd7
      java.lang.IllegalStateException
              at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:441)
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:174)
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$600(MemStoreFlusher.java:69)
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:237)
              at java.lang.Thread.run(Thread.java:748)
      

      Here is the Precondition from MemStoreFlusher#flushOneForGlobalPressure():

            Preconditions.checkState(
              (regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0);
      

      When the Preconditions check fails, an IllegalStateException is raised.
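
      For reference, here is a minimal standalone sketch (the values are hypothetical, chosen to match the scenario below) of why the exception carries no message text: the single-argument checkState overload throws a bare IllegalStateException when the expression evaluates to false.

            import org.apache.hbase.thirdparty.com.google.common.base.Preconditions;

            public class CheckStateDemo {
              public static void main(String[] args) {
                Object regionToFlush = new Object(); // stand-in for the candidate HRegion
                long regionToFlushSize = 0L;         // candidate has received no writes
                long bestRegionReplicaSize = 0L;     // no region replica candidate either

                // Same expression as in flushOneForGlobalPressure(); with the values above
                // it evaluates to false, so checkState throws IllegalStateException with
                // no detail message -- matching the messageless stack trace quoted above.
                Preconditions.checkState(
                    (regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0);
              }
            }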

      With more debug logging, we can see the scenario where the exception was triggered.

      2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., compaction_queue=(0:0), split_queue=1
      2018-03-02 17:28:30,098 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
      2018-03-02 17:28:30,296 INFO  [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
      2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=381.5 M
      2018-03-02 17:28:30,297 INFO  [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
      2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
      2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
      2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
      2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
      2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: Nothing to flush for atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
      2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: Excluding unflushable region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. - trying to find a different region to flush.
      

      Region 0453f29030757eedb6e6a1c57e88c085 was being split.
      In HRegion#flushcache, the log from the else branch can be seen in 20090-server-61260-01-000007.log:

            synchronized (writestate) {
              if (!writestate.flushing && writestate.writesEnabled) {
                this.writestate.flushing = true;
              } else {
                if (LOG.isDebugEnabled()) {
                  LOG.debug("NOT flushing memstore for region " + this
                      + ", flushing=" + writestate.flushing + ", writesEnabled="
                      + writestate.writesEnabled);
                }
      

      Meaning, region 0453f29030757eedb6e6a1c57e88c085 couldn't flush, leaving memory pressure at a high level.
      When MemStoreFlusher reached the following call, the region was no longer a flush candidate:

            HRegion bestFlushableRegion =
                getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
      

      So the other region, atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae., was examined next. Since that region was not receiving writes, the (current) Precondition check failed.
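
      To make the race concrete, here is a self-contained simulation (illustrative only; this is not the HBase code, and the region names and sizes are taken from the log lines above) showing how excluding the splitting region leaves a single zero-size candidate that falsifies the Precondition:

            import java.util.LinkedHashMap;
            import java.util.Map;
            import java.util.Set;

            public class FlushCandidateRace {
              // Analogue of getBiggestMemStoreRegion(): biggest non-excluded region by size.
              static Map.Entry<String, Long> biggestCandidate(Map<String, Long> regionsBySize,
                                                              Set<String> excludedRegions) {
                return regionsBySize.entrySet().stream()
                    .filter(e -> !excludedRegions.contains(e.getKey()))
                    .max(Map.Entry.comparingByValue())
                    .orElse(null);
              }

              public static void main(String[] args) {
                Map<String, Long> regionsBySize = new LinkedHashMap<>();
                regionsBySize.put("TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085.", 400432696L); // being split
                regionsBySize.put("atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.", 0L);       // no writes

                // The splitting region could not flush, so it ends up in excludedRegions.
                Set<String> excludedRegions = Set.of("TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085.");

                Map.Entry<String, Long> regionToFlush = biggestCandidate(regionsBySize, excludedRegions);
                long regionToFlushSize = regionToFlush == null ? 0L : regionToFlush.getValue();
                long bestRegionReplicaSize = 0L; // no region replica candidate in this scenario

                // The Precondition quoted earlier is now false, which is exactly the
                // IllegalStateException seen in the stack trace.
                boolean ok = (regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0;
                System.out.println("Precondition holds? " + ok); // prints: Precondition holds? false
              }
            }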

      The proposed fix is to convert the Precondition into a normal return.
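
      As a rough sketch of the shape of the change inside flushOneForGlobalPressure() (the actual patch is the attached 20090.v10.txt; the boolean return value here is an assumption about the method's contract, and the DEBUG text is taken from the release note above):

            // Before (quoted earlier in this issue):
            //   Preconditions.checkState(
            //       (regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0);
            //
            // After (sketch): replace the assertion with a DEBUG log and a normal return,
            // so the flush handler survives and simply retries on the next wakeup.
            if (!((regionToFlush != null && regionToFlushSize > 0) || bestRegionReplicaSize > 0)) {
              LOG.debug("Above memory mark but there is no flushable region");
              return false;
            }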

      Attachments

        1. 20090-server-61260-01-000007.log
          7 kB
          Ted Yu
        2. 20090.v9.txt
          1 kB
          Ted Yu
        3. 20090.v8.txt
          1 kB
          Ted Yu
        4. 20090.v7.txt
          2 kB
          Ted Yu
        5. 20090.v6.txt
          2 kB
          Ted Yu
        6. 20090.v10.txt
          2 kB
          Ted Yu
        7. 20090.v10.txt
          2 kB
          Ted Yu


          People

            Assignee: Ted Yu (yuzhihong@gmail.com)
            Reporter: Ted Yu (yuzhihong@gmail.com)
            Votes: 0
            Watchers: 9
