HBase
  HBASE-2646

Compaction requests should be prioritized to prevent blocking

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.4
    • Fix Version/s: 0.90.0
    • Component/s: regionserver
    • Labels:
    • Environment:

      ubuntu server 10; hbase 0.20.4; 4 machine cluster (each machine is an 8 core xeon with 16 GB of ram and 6TB of storage); ~250 Million rows;

    • Hadoop Flags:
      Reviewed

      Description

      While testing the write capacity of a 4 machine hbase cluster we were getting long and frequent client pauses as we attempted to load the data. Looking into the problem, we'd see a relatively large compaction queue, and when a region hit the "hbase.hstore.blockingStoreFiles" limit it would block the client while its compaction request was put at the back of the queue, waiting behind many other less important compactions. The client is basically stuck at that point until that compaction is done. Prioritizing the compaction requests and allowing the request that is blocking other actions to go first would help solve the problem.
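
      The idea, then, is to make the compaction queue priority-aware so that a request for a region that is blocking writes jumps ahead of routine requests (see the attached prioritycompactionqueue patch for the actual proposal). Below is a minimal sketch of that idea using a standard java.util.concurrent.PriorityBlockingQueue; the CompactionRequest class, names, and numeric priorities here are illustrative only, not the patch code:

      import java.util.concurrent.PriorityBlockingQueue;

      // Illustrative stand-in for a compaction request; not the classes from the patch.
      class CompactionRequest implements Comparable<CompactionRequest> {
        final String regionName;
        final int priority; // lower value = more urgent, e.g. 0 when the region is blocking writes

        CompactionRequest(String regionName, int priority) {
          this.regionName = regionName;
          this.priority = priority;
        }

        @Override
        public int compareTo(CompactionRequest other) {
          return Integer.compare(this.priority, other.priority);
        }
      }

      public class PriorityCompactionQueueSketch {
        public static void main(String[] args) {
          PriorityBlockingQueue<CompactionRequest> queue = new PriorityBlockingQueue<>();

          // Routine requests queued earlier (e.g. by "too many hlogs" forced flushes)...
          queue.offer(new CompactionRequest("responses,RS_0qhk...", 10));
          queue.offer(new CompactionRequest("responses,RS_1ANY...", 10));

          // ...and a request for a region that has hit hbase.hstore.blockingStoreFiles
          // and is blocking client writes: it should be served first.
          queue.offer(new CompactionRequest("responses-index,...", 0));

          // Prints the blocking region's name even though it was enqueued last.
          System.out.println("Next to compact: " + queue.poll().regionName);
        }
      }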

      You can see the problem by looking at our log files:

      You'll first see an event such as a "too many hlogs" forced flush, which puts a lot of requests on the compaction queue.

      2010-05-25 10:53:26,570 INFO org.apache.hadoop.hbase.regionserver.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 22 regions(s): responseCounts,RS_6eZzLtdwhGiTwHy,1274232223324, responses,RS_0qhkL5rUmPCbx3K-1274213057242,1274513189592, responses,RS_1ANYnTegjzVIsHW-12742177419
      21,1274511001873, responses,RS_1HQ4UG5BdOlAyuE-1274216757425,1274726323747, responses,RS_1Y7SbqSTsZrYe7a-1274328697838,1274478031930, responses,RS_1ZH5TB5OdW4BVLm-1274216239894,1274538267659, responses,RS_3BHc4KyoM3q72Yc-1274290546987,1274502062319, responses,RS_3ra9BaBMAXFAvbK-127421457
      9958,1274381552543, responses,RS_6SDrGNuyyLd3oR6-1274219941155,1274385453586, responses,RS_8AGCEMWbI6mZuoQ-1274306857429,1274319602718, responses,RS_8C8T9DN47uwTG1S-1274215381765,1274289112817, responses,RS_8J5wmdmKmJXzK6g-1274299593861,1274494738952, responses,RS_8e5Sz0HeFPAdb6c-1274288
      641459,1274495868557, responses,RS_8rjcnmBXPKzI896-1274306981684,1274403047940, responses,RS_9FS3VedcyrF0KX2-1274245971331,1274754745013, responses,RS_9oZgPtxO31npv3C-1274214027769,1274396489756, responses,RS_a3FdO2jhqWuy37C-1274209228660,1274399508186, responses,RS_a3LJVxwTj29MHVa-12742
      

      Then you see the compaction request and the "too many store files" warning:

      2010-05-25 10:53:31,364 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862/783020138 because: regionserver/192.168.0.81:60020.cacheFlusher
      2010-05-25 10:53:32,364 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862 has too many store files, putting it back at the end of the flush queue.
      

      Which leads to this:

      2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 60 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
      2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 84 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
      2010-05-25 10:53:27,065 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 1 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
      

      Once the compaction / split is done, a flush is able to happen, which unblocks the IPC and allows writes to continue. Unfortunately this process can take upwards of 15 minutes (the specific case shown here from our logs took about 4 minutes).
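
      One natural way to rank the requests, sketched below purely as an illustration (the helper name and numbers are hypothetical, not necessarily what the attached patches do), is to derive urgency from how close a store is to the "hbase.hstore.blockingStoreFiles" limit, so that a region which is already blocking flushes and writes sorts ahead of routine requests:

      // Hypothetical helper (not the committed code): derive a request's urgency from how
      // close a store is to the hbase.hstore.blockingStoreFiles limit.
      public class CompactionPrioritySketch {
        static int priorityFor(int storeFileCount, int blockingStoreFiles) {
          // Fewer files left before the blocking limit => smaller value => more urgent.
          // At or past the limit the result drops to 0 or below: "blocking, run first".
          return blockingStoreFiles - storeFileCount;
        }

        public static void main(String[] args) {
          int blockingStoreFiles = 7; // example threshold; the real value comes from the config
          System.out.println(priorityFor(3, blockingStoreFiles)); // 4 -> routine request
          System.out.println(priorityFor(7, blockingStoreFiles)); // 0 -> region is blocking, compact first
        }
      }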

      Attachments

      1. 2646-fix-race-condition-r1004349.txt
        4 kB
        Jeff Whiting
      2. 2646-v2.txt
        35 kB
        stack
      3. 2646-v3.txt
        9 kB
        stack
      4. prioritycompactionqueue-0.20.4.patch
        27 kB
        Jeff Whiting
      5. PriorityQueue-r996664.patch
        24 kB
        Jeff Whiting

        Issue Links

          is related to: HBASE-2375, HBASE-2981
          relates to: HBASE-2832, HBASE-3160

          Activity

          Jeff Whiting created issue -
          Jeff Whiting made changes -
          Field Original Value New Value
          Description [description text edited only to correct the closing {/noformat} markers to {noformat}; the wording is otherwise identical to the Description above]
          Jeff Whiting made changes -
          Attachment prioritycompactionqueue-0.20.4.patch [ 12446048 ]
          Jonathan Gray made changes -
          Fix Version/s 0.21.0 [ 12313607 ]
          Priority Major [ 3 ] Critical [ 2 ]
          Jeff Hammerbacher made changes -
          Link This issue is related to HBASE-2375 [ HBASE-2375 ]
          Jonathan Gray made changes -
          Link This issue relates to HBASE-2832 [ HBASE-2832 ]
          Todd Lipcon made changes -
          Link This issue is related to HBASE-2981 [ HBASE-2981 ]
          Jeff Whiting made changes -
          Attachment PriorityQueue-r996664.patch [ 12454478 ]
          Andrew Purtell made changes -
          Fix Version/s 0.20.7 [ 12315195 ]
          stack made changes -
          Attachment 2646-v2.txt [ 12455780 ]
          stack made changes -
          Attachment 2646-v3.txt [ 12455798 ]
          stack made changes -
          Fix Version/s 0.90.0 [ 12313607 ]
          Jeff Whiting made changes -
          Attachment 2464-fix-race-condition-r1004349.txt [ 12456303 ]
          Jeff Whiting made changes -
          Attachment 2464-fix-race-condition-r1004349.txt [ 12456303 ]
          Jeff Whiting made changes -
          Attachment 2646-fix-race-condition-r1004349.txt [ 12456304 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HBASE-3160 [ HBASE-3160 ]
          stack made changes -
          Fix Version/s 0.20.7 [ 12315195 ]
          stack made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Assignee Jeff Whiting [ whitingj ]
          Fix Version/s 0.90.0 [ 12313607 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              Jeff Whiting
            • Reporter:
              Jeff Whiting
            • Votes:
              0
            • Watchers:
              4
