HBase / HBASE-2646

Compaction requests should be prioritized to prevent blocking

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.4
    • Fix Version/s: 0.90.0
    • Component/s: regionserver
    • Labels:
    • Environment:

      ubuntu server 10; hbase 0.20.4; 4 machine cluster (each machine is an 8 core xeon with 16 GB of ram and 6TB of storage); ~250 Million rows;

    • Hadoop Flags:
      Reviewed

      Description

      While testing the write capacity of a 4-machine HBase cluster we were getting long and frequent client pauses as we attempted to load the data. Looking into the problem, we'd see a relatively large compaction queue, and when a region hit the "hbase.hstore.blockingStoreFiles" limit it would block the client while its compaction request sat at the back of the queue behind many other less important compactions. The client is basically stuck at that point until the compaction is done. Prioritizing the compaction requests and letting the request that is blocking other actions go first would help solve the problem.

      You can see the problem by looking at our log files:

      You'll first see an event such as a "too many hlogs" forced flush, which will put a lot of requests on the compaction queue.

      2010-05-25 10:53:26,570 INFO org.apache.hadoop.hbase.regionserver.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 22 regions(s): responseCounts,RS_6eZzLtdwhGiTwHy,1274232223324, responses,RS_0qhkL5rUmPCbx3K-1274213057242,1274513189592, responses,RS_1ANYnTegjzVIsHW-12742177419
      21,1274511001873, responses,RS_1HQ4UG5BdOlAyuE-1274216757425,1274726323747, responses,RS_1Y7SbqSTsZrYe7a-1274328697838,1274478031930, responses,RS_1ZH5TB5OdW4BVLm-1274216239894,1274538267659, responses,RS_3BHc4KyoM3q72Yc-1274290546987,1274502062319, responses,RS_3ra9BaBMAXFAvbK-127421457
      9958,1274381552543, responses,RS_6SDrGNuyyLd3oR6-1274219941155,1274385453586, responses,RS_8AGCEMWbI6mZuoQ-1274306857429,1274319602718, responses,RS_8C8T9DN47uwTG1S-1274215381765,1274289112817, responses,RS_8J5wmdmKmJXzK6g-1274299593861,1274494738952, responses,RS_8e5Sz0HeFPAdb6c-1274288
      641459,1274495868557, responses,RS_8rjcnmBXPKzI896-1274306981684,1274403047940, responses,RS_9FS3VedcyrF0KX2-1274245971331,1274754745013, responses,RS_9oZgPtxO31npv3C-1274214027769,1274396489756, responses,RS_a3FdO2jhqWuy37C-1274209228660,1274399508186, responses,RS_a3LJVxwTj29MHVa-12742
      

      Then you see the compaction request and the "too many store files" warning as the flush is deferred:

      2010-05-25 10:53:31,364 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862/783020138 because: regionserver/192.168.0.81:60020.cacheFlusher
      2010-05-25 10:53:32,364 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862 has too many store files, putting it back at the end of the flush queue.
      

      Which leads to this:

      2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 60 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
      2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 84 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
      2010-05-25 10:53:27,065 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 1 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
      

      Once the compaction/split is done a flush is able to happen, which unblocks the IPC and allows writes to continue. Unfortunately this process can take upwards of 15 minutes (the specific case shown here from our logs took about 4 minutes).

      1. PriorityQueue-r996664.patch
        24 kB
        Jeff Whiting
      2. prioritycompactionqueue-0.20.4.patch
        27 kB
        Jeff Whiting
      3. 2646-v3.txt
        9 kB
        stack
      4. 2646-v2.txt
        35 kB
        stack
      5. 2646-fix-race-condition-r1004349.txt
        4 kB
        Jeff Whiting

        Issue Links

          • is related to HBASE-2375
          • relates to HBASE-2832
          • is related to HBASE-2981
          • relates to HBASE-3160

          Activity

          Jeff Whiting created issue -
          Jeff Whiting added a comment -

          We were able to mitigate our problem somewhat by changing the following configuration parameters:

          hbase.hregion.memstore.block.multiplier => 3
          hbase.hstore.blockingStoreFiles => 28
          hbase.hstore.compactionThreshold => 7

          Unfortunately this just puts off compaction and may make it last a lot longer. I've already begun developing a PriorityCompactionQueue to see if that could be a solution.
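
          For reference, a minimal sketch of applying those same values programmatically through the standard Hadoop/HBase Configuration API (purely illustrative; in a real deployment these would normally go in hbase-site.xml, and HBaseConfiguration.create() is the factory from later client APIs):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.hbase.HBaseConfiguration;

            public class CompactionTuningSketch {
              // Illustrative only: the same values mentioned in the comment above.
              public static Configuration tunedConf() {
                Configuration conf = HBaseConfiguration.create();
                // Let the memstore grow to 3x its flush size before updates are blocked.
                conf.setInt("hbase.hregion.memstore.block.multiplier", 3);
                // Tolerate more store files per store before flushes (and clients) block.
                conf.setInt("hbase.hstore.blockingStoreFiles", 28);
                // Require more store files before a compaction is requested at all.
                conf.setInt("hbase.hstore.compactionThreshold", 7);
                return conf;
              }
            }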

          Jonathan Gray added a comment -

          Hey Jeff. Prioritized compactions and a bunch of other really good compaction/split improvements are on the way for the next major hbase release.

          The work started in HBASE-2375 but a remaining bug in the patch from that issue has led to a larger refactoring.

          Jeff Whiting added a comment -

          Here is my first go at a patch to prioritize compaction requests. There are currently three priority levels a request can take: LOW, NORMAL, and HIGH_BLOCKING. Right now the only request that gets a HIGH_BLOCKING priority is when the memstore cannot do a flush because it has too many hstore files. All other requests are NORMAL; LOW is currently unused.

          One thing I really like about the patch is that it abstracts everything about the queue. CompactSplitThread no longer has to maintain both the queue and the hashset, as all of that is now handled by the PriorityCompactionQueue. It now only has to put regions on and take them off, and that is it.

          PriorityCompactionQueue basically has two modes of operation. The default mode is to always give the higher priority compaction requests precedence. The only downside is that this could lead to starvation of lower priority requests (although if there is more important work to be done, shouldn't we just be doing that?). The second mode prevents the starvation by allowing a request to raise its priority after it has been in the queue for a specified amount of time. For example, with a 10 second priority elevation time, a LOW priority request would be elevated to a NORMAL priority request after 10 seconds. This parameter can be tuned with the hbase.regionserver.thread.priorityElevationTime configuration parameter (a value of -1 means it uses the first mode).
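
          To make the ordering concrete, here is a minimal, self-contained sketch of the default mode only (strict priority, FIFO within a level). It is not the code from the attached patch, the class names are invented, and the elevation-time mode is omitted because it requires re-evaluating the priority of already-queued requests:

            import java.util.concurrent.PriorityBlockingQueue;
            import java.util.concurrent.atomic.AtomicLong;

            enum CompactionPriority { HIGH_BLOCKING, NORMAL, LOW }   // ordinal 0 = most urgent

            class PrioritizedRequest implements Comparable<PrioritizedRequest> {
              private static final AtomicLong SEQ = new AtomicLong();
              final String regionName;
              final CompactionPriority priority;
              final long seq = SEQ.incrementAndGet();                // preserves FIFO within a level

              PrioritizedRequest(String regionName, CompactionPriority priority) {
                this.regionName = regionName;
                this.priority = priority;
              }

              public int compareTo(PrioritizedRequest other) {
                int byPriority = Integer.compare(priority.ordinal(), other.priority.ordinal());
                return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
              }
            }

            class PriorityCompactionQueueSketch {
              private final PriorityBlockingQueue<PrioritizedRequest> queue =
                  new PriorityBlockingQueue<PrioritizedRequest>();

              void requestCompaction(String regionName, CompactionPriority priority) {
                queue.add(new PrioritizedRequest(regionName, priority));
              }

              PrioritizedRequest take() throws InterruptedException {
                return queue.take();                                 // always the most urgent request
              }
            }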

          The patch is against the 0.20.4 tag and I've included two new files, PriorityCompactionQueue.java and TestPriorityCompactionQueue.java. Those two files are the new compaction queue and a unit test for it. In addition I made all the necessary changes to the CompactSplitThread and MemStoreFlusher.

          We've tested this patch in our environment with great results. We were able to lower our parameters to the following with no client pauses:

          hbase.hregion.memstore.block.multiplier => 2
          hbase.hstore.blockingStoreFiles => 2
          hbase.hstore.compactionThreshold => 4

          Jeff Whiting made changes -
          Attachment prioritycompactionqueue-0.20.4.patch [ 12446048 ]
          Jeff Whiting added a comment -

          That is great news to hear about optimizations being made on the compaction queue. Looking at HBASE-2375 I see a lot of improvement in how compaction is done and more intelligence about when to compact and split, but I don't see any prioritization of the compaction requests (I may be overlooking it). Maybe some of the improvements in the patch here would integrate nicely with what is already being done? It seems that whenever an action is blocking on a compaction, that request should take precedence. But that is just my two cents.

          Jonathan Gray added a comment -

          This looks good and is very similar to the design I was going with. The patches up in HBASE-2375 don't do prioritized queues but that was part of the redesign that I am currently working on that grew out of that jira. There is a new prioritized executor service that will be going in as part of the master rewrite in trunk that I was planning on using for this.

          My implementation does not have a LOW priority; I'm not sure where that would make sense.

          And for preventing starvation, I would agree with your question that this doesn't actually make sense, though it is a cool feature that might make sense in a different context. Any request for a high priority compaction is assumed to be under the premise that it is the most important thing to happen. The desired behavior would be to complete it before allowing any lower priority compactions to complete. If a system is starved for resources, we would still prefer to let high priority compactions complete before any low priority ones. With a poorly set elevation time you would end up with everything at one priority, defeating the purpose.

          This is too big of a change for the 0.20 line but definitely something we should do in trunk. Can we put this jira on hold for another week or two until I get back to working on compaction improvements? I'd like to get some of the new master changes in before rebasing this on trunk.

          Thanks for all the work, this looks great.

          Jonathan Gray made changes -
          Fix Version/s 0.21.0 [ 12313607 ]
          Priority Major [ 3 ] Critical [ 2 ]
          Jeff Whiting added a comment -

          Your plan sounds good. There is a lot going on with the master rewrite and a lot of variables up in the air, so it seems good to push this out to 0.21, although I plan to keep using it in my environment.

          I agree that in a situation where there aren't enough resources we should just focus on high priority items and not worry about starvation of low priority items. I also agree that the priorityElevationTime would be a tricky parameter to mess with and could make the priority queue worthless. Besides, if a low priority item somehow becomes critical it will naturally be re-requested with a higher priority. As far as the LOW priority goes, I don't really know when it would be used. I included it to provide a more complete implementation, thinking that someone else might have a good reason to use it.

          Here is a random thought on when a low priority compaction would make sense. Let's say the region just barely went above "hbase.hstore.compactionThreshold" (t) but was still a long way off from "hbase.hstore.blockingStoreFiles" (b), so it would request a compaction but at low priority. Once it got halfway between compactionThreshold and blockingStoreFiles, (b + t) / 2, it would re-request the compaction at a normal priority. I'm not really sure how beneficial this would be, but it's the only thing coming to mind as to why you'd use a low priority.
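
          A hypothetical helper expressing that heuristic, reusing the CompactionPriority enum from the sketch above (none of this is in the attached patch):

            class PriorityHeuristicSketch {
              // Derive a priority from the current store file count, where
              // t = hbase.hstore.compactionThreshold and b = hbase.hstore.blockingStoreFiles.
              static CompactionPriority priorityFor(int storeFiles, int t, int b) {
                if (storeFiles >= b) {
                  return CompactionPriority.HIGH_BLOCKING;  // flushes and clients are already blocked
                } else if (storeFiles >= (b + t) / 2) {
                  return CompactionPriority.NORMAL;         // halfway to blocking: re-request at normal
                } else if (storeFiles > t) {
                  return CompactionPriority.LOW;            // just over the compaction threshold
                }
                return null;                                // below threshold: no compaction needed
              }
            }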

          Jonathan Gray added a comment -

          I think doing something that complex is asking for trouble. In most cases, you'll want things that are not high priority to be processed in the order they were submitted. Assuming a random distribution of writes, I think you would most often end up with the same ordering of compactions with or without the low priority.

          In reality, you're either not saturating IO, in which case none of this is really necessary (though generally you'd want high/low priority so that if anything blocking did happen it would go first). Or you're saturated and your compaction queue is building. In this case you want to process anything high priority (blocking) before you do anything low priority (maintenance). I think everything we do falls into one of those buckets. If something goes into the low-pri queue when it hits the threshold, you expect it to move up in that queue as time goes on, thereby naturally moving up when it will happen. There are no free lunches, so making it a medium priority instead of low will not give you any new IO. You still generally want high pri first, then low pri in order of submission.

          I do think it could make sense in other use cases just think this is a bit too complex for compactions.

          Also, I'm planning on extending the same prio queue idea to flushes. The other part of the change is to allow concurrent compactions and flushes.

          Jeff Whiting added a comment -

          I agree that for compactions having more than 2 priorities is overkill and is bound to have problems. In the general case having a low priority for a priority queue would make sense. I'm glad to hear about the concurrent compaction and flushes as that would be a huge improvement.

          Also, has there been any talk of multiple compaction threads per region server? That way multiple regions could be compacted in parallel. You'd have to move the compaction queue to a more global location and have all the threads pull from one source. On lower end machines running compactions in parallel may not make sense, but for higher end machines it seems like it could pay dividends. Looking at my cluster (8 core Xeons, 16GB RAM, JBOD 3 x 2TB) we almost always have some resources available to do parallel compactions.
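
          A rough sketch of that idea, again built on the hypothetical queue classes above rather than anything that was committed: several worker threads on one region server all drain a single shared priority queue, so multiple regions can be compacted in parallel on machines with spare disk and CPU.

            class ParallelCompactorSketch {
              private final PriorityCompactionQueueSketch queue = new PriorityCompactionQueueSketch();

              void start(int workers) {                     // e.g. 1 on a single-disk node, more on JBOD
                for (int i = 0; i < workers; i++) {
                  Thread t = new Thread(new Runnable() {
                    public void run() {
                      try {
                        while (!Thread.currentThread().isInterrupted()) {
                          PrioritizedRequest r = queue.take();   // blocks until a request is available
                          compact(r.regionName);                 // placeholder for the real compaction
                        }
                      } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();      // shut down quietly
                      }
                    }
                  }, "compaction-worker-" + i);
                  t.setDaemon(true);
                  t.start();
                }
              }

              private void compact(String regionName) { /* run the actual compaction here */ }
            }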

          Jonathan Gray added a comment -

          That is what I am referring to by concurrent compactions and flushes. I didn't mean compactions concurrent with flushes, I meant multiple compactions and flushes running concurrently.

          Reasoning is exactly as you describe. There are some clusters being built with 4, 8, even 12 disks per node. This would be configurable and single disk nodes would probably have it set to 1.

          Jeff Hammerbacher made changes -
          Link This issue is related to HBASE-2375 [ HBASE-2375 ]
          stack added a comment -

          This is a nice looking patch (just looked at it). Tempted to put it into 0.20.5 but it's too risky a change.

          Jonathan Gray made changes -
          Link This issue relates to HBASE-2832 [ HBASE-2832 ]
          Todd Lipcon added a comment -

          Hey Jeff. Thanks for pointing to this issue on the list. Any chance you'd be willing to update it to trunk like you suggested?

          I agree that we're better with this than nothing, while we wait on Jonathan's fancier patch. (unless you're pretty much done, jgray?)

          Todd Lipcon made changes -
          Link This issue is related to HBASE-2981 [ HBASE-2981 ]
          Jeff Whiting added a comment -

          A patch for this fix against trunk at revision 996664.

          Jeff Whiting made changes -
          Attachment PriorityQueue-r996664.patch [ 12454478 ]
          Jeff Whiting added a comment -

          I'm not sure how long we'll actually need the code before the concurrent compaction refactor goes in, but here it is. I think it is a solid improvement that would only be superseded by concurrent priority compactions.

          As per the discussion in this jira I've removed the priorityElevationTime feature and simplified the code.

          Andrew Purtell added a comment -

          Target for branch 0.20 also.

          Andrew Purtell made changes -
          Fix Version/s 0.20.7 [ 12315195 ]
          stack added a comment -

          Something wrong w/ original patch and it was a little unorthodox anyways...Making it apply to TRUNK. Not done yet. Want to make the test junit4 too.......

          stack made changes -
          Attachment 2646-v2.txt [ 12455780 ]
          stack added a comment -

          Oh, sorry, yeah, I'm trying to patch this in. It looks great.

          stack added a comment -

          Here's what I'm applying. It's formatting and a change of the test from junit3 to junit4 (nice test Jeff).

          stack made changes -
          Attachment 2646-v3.txt [ 12455798 ]
          stack added a comment -

          Committed to TRUNK. Leaving issue open in case we do a 0.20.7; if we do we can commit it there too.

          Thanks for the patch Jeff and thanks for your patience w/ our slow application of it.

          stack made changes -
          Fix Version/s 0.90.0 [ 12313607 ]
          Jonathan Gray added a comment -

          FYI v3 patch is not complete (contains HBASE-3038 and does not contain the two new files)

          Ted Yu added a comment -

          Summarizing discussion with Jeff:
          Ted: In PriorityCompactionQueue.addToRegionsInQueue(), I noticed the following call which is not synchronized:
          queue.remove(queuedRequest);

          Now suppose before the above is executed, PriorityCompactionQueue.take() is called. So queuedRequest is returned to the caller of take(). Later, this line in take():
          removeFromRegionsInQueue(cr.getHRegion());
          would remove the newly added, higher priority request from regionsInQueue.

          Jeff:
          That is an astute observation. Stepping through the code with the threads stopping execution at the points in code you suggest would indeed make it so take() would return the lower priority compactionRequest, remove the higher priority compaction request from regionsInQueue, and finally the add() method would complete and add the higher priority compaction onto the queue with no corresponding entry in the regionsInQueue hash (this is bad). Even if I move the queue.remove(queuedRequest) into the synchronized(regionsInQueue) we will run into the same problem.

          Fortunately the worst thing that can happen is a request that doesn't have an entry in regionsInQueue, which will eventually be executed with no adverse effect on the system other than extra work. This won't actually cause any problems for the system, but PriorityCompactionQueue will have an inconsistent state which should be fixed. An immediate solution is not jumping out at me, so I need to think through the problem and see if I can't come up with a way to prevent the inconsistency.

          Ted:
          Except for remove(Object r), all callers of removeFromRegionsInQueue() have CompactionRequest information.
          So CompactionRequest, cr, can be passed to removeFromRegionsInQueue() so that we can perform some sanity check.
          If cr.getPriority() is lower than the priority of the CompactionRequest currently in regionsInQueue, removeFromRegionsInQueue() can return null which indicates inconsistency.
          The caller can discard cr upon seeing null from removeFromRegionsInQueue() and try to get the next request from queue.

          The above avoids introducing another synchronization between accesses to queue and regionsInQueue.

          Jeff:
          I was thinking along the same lines. Adding an additional synchronization didn't seem like the right approach. So if we make sure we are taking off what we are expecting to, then there won't be a problem.
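
          One way to express that sanity check, sketched against the simplified types from the earlier example (the real fix is in the attached 2646-fix-race-condition-r1004349.txt, so the names and details here are assumptions): take() only clears the regionsInQueue entry when it is exactly the request that was just pulled off the queue.

            import java.util.HashMap;
            import java.util.Map;

            class RegionsInQueueSketch {
              // Maps a region to the request we believe is currently queued for it.
              private final Map<String, PrioritizedRequest> regionsInQueue =
                  new HashMap<String, PrioritizedRequest>();

              // Remove the tracked entry only if it is the request we just took off the queue;
              // otherwise return null so the caller knows the map held a different (e.g. newly
              // added, higher priority) request and leaves it in place.
              PrioritizedRequest removeFromRegionsInQueue(String regionName, PrioritizedRequest expected) {
                synchronized (regionsInQueue) {
                  PrioritizedRequest queued = regionsInQueue.get(regionName);
                  if (queued == expected) {                 // identity check: same request object
                    return regionsInQueue.remove(regionName);
                  }
                  return null;                              // not what we expected; leave it alone
                }
              }
            }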

          stack added a comment -

          On commit, I'd added missing files and then afterward backed out hbase-3038.

          Jeff Whiting added a comment -

          As per my discussion with Ted Yu, here is a patch that addresses the potential race condition with queue.remove(queuedRequest). It does a sanity check to make sure that what I'm removing from regionsInQueue is what I expect; otherwise I leave it on. The patch views this.queue as the authoritative source on which requests to run and will always return whatever was removed from the queue.

          Jeff Whiting made changes -
          Attachment 2464-fix-race-condition-r1004349.txt [ 12456303 ]
          Jeff Whiting made changes -
          Attachment 2464-fix-race-condition-r1004349.txt [ 12456303 ]
          Jeff Whiting added a comment -

          Incorrectly named the attachment.

          Jeff Whiting made changes -
          Attachment 2646-fix-race-condition-r1004349.txt [ 12456304 ]
          stack added a comment -

          Thank you Jeff. I applied your patch. Best to get it in now before we start in on 0.90 stabilizations. Thanks for persisting (and Ted for review).

          Jeff Hammerbacher made changes -
          Link This issue relates to HBASE-3160 [ HBASE-3160 ]
          stack made changes -
          Fix Version/s 0.20.7 [ 12315195 ]
          Otis Gospodnetic added a comment -

          Should Fix Version/s be set for this one, so it doesn't get missed? Looks important and ready with a patch.

          Ted Yu added a comment -

          This has been integrated - PriorityCompactionQueue is used by CompactSplitThread now.

          stack added a comment -

          This was applied a while back. Resolving. Thanks for the patch Jeff (Assigned it to you).

          stack made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Assignee Jeff Whiting [ whitingj ]
          Fix Version/s 0.90.0 [ 12313607 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              Jeff Whiting
              Reporter:
              Jeff Whiting
            • Votes:
              0
              Watchers:
              4
