HBASE-2375

Revisit compaction configuration parameters

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.20.3
    • Fix Version/s: None
    • Component/s: regionserver

      Description

      Currently we will make the decision to split a region when a single StoreFile in a single family exceeds the maximum region size. This issue is about changing the decision to split to be based on the aggregate size of all StoreFiles in a single family (but still not aggregating across families). This would move a check to split after flushes rather than after compactions. This issue should also deal with revisiting our default values for some related configuration parameters.

      The motivating factor for this change comes from watching the behavior of RegionServers during heavy write scenarios.

      Today the default behavior goes like this:

      • We fill up regions, and as long as you are not under global RS heap pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
      • After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a compaction on this region.
      • Compaction queues notwithstanding, this will create a 192MB file, not triggering a split based on max region size (hbase.hregion.max.filesize).
      • You'll then flush two more 64MB MemStores and hit the compactionThreshold and trigger a compaction.
      • You end up with 192 + 64 + 64 in a single compaction. This will create a single 320MB file and will trigger a split.
      • While you are performing the compaction (which now writes out 64MB more than the split size, so takes about 5X as long as a single flush), you are still taking on additional writes into MemStore.
      • Compaction finishes, decision to split is made, region is closed. The region now has to flush whichever edits made it to MemStore while the compaction ran. This flushing, in our tests, is by far the dominating factor in how long data is unavailable during a split. We measured about 1 second to do the region closing, master assignment, reopening. Flushing could take 5-6 seconds, during which time the region is unavailable.
      • The daughter regions re-open on the same RS. Immediately when the StoreFiles are opened, a compaction is triggered across all of their StoreFiles because they contain references. Since we cannot currently split a split, we need to not hang on to these references for long.

      This described behavior is really bad because of how often we have to rewrite data onto HDFS. Imports are usually just IO bound as the RS waits to flush and compact. In the above example, the first cell to be inserted into this region ends up being written to HDFS 4 times (initial flush, first compaction w/ no split decision, second compaction w/ split decision, third compaction on daughter region). In addition, we leave a large window where we take on edits (during the second compaction of 320MB) and then must make the region unavailable as we flush it.

      If we increased the compactionThreshold to be 5 and determined splits based on aggregate size, the behavior becomes:

      • We fill up regions, and as long as you are not under global RS heap pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
      • After each MemStore flush, we calculate the aggregate size of all StoreFiles. We can also check the compactionThreshold. For the first three flushes, neither check would hit its limit. On the fourth flush, we would see a total aggregate size of 256MB and decide to split.
      • Decision to split is made, region is closed. This time, the region just has to flush out whichever edits made it to the MemStore during the snapshot/flush of the previous MemStore. So this time window has shrunk by more than 75%, since it is now the time to write 64MB from memory rather than 320MB from aggregating 5 HDFS files. This will greatly reduce the time data is unavailable during splits.
      • The daughter regions re-open on the same RS. Immediately when the StoreFiles are opened, a compaction is triggered across all of their StoreFiles because they contain references. This would stay the same.

      In this example, we only write a given cell twice (instead of 4 times): on the original flush, and post-split to remove references. This also drastically reduces data unavailability during splits. The other benefit of post-split compaction (which doesn't change) is that we then get good data locality as the resulting StoreFile will be written to the local DataNode. In another jira, we should deal with opening up one of the daughter regions on a different RS to distribute load better, but that's outside the scope of this one.

      Attachments

      1. HBASE-2375-flush-split.patch
        0.9 kB
        Jean-Daniel Cryans
      2. HBASE-2375-v8.patch
        22 kB
        Jonathan Gray


          Activity

          stack added a comment -

          Split early has been committed too.

          All that remains of this issue is upping default compaction threshold.

          Jean-Daniel Cryans added a comment -

          Changing this jira's title to reflect that the first part of it is already done and the only work remaining is revisiting the compaction configurations. I created HBASE-5393 for the split after flush patch.

          stack added a comment -

          +1 on patch

          Make a new issue to change compactionThreshold to at least 4, I'd say, and then you can close out this one?

          Jean-Daniel Cryans added a comment -

          This patch adds a check for split after flushing. I've been running this with write heavy tests and it makes you skip at least one compaction before splitting.

          stack added a comment -

          Upping compactionThreshold from 3 to 5... Sounds like a change that can have a bigger impact but that mostly helps this specific use case...

          Dunno. 3 strikes me as one of those decisions that made sense a long time ago but a bunch has changed since... We should test it I suppose.

          On changing flush/regionsize, you'd rather have us split faster at first, then slow down as the count of regions goes up. Ok.

          Jean-Daniel Cryans added a comment -

          Doing first bullet point only sounds good. Let's file issues for the other suggestions.

          Kewl.

          Upping compactionThreshold from 3 to 5, where 5 is greater than the number of flushes it would take to make us splittable; i.e. the intent is no compaction before the first split.

          Sounds like a change that can have a bigger impact but that mostly helps this specific use case...

          Instead up the compactionThreshold and down the default regionsize from 1G to 512M and keep flush at 128M?

          I'd rather split earlier for the first regions.

          stack added a comment -

          Doing first bullet point only sounds good. Let's file issues for the other suggestions.

          What about the other recommendations made up in the issue regarding compactionThreshold?

          Upping compactionThreshold from 3 to 5, where 5 is greater than the number of flushes it would take to make us splittable; i.e. the intent is no compaction before the first split.

          Should we do this too as part of this issue? We could make our flush size 256M and compactionThreshold 5. Or perhaps that's too rad (that's a big Map to be carrying around)? Instead up the compactionThreshold and down the default regionsize from 1G to 512M and keep flush at 128M?

          I took a look at the patch and it's pretty stale now given the changes that have gone in since.

          Jean-Daniel Cryans added a comment -

          A bunch of things changed since this jira was created:

          • we now split based on the store size
          • regions split at 1GB
          • memstores flush at 128MB
          • there's been a lot of work on tuning the store file selection algorithm

          My understanding of this jira is that it aims at making the "out of the box mass import" experience better. Now that we have bulk loads and pre-splitting this use case is becoming less and less important... although we still see people trying to benchmark it (hi hypertable).

          I see three things we could do:

          • Trigger splits after flushes; I hacked a patch and it works awesomely.
          • Have a lower split size for newly created tables. Hypertable does this with a soft limit that gets doubled every time the table splits until it reaches the normal split size.
          • Have multi-way splits (Todd's idea), so that if you have enough data that you know you're going to be splitting again after the current split, just spawn as many daughters as you need.

          I'm planning on just fixing the first bullet point in the context of this jira. Maybe there's other stuff from the patch in this jira that we could fit in.

          Jonathan Gray added a comment -

          Punting from 0.92. Still needs to be done but should not be tied to a version until work is being actively done.

          stack added a comment -

          HBASE-3312 has an interesting pathological case that this issue is supposed to fix. Linking to it so we don't forget.

          Jonathan Gray added a comment -

          Punting to 0.92 for now. The bigger compaction/flush improvements should happen in that version.

          Jean-Daniel Cryans added a comment -

          Do we have enough time to do this if 0.90.0 is due for HW? Punt?

          Jeff Whiting added a comment -

          The optimizations here look great. It seems like an additional optimization could still be made. Looking at the patch, there doesn't seem to be any prioritization of compaction requests. So if you have a region server that is in charge of a large number of regions the compaction queue can still get quite large and prevent more important compactions from happening in a timely manner. I implemented a priority queue for compactions that may make a lot of sense to include with these optimizations (see HBASE-2646).

          stack added a comment -

          Bulk move of 0.20.5 issues into 0.21.0 after vote that we merge branch into TRUNK up on list.

          stack added a comment -

          We need this patch (smile)

          Kannan Muthukkaruppan added a comment -

          I was seeing the 'too many store files' issue recently too, but that was even without this patch. And after the HBASE-2439 fix, I don't see this that much.

          One data point: an import test I ran with 1M keys, ~1000 columns each of ~512 bytes took about 15 hrs without the patch, and 8 hours with it! Basically the patch halves the number of split related compactions -- and the import test was mostly busy doing split related compactions.

          We did notice some unevenly sized regions at the end of the run... but not sure if that is related to the patch. Will look more into it tomorrow.

          stack added a comment -

          I did some more testing on this patch and we seem to be running into the 'too many store files' issue more often, so as said up on IRC, maybe a flush that melds the memstore with a compaction would be needed now that we compact less.

          Jonathan Gray added a comment -

          Thanks stack. I will run full cluster tests on Monday and work on cleaning up log messages.

          stack added a comment -

          I ran the patch a while. Seems to work basically. Let's fix them ugly DEBUG NotServingRegion messages – make them one liners and not exceptions on server?

          Can you make this message better detailed:

          2010-04-03 16:10:14,872 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Split triggered on region TestTable,0115344204,1270339622626 because Store info exceeds max size

          What the max size is, what our current size is, and the file count might be of use.

          Is this right? It's confusing at least:

          The above line was followed by this:

          2010-04-03 16:10:14,872 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region TestTable,0115344204,1270339622626/81593204 because: store size > max size

          I thought we'd just said we were going to split and here we are now compacting?

          Here is another example:

          2010-04-03 16:10:27,725 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Split triggered on region TestTable,0186659666,1270339709264 because Store info exceeds max size
          2010-04-03 16:10:27,725 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region TestTable,0186659666,1270339709264/814994791 because: store size > max size

          Good stuff Jon.

          Kannan Muthukkaruppan added a comment -

          Jonathan wrote: <<<Tests pass except TestTableIndex is flakey for some reason (seems unrelated).>>>

          Yes, this started happening in our internal branch when we refreshed from our internal hadoop branch. Karthik was looking into this yesterday... seemed like some env/setup issue... but it is unrelated to your stuff.

          Andrew Purtell added a comment -

          I find that some tests, e.g. the IHBase tests, intermittently fail. The prior test does not always shut down in time (or at all), leading to initialization problems in the subsequent test. Is it like this? Check the exceptions in the log of the failed test, or run the failed test alone to confirm the above effect.

          stack added a comment -

          FYI, this patch does not fail the TestTableIndex over here.

          stack added a comment -

          In 'testForceSplitMultiFamily', do you intend filling two families or just the second family? Is the bug that we don't split if the second family is over the split limit?

          In Store#compact, where we do ' if(forceSplit) { '... is this ok if multiple families? If multiple families, one of the families will have the best midkey... is that what we split on or is it just the first? (I see later in MemStoreFlusher that it may have the same problem in that it will split on the first family that meets the criteria). Maybe this is ok for now. Perhaps file an improvement for this behavior.

          I do not follow this bit:

          +      // Do not trigger any splits otherwise, so always return references=true
          +      // which will prevent splitting.
          +       
                 if (!fs.exists(this.regionCompactionDir) &&
                     !fs.mkdirs(this.regionCompactionDir)) {
                   LOG.warn("Mkdir on " + this.regionCompactionDir.toString() + " failed");
          -        return checkSplit(forceSplit);
          +        return DO_NOT_SPLIT;
                 }
          

          If we failed to make the compaction dir, we used to check for split; now we don't split. What's up here?

          Same later in the file at #238 in patch and at #247.

          Should we change this data member name and the configuration that feeds it, or its description at least, to explain that we are now using the size of the Store rather than the maximum file size?

          Thinking about it, we'll now split more often – because we split sooner (see your first narrative above Jon and no need to wait on compaction to finish before we split – this takes time). Also, split will make more references than in past because usually in past we'd split one file after big compaction. Now we split before compaction so the 3 or 4 files in family will all be split using References. Ain't sure if this will make any difference. Just stating what will happen.

          This change in default needs to make it out to hbase-default.xml:

          484 + conf.getInt("hbase.hstore.compactionThreshold", 5);

          Otherwise, patch looks good. Did you try it?

          Jonathan Gray added a comment -

          Please review. Tests pass except TestTableIndex is flakey for some reason (seems unrelated).

          Andrew Purtell added a comment -

          Going from .3 to .4, this is really a fork, and we've already committed to related disruption.

          We should maintain the philosophy that API and RPC (client and server interfaces) can support rolling restart upgrade on branch, everything else is open for change, especially performance improvements.

          Just my 0.02.

          stack added a comment -

          Thanks for doing the detective work Jon. The original flush/compaction/split configuration 'balance' was set in place a very long time ago when things were very different. Thanks for taking the time to reexamine.

          I see nothing wrong w/ your narrative above.

          To avoid 5 store files – it'd be better if we never got to this point as J-D hints because looking in many storefiles at once hurts performance – maybe we could set the threshold down, to 3 again or 4 even, but so 3 files trigger a split, we might need to change flushsize. Up it? 64MB is probably bad since by the time it is written out, it's going to be > 64MB and so the flush file will span more than one hdfs block anyways. So we should embrace this fact and set flush up to 96MB? (This paragraph conflates a few ideas – flushing and compacting thresholds – but the base idea should come across).

          Chatting up on IRC, this issue is related to hbase-1892 though it does the dbl-flush you mention over there implicitly (as you noted on IRC).

          bq. Also I don't think this should be in 0.20, the branch is disrupted enough as it is.

          I'd say it depends on how disruptive the patch. We could get big change in our i/o pattern for small code change.

          Jonathan Gray added a comment -

          We'll work on a patch for this and try to get numbers soon. Stay tuned.

          Jonathan Gray added a comment -

          Actually if you think about how things behave with lots of smaller flushes, it may be equally as ugly. If we imagine 16MB flushes, we'll do 3 * 16 -> 48, +16 +16 -> 80, +16 +16 -> 112, +16 +16 -> 144, etc... The cost of being compaction-happy is that we end up rewriting data an inordinate amount of times. The cells in the first flushed file get rewritten over and over again. Bumping from 3 to 5 reduces rewrites by a significant percentage (what is that, 40% reduction?).

          Jean-Daniel Cryans added a comment -

          Right I see your point... I'm definitely interested in numbers!

          Jonathan Gray added a comment -

          It's not just in the case of rapid uploads, but makes the biggest difference in that circumstance. I don't really see the upside to requiring compactions on both sides of a split. If more flushes happen because of heap pressure, hlogs, majors, etc... that's fine. If we hit the compactionThreshold but aren't bigger than the max region size, we will still compact. Doesn't impact that behavior. We would just do it later (5 files not 3) because of configuration changes.

          I'm +1 on treating META differently. We should also investigate doing a flush + major compact as a single operation which shouldn't be so bad to implement.

          As for where to put it, I think we want to get it working here at fb, so whether it makes it into branch or not, if we get the time to implement it we'll be applying it to our internal 0.20 branch. Understand the concern though, branch is packed. In any case, we'll be doing heavy load testing on it and comparing performance numbers so we'll see if it's a big win or not. We're totally io bound so I suspect this will make a difference.

          Jean-Daniel Cryans added a comment -

          This is only in the case of rapid uploads right? In a "normal" production system you shouldn't be splitting thousands of times per day; you will flush some old regions because you hit the maximum number of hlogs, or it took more than 24h to fill a region so it was major compacted, etc.

          But wrt uploads I like the idea but I have an issue applying that to .META., I'm at the point where I think that we should major compact it at every flush, so compacting it only at 5 files would be killing your performance more than the current situation is.

          Also I don't think this should be in 0.20, the branch is disrupted enough as it is.


            People

            • Assignee:
              Jonathan Gray
            • Reporter:
              Jonathan Gray
            • Votes:
              1
            • Watchers:
              14
