HBASE-3149: Make flush decisions per column family

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.89-fb
    • Component/s: regionserver
    • Labels: None

      Description

      Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions.

      Attachments

      1. 3149-trunk-v1.txt (89 kB) - Ted Yu
      2. Per-CF-Memstore-Flush.diff (79 kB) - Gaurav Menghani


          Activity

          Jean-Daniel Cryans added a comment -

          I have been thinking about this one for some time... I think it makes sense in loads of ways, since a common problem of multi-CF is that during the initial import the user ends up with thousands of small store files because some family grows faster and triggers the flushes, which in turn generates incredible compaction churn. On the other hand, it means that we almost treat a family as a region, e.g. one region with 3 CFs can have up to 3x64MB in the memstores.

          Karthik Ranganathan added a comment -

          Yes, agreed that the memory implication is different.

          Eventually, is it not better to enforce the memory limit by using a combination of flush sizes and restricting the number of regions we create? Because ideally we should allow different flush sizes for the different CFs, as the KV sizes could be way different...

          Shall I just make this an option in the conf for now with the default the way it is?

          Nicolas Spiegelberg added a comment -

          Some interesting stats. We did some rough calculations internally to see what effect an uneven distribution of data into column families was having on our network IO. Our data distribution for 3 column families was 1:1:20. When we looked at the flush:minor-compaction ratio for each of the store files, the large column family had a 1:2 ratio but the small CFs both had a 1:20 ratio! We are looking at roughly a 10% network IO decrease if we can bring those other 2 CFs down to a 1:2 ratio as well.

          Nicolas Spiegelberg added a comment -

          This is a substantial refactoring effort. My current development strategy is to break this down into 4 parts. Each one will have a diff + review board so you guys don't get overwhelmed...

          1. Move flushcache() from Region -> Store; have Region.flushcache loop through the Store API (see the sketch below)
          2. Move locks from Region -> Store; figure out the flush/compact/split locking strategy
          3. Refactor HLog to store per-CF seqnum info
          4. Refactor MemStoreFlusher from regionsInQueue to storesInQueue
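
          A minimal sketch of what step 1 could look like, using hypothetical class and method names rather than the actual HBase Region/Store API:

            // Sketch only: Region delegates flushing to each Store instead of
            // treating all memstores as a single unit. Names are illustrative.
            import java.io.IOException;
            import java.util.List;

            class Store {
              private long memstoreSize;

              long getMemStoreSize() { return memstoreSize; }

              // Snapshot this store's memstore and write it out as one store file.
              long flushCache(long sequenceId) throws IOException {
                long flushed = memstoreSize;
                // ... write the snapshot to an HFile, then clear the snapshot ...
                memstoreSize = 0;
                return flushed;
              }
            }

            class Region {
              private final List<Store> stores;

              Region(List<Store> stores) { this.stores = stores; }

              // Step 1: Region.flushcache just loops through the Store API.
              boolean flushcache(long sequenceId) throws IOException {
                boolean flushedAny = false;
                for (Store store : stores) {
                  if (store.getMemStoreSize() > 0) {
                    store.flushCache(sequenceId);
                    flushedAny = true;
                  }
                }
                return flushedAny;
              }
            }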

          ryan rawson added a comment -

          If you are going to generate a sequence id for every CF, then we will need to create and use a new synthetic ID for atomic views.

          Nicolas Spiegelberg added a comment -

          @ryan: the main work in step #3 isn't HBASE-2856 work. It's roughly modifying HLog.lastSeqWritten from Map<region, long> => Map<store, long> and all the refactoring to the HLog code that it entails.
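
          A rough sketch of the kind of bookkeeping this change implies, with hypothetical names (the real HLog code is more involved):

            // Sketch only: track the oldest unflushed sequence number per store
            // (region + CF) instead of per region, to decide which HLogs can be
            // archived once the stores covering them have flushed.
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            class PerStoreSeqTracker {
              // was: Map<regionName, oldestUnflushedSeqNum>
              // now: Map<storeName, oldestUnflushedSeqNum>
              private final ConcurrentMap<String, Long> lastSeqWritten =
                  new ConcurrentHashMap<String, Long>();

              void append(String storeName, long seqNum) {
                // Remember only the oldest unflushed edit for each store.
                lastSeqWritten.putIfAbsent(storeName, seqNum);
              }

              void storeFlushed(String storeName) {
                lastSeqWritten.remove(storeName);
              }

              // Log files containing only edits older than this can be archived.
              long oldestUnflushedSeqNum() {
                long min = Long.MAX_VALUE;
                for (long seq : lastSeqWritten.values()) {
                  min = Math.min(min, seq);
                }
                return min;
              }
            }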

          ryan rawson added a comment -

          OK, got it. As long as we don't generate a seqid per family, we are all good.

          Nicolas Spiegelberg added a comment -

          While our row-level ACID guarantees for gets/scans are slightly broken in trunk (HBASE-2856), per-CF flushes will greatly aggravate this problem, because a single user-level Put could span both the current memstore and the snapshot memstore depending upon which CF the put shard resides in. Halting progress on this issue until that JIRA is completed.

          Schubert Zhang added a comment -

          This jira is very useful in practice.
          In HBase, the horizontal partitions by rowkey-ranges make regions, and the vertical partitions by column family make stores. Together these horizontal and vertical partitioning schemes form a grid of data cells: the stores in HBase.

          The memstore is based on the store, so flush and compaction should also be based on the store. The memstoreSize in HRegion should be in HStore.

          For flexible configuration, I think we should be able to configure the memstore size (i.e. hbase.hregion.memstore.flush.size) at the column-family level (when creating a table). And if possible, I would like maxStoreSize to also be configurable per column family.

          Ted Yu added a comment -

          HBASE-4645 makes edit log recovery conservative to avoid data loss.
          When this feature is implemented, we should revisit edit log recovery by passing an array of maxSeqIds.

          Mubarak Seyed added a comment -

          @Nicolas,
          Is there any update on this issue? We have a production use-case wherein 80% of the data goes to one CF and the remaining 20% goes to two other CFs. I can collaborate with you if you are interested in pursuing the patch. Thanks.

          Nicolas Spiegelberg added a comment -

          @Mubarak: I think you are probably more interested in tuning the compaction settings. The initial reason for this JIRA was higher network IO. The actual problem was that the min unconditional compact size was too high and caused bad compaction decisions. We fixed this by lowering the min size from the default of the flush size (256MB, for us) to 4MB.

            <property>
             <name>hbase.hstore.compaction.min.size</name>
             <value>4194304</value>
             <description>
               The "minimum" compaction size. All files below this size are always
               included into a compaction, even if outside compaction ratio times
               the total size of all files added to compaction so far.
             </description>
            </property>
          

          We identified this a while ago and I thought we were going to change the default for 0.92, but it looks like it's still in the Store.java code. A better use of your time would be to verify that this reduces your IO and write up a JIRA to change the default.

          Mubarak Seyed added a comment -

          Thanks Nicolas. Will try with 4 MB and create a JIRA.

          stack added a comment -

          Thanks @Nicolas (and thanks @Mubarak – sounds like something to indeed get into 0.92).

          At the same time, I'd think this issue is still worth some time; if there are lots of CFs and only one is filling, it's silly to flush the others as we do now just because one is over the threshold.

          Lars Hofhansl added a comment -

          @Nicolas: Interesting bit about hstore.compaction.min.size. I'm curious, is 4MB something that works specifically for your setup, or would you generally recommend setting it that low?
          It probably has to do with whether compression is enabled, how many CFs there are and their relative sizes, etc.

          Maybe instead of defaulting it to flushsize, we could default it to flushsize/2 or flushsize/4...?

          stack added a comment -

          @Nicolas I wonder about this... hbase.hstore.compaction.min.size. When we compact, don't we have to take adjacent files as part of our ACID guarantees? Would this frustrate that? (I'll take a look... tomorrow). I'm wondering because I want to figure out how to make it so we favor reference files... so they are always included in a compaction.

          stack added a comment -

          Making this critical for 0.92.1 so it gets a bit of loving...

          Lars George added a comment -

          At the same time, I'd think this issue is still worth some time; if there are lots of CFs and only one is filling, it's silly to flush the others as we do now just because one is over the threshold.

          I thought so too. Setting hbase.hstore.compaction.min.size to 4MB while having the flush size at 256MB means you will never compact flush files larger than 4MB. So, in other words, only if you are flushing small files (say from a small, dependent column family) do you run a minor compaction on them. For the larger family you typically do not run those at all, right?

          This surely seems like a setting specific to this use case, and there are others that need a slightly different setting. If you mix those two on the same cluster, then having only one global setting to adjust this seems restrictive? Should this be a setting per table, like the flush size?

          It still seems to me that decoupling is what we should have available as well. But I thought about it for a while and also discussed it with various people: it seems that decoupling brings its own set of issues; for example, you might end up with too many HLog files because the small family is flushed only rarely.

          Lars Hofhansl added a comment -

          @Lars G: Now I am perfectly confused. The description says that "All files below this size are always included into a compaction".
          I had assumed this setting is meant to quickly get rid of the smaller store files. Did I misunderstand?

          Nicolas Spiegelberg added a comment -

          @Lars/Stack: note that the number of StoreFiles necessary to store N amount of data is O(log N) with the existing compaction algorithm. This means that setting the compaction min size to a low value will not result in significantly more files. Furthermore, what's hurting performance is not the number of files but the size of each file. The extra files will be very small and take up only a minority of the space in the LRU cache. Every time you unnecessarily compact files, you have to repopulate that StoreFile in the LRU cache and incur a lot of disk reads in addition to the obvious write increase. This is all to say that I would recommend defaulting it that low, because the downsides are very minimal and the benefit can be substantial IO gains.

          At the same time, I'd think this issue is still worth some time; if there are lots of CFs and only one is filling, it's silly to flush the others as we do now just because one is over the threshold.

          Why is this silly? With cache-on-write, the data is still cached in memory. It's just migrated from the MemStore to the BlockCache, which has comparable performance. Furthermore, BlockCache data is compressed, so it then takes up less space. Flushing also minimizes the number of HLogs and decreases recovery time. Flushing would be bad if it meant we weren't optimally using the global MemStore size, but we currently are.

          This surely seems like a setting specific to this use case, and there are others that need a slightly different setting. If you mix those two on the same cluster, then having only one global setting to adjust this seems restrictive? Should this be a setting per table, like the flush size?

          I think this is a better default, not that it's a one-size-fits-all setting. I agree that this should be toggleable on a per-CF basis, hence HBASE-5335.

          stack added a comment -

          @Nicolas I think I follow. I opened HBASE-5461. Let me try it.

          Why is this silly?

          Because I was seeing a plethora of small files as a problem, but given your explanation above, I think I grok that it's not many small files that's the problem; it's that with the way-high min size, our selection was too inclusionary and so we ended up doing loads of rewriting.

          Lars Hofhansl added a comment -

          Thanks for explaining, Nicolas.
          I wonder if a good default would be some fraction of the flushsize. Maybe 1/4*flushsize, or something.

          Himanshu Vashishtha added a comment -

          This is a useful feature; I'm working on it.

          Sergey Shelukhin added a comment -

          Any update since last comment?

          Himanshu Vashishtha added a comment -

          I figured that it increases MTTR. I will probably look into it after we have fixed the recent MTTR issues. Un-assigning it in the meanwhile.

          Gaurav Menghani added a comment -

          Assigning this to myself, since I have committed this change in the 0.89-fb branch. Will upload the patch shortly.

          stack added a comment -

          Thanks Gaurav Menghani.
          Gaurav Menghani added a comment -

          Patch for the 0.89-fb version.

          Gaurav Menghani added a comment -

          This diff is only for the 0.89-fb branch; I will port this to master soon.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12610129/Per-CF-Memstore-Flush.diff
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/7618//console

          This message is automatically generated.

          Gaurav Menghani added a comment -

          The basic idea is to maintain the smallest LSN amongst the edits present in a particular memstore for a column family. When we decide to flush a set of memstores, we find the smallest LSN amongst the memstores that we are not flushing, say X, and then we can remove the logs for any edits with an LSN less than X. We choose a particular memstore to be flushed if it occupies more than 't' bytes, where the global memstore size threshold is 'T' (and t/T = 1/4 for our configuration). If there is no memstore with >= t bytes but the total size of all the memstores is above T, we flush all the memstores.
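
          As a rough sketch (hypothetical names, not the actual 0.89-fb code), the selection logic described above might look like:

            // Sketch only: choose which memstores (column families) to flush.
            // T = region-level memstore threshold, t = per-CF threshold (t = T/4 here).
            import java.util.ArrayList;
            import java.util.List;
            import java.util.Map;

            class PerColumnFamilyFlushPolicy {
              private final long regionFlushSize;    // T
              private final long perFamilyFlushSize; // t

              PerColumnFamilyFlushPolicy(long regionFlushSize) {
                this.regionFlushSize = regionFlushSize;
                this.perFamilyFlushSize = regionFlushSize / 4;
              }

              // memstoreSizes: column family name -> current memstore size in bytes
              List<String> selectFamiliesToFlush(Map<String, Long> memstoreSizes) {
                List<String> toFlush = new ArrayList<String>();
                long total = 0;
                for (Map.Entry<String, Long> e : memstoreSizes.entrySet()) {
                  total += e.getValue();
                  if (e.getValue() > perFamilyFlushSize) {
                    toFlush.add(e.getKey()); // this CF is big enough on its own
                  }
                }
                // No single CF is over t, but the region as a whole is over T:
                // fall back to flushing all memstores.
                if (toFlush.isEmpty() && total > regionFlushSize) {
                  toFlush.addAll(memstoreSizes.keySet());
                }
                return toFlush;
              }
            }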

          Ted Yu added a comment -

          Unfinished port to trunk.

          There is no log.appendNoSync() call in 89-fb, so that's a difference we need to resolve.

          Talking with Gaurav, I think it would be nice to provide different heuristics (policies) for which column families to flush when there is no single column family whose size exceeds threshold t.

          Gaurav Menghani added a comment -

          Ted has volunteered to port this to trunk in a separate JIRA. I will be working on different heuristics to see the benefits that we get.

          stack added a comment -

          Gaurav Menghani, have you deployed this change? If so, what do you see in operation? When you talk about different heuristics, what are you thinking? Thanks boss.

          Gaurav Menghani added a comment -

          Stack Yes, we have deployed this, with selective flushing disabled for now, since we didn't see any aggregate benefits yet. The heuristics I was thinking about were around which column families to flush when no column family is above the per-family flushing threshold. E.g. if the memstore limit is 128 MB and the flushing threshold for a CF is 32 MB, there might be a case where there are 7-8 CFs and none of them is above 32 MB.

          In that case, there are a couple of heuristics you can choose from, like: flush the top N column families, or flush only as many column families as needed to free up 1/4 of the memstore, etc. The main benefit I see is that the time spent compacting the smaller CFs will be much lower, since far fewer files would be created. This is weighed against bigger column families being flushed earlier than before, and thus having smaller files than without this change, but with the right heuristics we can find a good balance.
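
          One of the heuristics mentioned above, flushing the largest families until roughly 1/4 of the memstore limit is freed, could be sketched like this (illustrative names only, not actual HBase code):

            // Sketch only: pick the largest column-family memstores until the
            // selected total reaches about a quarter of the region's memstore limit.
            import java.util.ArrayList;
            import java.util.Collections;
            import java.util.Comparator;
            import java.util.List;
            import java.util.Map;

            class LargestFamiliesFirstHeuristic {
              List<String> select(Map<String, Long> memstoreSizes, long regionLimit) {
                long target = regionLimit / 4; // aim to free ~1/4 of the limit
                List<Map.Entry<String, Long>> families =
                    new ArrayList<Map.Entry<String, Long>>(memstoreSizes.entrySet());
                // Sort by memstore size, largest first.
                Collections.sort(families, new Comparator<Map.Entry<String, Long>>() {
                  public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                    return Long.compare(b.getValue(), a.getValue());
                  }
                });
                List<String> toFlush = new ArrayList<String>();
                long selected = 0;
                for (Map.Entry<String, Long> e : families) {
                  if (selected >= target) {
                    break;
                  }
                  toFlush.add(e.getKey());
                  selected += e.getValue();
                }
                return toFlush;
              }
            }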


            People

            • Assignee: Gaurav Menghani
            • Reporter: Karthik Ranganathan
            • Votes: 5
            • Watchers: 48
