HBASE-4365

Add a decent heuristic for region size

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.92.1, 0.94.0
    • Fix Version/s: 0.94.0
    • Component/s: None
    • Labels:
    • Hadoop Flags: Reviewed
    • Release Note:
      Changes the default split policy from ConstantSizeRegionSplitPolicy to IncreasingToUpperBoundRegionSplitPolicy, which splits quickly at first and slows as the number of regions climbs.

      Split size is the smaller of (a) the number of regions of the same table on this regionserver, squared, times the region flush size, and (b) the maximum region split size. For example, if the flush size is 128M, the first flush triggers a split, producing two regions that will each split when they reach 2 * 2 * 128M = 512M. When one of them splits there are three regions and the split size becomes 3 * 3 * 128M = 1152M, and so on, until the configured maximum filesize is reached; from then on that maximum is used.

      Be warned, this new default could bring on lots of splits if you have many tables on your cluster. Either go back to the old split policy or raise the lower-bound configuration.

      This patch changes the default flush size from 64M to 128M and raises the eventual region split size, hbase.hregion.max.filesize, from 1G to 10G.
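
      For illustration, a minimal Java sketch of the size rule described above (a hypothetical helper, not the actual IncreasingToUpperBoundRegionSplitPolicy source; the name and signature are invented):

          // Hypothetical sketch of the split-size rule from the release note.
          static long effectiveSplitSize(int tableRegionsOnServer,
                                         long flushSizeBytes,
                                         long maxFileSizeBytes) {
            // regions of this table on this server, squared, times the flush size...
            long growing = (long) tableRegionsOnServer * tableRegionsOnServer * flushSizeBytes;
            // ...capped at the configured maximum region split size.
            return Math.min(growing, maxFileSizeBytes);
          }

          // With a 128M flush size and a 10G maximum:
          //   1 region  -> split on first flush
          //   2 regions -> 2 * 2 * 128M = 512M
          //   3 regions -> 3 * 3 * 128M = 1152M
          //   ... until the 10G maximum takes over.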

      Description

      A few of us were brainstorming this morning about what the default region size should be. There were a few general points made:

      • in some ways it's better to be too-large than too-small, since you can always split a table further, but you can't merge regions currently
      • with HFile v2 and multithreaded compactions there are fewer reasons to avoid very-large regions (10GB+)
      • for small tables you may want a small region size just so you can distribute load better across a cluster
      • for big tables, multi-GB is probably best
      Attachments

      1. 4365-v5.txt
        16 kB
        stack
      2. 4365-v4.txt
        16 kB
        stack
      3. 4365-v3.txt
        14 kB
        stack
      4. 4365-v2.txt
        12 kB
        stack
      5. 4365.txt
        12 kB
        stack

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Hudson added a comment -

          Integrated in HBase-TRUNK #2669 (See https://builds.apache.org/job/HBase-TRUNK/2669/)
          HBASE-4365 Add a decent heuristic for region size (Revision 1293099)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/IncreasingToUpperBoundRegionSplitPolicy.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyPrefixRegionSplitPolicy.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.java
          • /hbase/trunk/src/main/resources/hbase-default.xml
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionSplitPolicy.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #122 (See https://builds.apache.org/job/HBase-TRUNK-security/122/)
          HBASE-4365 Add a decent heuristic for region size (Revision 1293099)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HConstants.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/IncreasingToUpperBoundRegionSplitPolicy.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyPrefixRegionSplitPolicy.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.java
          • /hbase/trunk/src/main/resources/hbase-default.xml
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionSplitPolicy.java
          Jean-Daniel Cryans added a comment -

          Oh and no concurrent mode failures, as I don't use dumb configurations. Also my ZK timeout is set to 20s.

          Jean-Daniel Cryans added a comment -

          FWIW running a 5TB upload took 18h.

          Lars Hofhansl added a comment -

          Thanks for also changing KeyPrefixRegionSplitPolicy Stack.
          This is great, I'll deploy this to our test cluster next week.

          stack added a comment -

          Committed to trunk. Thanks for reviews lads and testing j-d

          Lars Hofhansl added a comment -

          +1 on V5

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12515889/4365-v5.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -136 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 153 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1039//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1039//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1039//console

          This message is automatically generated.

          stack added a comment -

          Fix test (it wasn't accommodating the square of the number of regions) and address Ted's comment.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12515883/4365-v4.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -136 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 153 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1038//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1038//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1038//console

          This message is automatically generated.

          Ted Yu added a comment -

          Minor comment:

          +      // If any of the stores are unable to split (eg they contain reference files)
          

          Should read 'of the stores is unable'

          Please also fix the failed test.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12515877/4365-v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -136 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 153 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1037//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1037//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1037//console

          This message is automatically generated.

          stack added a comment -

          Your wish is my command Lars.

          Also addressed Ted's comments made earlier that I'd forgotten to fix.

          Lars Hofhansl added a comment -

          Wanna have KeyPrefixRegionSplitPolicy extend this new policy (rather than ConstantSizeRegionSplitPolicy)?

          stack added a comment -

          This version sets the default split policy to the new one and ups the max file size from 1G to 10G. This is what I'll commit unless there's an objection. It uses the square of the number of regions * flushsize.

          Lars Hofhansl added a comment -

          Ah nice

          Jean-Daniel Cryans added a comment -

          Ah sorry, I misunderstood your question. I meant that without the patch there's a lot more blocking. With the patch there are more regions, so more compactions happen in parallel.

          Lars Hofhansl added a comment -

          @Stack: Yeah, power of three is probably too much.
          In the last comment I was referring to the upper bound of 3 regions per regionserver that you suggested, at which point we set the split size to region size.

          @J-D: so in some cases we increase latency with this patch? Not saying it's a problem just curious.

          +1 on patch.

          stack added a comment -

          @Lars I think power of three will have us lose some of the fast fan out of regions. I suggest we commit with power of two and 10G split boundary for this issue and look to other policies to improve on this basic win.

          Jean-Daniel Cryans added a comment -

          One question I had: Did you observe write blocking - due to the number of store files - more frequently than without the patch (because with the patch we tend to get more store-files).

          It does happen a lot more in the beginning; growing out of the first few regions is really hard.

          Lars Hofhansl added a comment -

          @Stack: I trust J-D's test far more than my (untested) intuition
          I do like the upper bound of 3, though.

          @J-D: Wow.
          One question I had: Did you observe write blocking - due to the number of store files - more frequently than without the patch (because with the patch we tend to get more store-files).

          stack added a comment -

          @Lars You want to put an upper bound on the number of regions?

          I think if we do power of three, we'll lose some of the benefit J-D sees above; we'll fan out the regions slower.

          Do you want to put an upper bound on the number of regions per regionserver for a table? Say, three? As in, when we get to three regions on a server, just scoot the split size up to the maximum. So, given a power of two, we'd split on first flush, then the next split would happen at 2*2*128M = 512M, then at 3*3*128M = 1.2G, and thereafter we'd split at the max, say 10G?

          Or should we just commit this for now and do more in another patch?

          Todd Lipcon added a comment -

          Great results! Very cool.

          Jean-Daniel Cryans added a comment -

          Conclusion for the 1TB upload:

          Flush size: 512MB
          Split size: 20GB

          Without patch:
          18012s

          With patch:
          12505s

          It's 1.44x better, so a huge improvement. The difference here is due to the fact that it takes an awfully long time to split the first few regions without the patch. In the past I was starting the test with a smaller split size and then once I got a good distribution I was doing an online alter to set it to 20GB. Not anymore with this patch

          Another observation: the upload in general is slowed down by "too many store files" blocking. I could trace this to compactions taking a long time to get rid of reference files (3.5GB taking more than 10 minutes) and during that time you can hit the block multiple times. We really ought to see how we can optimize the compactions, consider compacting those big files in many threads instead of only one, and enable referencing reference files to skip some compactions altogether.

          Ted Yu added a comment -

          The above concern makes sense.
          We'd better make this change before branching 0.94

          Lars Hofhansl added a comment -

          Also, I wonder now whether it is time to separate the part of RegionSplitPolicy that decides when to split (shouldSplit(...)) from the part that decides where to split (getSplitPoint(...)).

          Thinking about HBASE-5304, where we want to control where a region splits but don't care when it splits.
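
          A hypothetical sketch of such a separation (interface names invented for illustration, not an actual HBase API):

              // Hypothetical split of responsibilities; names are invented here.
              interface SplitTrigger {
                /** Decides WHEN a region should split. */
                boolean shouldSplit();
              }

              interface SplitPointSelector {
                /** Decides WHERE to split, i.e. which row key to split on. */
                byte[] getSplitPoint();
              }

              // A policy like the one wanted in HBASE-5304 could then supply only a
              // SplitPointSelector and reuse whichever SplitTrigger is configured.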

          Lars Hofhansl added a comment -

          I like N^3. In the scenarios above it would lead to 3 and 4 regions (respectively) before we reach 10gb.
          We could even be more radical and say: if we see 1 region of the table on this regionserver we split at flush size; if we see 2 or more, we split at the max region size.
          (This assumes that if a regionserver sees 2 regions of the same table, the table is likely to be large.)

          Jean-Daniel Cryans added a comment -

          Alright I did a macrotest with 100GB.

          Configurations:
          Good old 15 machines test cluster (1 master), 2xquad, 14GB given to HBase, 4x SATA.
          The table is configured to flush at 256MB, split at 2GB.
          40 clients that use a 12MB buffer, collocated on the RS.
          Higher threshold for compactions.

          Without patch:
          1558s

          With patch:
          1457s

          1.07x improvement.

          Then what I saw is that once we've split a few times and the load got balanced, the performance is exactly the same. That's expected. Also it seems that my split-after-flush patch goes into full effect.

          I'm running another experiment right now uploading 1TB with flush set at 512MB and split at 20GB. I assume an even bigger difference. The reason to use 20GB is that with bigger data sets you need bigger regions, and starting such a load from scratch is currently horrible but this is what this jira is about.

          Todd Lipcon added a comment -

          OK

          Jean-Daniel Cryans added a comment -

          Got any comparison numbers for total import time, for say 100G load?

          Not yet, but I can definitely see that it jumpstarts the import.

          Would be good to know that the new heuristic is definitely advantageous.

          It is, I don't need numbers to tell you that.

          Todd Lipcon added a comment -

          Got any comparison numbers for total import time, for say 100G load? Would be good to know that the new heuristic is definitely advantageous.

          Jean-Daniel Cryans added a comment -

          The latest patch is looking good on my test cluster, will let the import finish before giving my +1 tho.

          stack added a comment -

          Make it the square of the count of regions.

          Address also a problem found by j-d where I was getting region size from conf instead of from HTD.

          This patch works on trunk only. Will need to do a version for 0.92.
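
          A rough sketch of the HTD-first lookup this describes (simplified and illustrative; it assumes an unset HTD value is reported as <= 0 and uses the 10G default from this patch — exact accessor behavior may differ by HBase version):

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.hbase.HTableDescriptor;

              // Prefer the per-table setting from the HTableDescriptor, falling back
              // to the server-wide configuration when the table doesn't set one.
              static long desiredMaxFileSize(HTableDescriptor htd, Configuration conf) {
                long size = htd.getMaxFileSize();            // per-table value, if set
                if (size <= 0) {                             // assumed "unset" convention
                  size = conf.getLong("hbase.hregion.max.filesize",
                      10L * 1024 * 1024 * 1024);             // fall back to conf / 10G default
                }
                return size;
              }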

          stack added a comment -

          If I understand correctly a regionserver would still split at a size < 10gb until there about 900 regions for the table (assuming somewhat even distribution).

          Well each split would take longer because the threshold will have grown closer to the 10GB, but yeah. And I think this is what we want. Doing to the power of 3 would make us rise to the 10GB faster. We'd split on first flush then at

          This is probably ok. More regions means that we'll fan out regions over the cluster a little faster. We'll have 9 regions for a table on each server which is probably too many still. We could do to the power of 3 so we'd split on first flush, then at 1G, 3.4G, 8.2G and then we'd be at our 10G limit.

          Lars Hofhansl added a comment -

          Was thinking that before each region server reaches 9 (or even just 5) regions for a table we'd have a lot of regions.

          Say I have 10gb regionsize and 128mb flushsize and 100 regionservers.
          If I understand correctly, a regionserver would still split at a size < 10gb until there are about 900 regions for the table (assuming a somewhat even distribution).

          Maybe this is good?
          I guess ideally we'd get to about 100 regions and then just grow them unless they reach 10gb... Maybe even fewer regions if there are many tables.

          (As I said above I might not have grokked this correctly)

          stack added a comment -

          Wouldn't we potentially do a lot of splitting when there are many regionservers?

          Each regionserver would split with the same growing reluctance. Don't we want a bunch of splitting when lots of regionservers so they all get some amount of the incoming load promptly?

          This issue is about getting us to split fast at the start of a bulk load but then having the splitting fall off as more data made it in.

          I'm thinking our default regionsize should be 10G. I should add this to this patch.

          I don't get what you are saying on the end Lars. Is it good or bad that there are 5 regions on a regionserver before we get to the max size? Balancer will cut in and move regions to other servers and they'll then split eagerly at first with rising reluctance.

          Lars Hofhansl added a comment -

          Never use a calculator... 10gb/128mb = 80, 5gb/256mb = 20.

          Lars Hofhansl added a comment - edited

          Wouldn't we potentially do a lot of splitting when there are many regionservers?
          (Maybe I am not grokking this fully)

          If we take the square of the number of regions, and say we have 10gb regions and a flush size of 128mb, we'd reach the 10gb after 9 regions of the table are on the same regionserver.
          We were planning a region size of 5gb and flush size of 256mb; that would still be 5 regions.
          (10gb/128mb ~ 78, 5gb/256mb ~ 19)
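
          A quick sketch of that arithmetic, assuming the squared heuristic (illustrative helper, not HBase code):

              // Smallest number of same-table regions on a server at which
              // n * n * flushSize reaches the configured max region size.
              static int regionsUntilMax(long flushSizeBytes, long maxSizeBytes) {
                return (int) Math.ceil(Math.sqrt((double) maxSizeBytes / flushSizeBytes));
              }

              // regionsUntilMax(128L << 20, 10L << 30) == 9   // 10gb / 128mb = 80, sqrt ~ 8.9
              // regionsUntilMax(256L << 20,  5L << 30) == 5   // 5gb / 256mb = 20, sqrt ~ 4.5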

          stack added a comment -

          Making it so we consider this for 0.92.1 – it's usability. Will try it on a cluster w/ the square of the number of regions.

          stack added a comment -

          Chatting w/ J-D, it's probably less disruptive if we do the square of the count of regions on a regionserver so we get to the max size faster (then there'll be fewer regions created overall by this phenomenon).

          stack added a comment -

          I can add the above changes (and will fix the superclass from where I copied this stuff too), but I'm more interested in feedback on whether folks think we should put this in as the default split policy. If so, I'll then spend time trying it on a cluster; otherwise not.

          Ted Yu added a comment -

          Thanks for the patch.

          +    boolean force = region.shouldForceSplit();
          

          shouldSplit() can return immediately if force is true, right?

          +        foundABigStore = true;
          +      }
          

          We can break out of the for loop when foundABigStore becomes true.

          Validation in real cluster is appreciated.
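
          A sketch of the loop shape with both suggestions applied (simplified and hypothetical, not the committed code; the parameters stand in for the policy's own state):

              import java.util.Collection;
              import org.apache.hadoop.hbase.regionserver.Store;

              static boolean shouldSplit(boolean forceSplit, Collection<Store> stores,
                                         long sizeToCheck) {
                if (forceSplit) {
                  return true;                 // return immediately when a split is forced
                }
                for (Store store : stores) {
                  // If any of the stores is unable to split (e.g. it contains
                  // reference files), the region cannot split at all.
                  if (!store.canSplit()) {
                    return false;
                  }
                  if (store.getSize() > sizeToCheck) {
                    return true;               // stop scanning once one store is big enough
                  }
                }
                return false;
              }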

          stack added a comment -

          Here is a first cut.

          It does not do the lookup of regions in a table across the cluster, nor query ZK to find out how many nodes are in the mix. It's kinda hard to do this from a RegionSplitPolicy context.

          Instead, we count the number of regions that belong to the table on the current regionserver. We then multiply the flush size by this number, and that's when we'll split. If the multiplication produces a number > the max filesize for a region, we'll take the max filesize.

          If 1 region for a given table on a regionserver, we'll split on the first flush.

          If 5 regions from same table on a regionserver, we'll split at 5 * 128M and so on.

          We could have the size grow more aggressively by squaring the count of regions; that might make sense if the cluster has lots of small tables – in fact it might be better altogether. What do you all think?

          If agreeable, I should make a new patch that makes this the default splitting policy.
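
          A minimal sketch of this first-cut, linear heuristic (illustrative only; the later patch versions square the count):

              // First cut: the split threshold grows linearly with the number of
              // regions of the table hosted on this regionserver.
              static long firstCutSplitSize(int tableRegionsOnServer,
                                            long flushSizeBytes,
                                            long maxFileSizeBytes) {
                long threshold = tableRegionsOnServer * flushSizeBytes;  // 1 region -> split on first flush
                return Math.min(threshold, maxFileSizeBytes);            // 5 regions -> 5 * 128M, and so on
              }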

          Ted Yu added a comment -

          +1 on Todd's idea.
          Minor comment: the pluggable policy would always represent a class name. We can devise a default policy, say ConstantMaxRegionSizePolicy, which returns the value of hbase.hregion.max.filesize

          Todd Lipcon added a comment -

          Hmm, that would suggest a heuristic based not on number of regions, but based on total table size. However, it seems like a bit of an edge case.

          Perhaps we can make this a pluggable policy like so: allow max region size to be either a class name or an integer. If it's a class name, it refers to an implementation of some interface like MaxRegionSizeCalculator. If it's an integer, it acts the same as today (fixed size). Then we could easily experiment with different heuristics.

          Ted Yu added a comment -

          For pre-splitting tables, consider this scenario:
          The pre-split regions didn't represent the actual distribution of row keys for the underlying table, meaning a relatively low number of regions receives the writes initially.
          Maybe splitting these regions relatively fast would achieve better performance.

          Todd Lipcon added a comment -

          My comment about load balancer was assuming there're many tables in the cluster.

          I guess I'm confused how this is related to the region size heuristic. This is a general LB concern, but shouldn't be worse/better due to this heuristic, right?

          My second comment originated from our practice of pre-splitting tables. It is possible that R == 5S would be reached soon after the creation of the table for small-medium sized cluster.

          In that case, isn't it a good thing that we'd automatically set the region size to be fairly large? ie if you've pre-split to 5S regions, then you probably don't want it to keep splitting faster on you.

          Ted Yu added a comment -

          I understand the proposal provides better heuristic for determining region size.

          My comment about load balancer was assuming there're many tables in the cluster.

          My second comment originated from our practice of pre-splitting tables. It is possible that R == 5S would be reached soon after the creation of the table for small-medium sized cluster.

          Todd Lipcon added a comment -

          Load balancer currently doesn't balance the regions for any single table. We should introduce a policy that does this.

          That seems orthogonal, to me. If you have a single table in the cluster, then you need at least as many regions as servers to make use of all of your servers.

          If you have many tables, then yes, a per-table balancing might be useful (in some cases), but it's the case regardless of whether we have a split size heuristic or manually set region size.

          It seems that the proposal favors not pre-splitting tables. If so, we need some solid performance results to back the proposal.

          How so? I'm suggesting that we retain the MAX_REGION_SIZE parameter, if you want to manually set it to some value, or set it to MAX_LONG and manually split. But the default would be this heuristic, which would work well for many use cases.

          Ted Yu added a comment -

          There is some assumption for the above discussion.

          for small tables you may want a small region size just so you can distribute load better across a cluster

          Load balancer currently doesn't balance the regions for any single table. We should introduce a policy that does this.

          It seems that the proposal favors not pre-splitting tables. If so, we need some solid performance results to back the proposal.

          Todd Lipcon added a comment -

          One idea that I'd like to propose is the following. We would introduce a special value for max region size (eg 0 or -1) which configures HBase to use a heuristic instead of a constant. The heuristic would be:

          • on region-open (and periodically, perhaps on flush or compaction) the RS scans META to find how many regions are in the table. Let's call this R.
          • the region server also periodically scans ZK or asks the master to determine how many RS are in the cluster. Let's call this S.
          • If R < S (ie there are fewer regions than servers), we use a max region size of perhaps 256MB. This is so that early in a table's lifetime, even if not pre-split, the table will split rapidly and be able to spread load across the cluster.
          • If R < 5S, we use a max region size of perhaps 1GB.
          • If R > 5S, we use a max region size of something rather large, like 10GB.

          As an example, if we started putting a 1T table on a cluster with 20 nodes, we'd quickly expand to 20 regions of 256M. The size would switch to 1G, and we'd get another ~100 regions. The size would switch to 10G and we'd get another 100 regions. So at the end, we'd have around 200-250 regions, or about 10 per server.

          The specific numbers above aren't well thought out, but does this seem more "foolproof" than any predetermined default?
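
          A sketch of that tiered heuristic (threshold values copied from the comment above; everything else is illustrative, not an HBase API):

              // R = regions of the table in the cluster, S = number of regionservers.
              static long heuristicMaxRegionSize(int regionsInTable, int regionServers) {
                final long MB = 1024L * 1024;
                final long GB = 1024 * MB;
                if (regionsInTable < regionServers) {
                  return 256 * MB;   // R < S: split fast to spread load across the cluster
                } else if (regionsInTable < 5 * regionServers) {
                  return 1 * GB;     // R < 5S
                } else {
                  return 10 * GB;    // R > 5S: let regions grow large
                }
              }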


            People

            • Assignee: stack
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 9
