Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.0, 0.99.0
    • Component/s: Compaction
    • Labels: None

      Description

      So I was thinking about having many regions as the way to make compactions more manageable; writing the LevelDB doc about how LevelDB range overlap and data mixing break seqNum sorting; discussing it with Jimmy, Matteo and Ted; and thinking about how to avoid the LevelDB I/O multiplication factor.

      And I suggest the following idea, let's call it stripe compactions. It's a mix between LevelDB ideas and having many small regions.
      It allows us to have a subset of the benefits of many regions (wrt reads and compactions) without many of the drawbacks (management overhead and the current memstore/etc. limitations).
      It also doesn't break seqNum-based file sorting for any one key.
      It works like this.
      The region key space is separated into a configurable number of fixed-boundary stripes (the boundaries are determined the first time we stripe the data; see below).
      All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDB, or to current files.
      The compaction policy does 3 types of compactions.
      First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped.
      Second is similar to the current compaction, but compacting one single stripe. In the future, nothing prevents us from applying compaction rules and compacting part of a stripe (e.g. similar to the current policy with ratios and stuff, tiers, whatever), but for the first cut I'd argue we let it "major compact" the entire stripe. Or just have the ratio and no more complexity.
      Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced.
      It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries.
      There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take a ridiculous amount of I/O.
      If we take many stripes, we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place.
      In general, if we initially let L0 grow before determining the stripes, we will get better boundaries.
      Also, unless the imbalance is really large, we don't really need to rebalance.
      Obviously this scheme (like level compaction) is not applicable to all scenarios; e.g. if timestamp is your key, it completely falls apart.
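
      The three compaction types above hinge on two pieces: picking fixed stripe boundaries the first time we stripe, and routing each key to its stripe. A minimal sketch of those two pieces in Java, with hypothetical names (this is not the actual HBase stripe compaction API):

      ```java
      import java.util.*;

      /**
       * Illustrative sketch: pick fixed stripe boundaries from a sorted
       * sample of L0 keys, then route a key to its stripe by binary search.
       * Class and method names are hypothetical.
       */
      public class StripeBoundaries {
          /** Split the sampled key space into `stripes` roughly equal ranges;
           *  returns the (stripes - 1) interior boundary keys. */
          static List<String> pickBoundaries(List<String> sortedKeys, int stripes) {
              List<String> bounds = new ArrayList<>();
              for (int i = 1; i < stripes; i++) {
                  bounds.add(sortedKeys.get(i * sortedKeys.size() / stripes));
              }
              return bounds;
          }

          /** Stripe index for a key: stripes are [start, end), so a key equal
           *  to a boundary belongs to the stripe starting at that boundary. */
          static int stripeFor(List<String> bounds, String key) {
              int idx = Collections.binarySearch(bounds, key);
              return idx >= 0 ? idx + 1 : -(idx + 1);
          }
      }
      ```

      Determining boundaries from a grown L0 sample, as the description suggests, makes the initial stripes track the actual key distribution rather than an arbitrary split of the key space.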

      The end result:

      • many small compactions that can be spread out in time.
      • reads still read from a small number of files (one stripe + L0).
      • region splits become marvelously simple (if we could move files between regions, no references would be needed).
      • the main advantage over Level (for HBase) is that the default store can still open the files and get correct results - there are no range-overlap shenanigans.
      • it needs no metadata, although we may record some for convenience.
      • it also would appear to not cause as much I/O.
      Attachments

      1. stripe-cdf.pdf (23 kB) - Sergey Shelukhin
      2. Stripe compaction perf evaluation.pdf (339 kB) - Sergey Shelukhin
      3. Stripe compaction perf evaluation.pdf (214 kB) - Sergey Shelukhin
      4. Stripe compaction perf evaluation.pdf (206 kB) - Sergey Shelukhin
      5. Stripe compactions.pdf (129 kB) - Sergey Shelukhin
      6. Stripe compactions.pdf (129 kB) - Sergey Shelukhin
      7. Stripe compactions.pdf (127 kB) - Sergey Shelukhin
      8. Stripe compactions.pdf (114 kB) - Sergey Shelukhin
      9. Using stripe compactions.pdf (141 kB) - Sergey Shelukhin
      10. Using stripe compactions.pdf (138 kB) - Sergey Shelukhin
      11. Using stripe compactions.pdf (135 kB) - Sergey Shelukhin

        Issue Links

          Activity

          Sergey Shelukhin added a comment -

          What do you guys think about the general idea?

          Matt Corgan added a comment -

          So a stripe is like a sub-region? In terms of compactions, it sounds like it serves the same purpose as splitting a table into regions, so you can compact a region that is hot while letting cold regions stay cold.

          If this is the case and stripes are allowed to auto-split, it may be very beneficial for time series data. If you had a region approaching 10GB with 100MB stripes, the last stripe would keep splitting and the first 99 would never get touched. The problem without stripes is that the first 9900MB keeps getting re-written even though it never changes.

          Am I understanding it correctly that stripes in a region would act similarly to regions in a table?

          chunhui shen added a comment -

          Interesting ideas.
          With stripe compaction, we could support many files in one region; each file belongs to one stripe, and no keys overlap across stripes except in L0. Is that right?

          I think it is useful for the sequential write scenario.

          if we could move files between regions, no references would be needed)

          Moving files would break snapshots; references are needed all the same.

          Sergey Shelukhin added a comment -

          Hmm, my thinking would be that the number of stripes will be fixed and we would rebalance, but never split as such. Yes, the idea for sequential data plays very well into this; the code will probably be almost the same. With non-seq data, it would just try to achieve an effect similar to level.

          Sergey Shelukhin added a comment -

          If there are no general objections I will try to start on this tomorrow, in preference to level... I am looking at some random HCM issues now so patch may be next week provided that HBASE-7516 and HBASE-7603 can go in.

          Matt Corgan added a comment -

          I've been brainstorming something similar for splitting the memstore into stripes, mentioned in https://issues.apache.org/jira/browse/HBASE-3484?focusedCommentId=13410934&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13410934

          I think it's a good idea now that region sizes have become so large. It's easy to have a few hot stripes in a region if it's 10GB, and not necessarily wrong from a primary-key design perspective. It's often very wasteful to be compacting the whole region.

          Jimmy Xiang added a comment -

          A stripe is like a sub-region. That's a good idea.

          Show
          Jimmy Xiang added a comment - A stripe is like a sub-region. That's a good idea.
          Hide
          Matt Corgan added a comment -

          Right now there is a tug-of-war between region size and number of regions.

          You want fewer regions for:

          • server startup/shutdown
          • minimizing RPC calls
          • having fewer/bigger memstore flushes
          • fewer open files per server

          You want more regions for:

          • spreading load among servers
          • having more efficient compactions

          spreading load among servers

          Since BigTable was designed machines have grown tremendously, with many running 24 cpus and 48+GB memory. These machines can serve many regions even if each region is 30GB. So we can achieve good load distribution even with enormous regions.

          having more efficient compactions

          But, huge regions are bad for compactions. A 30GB compaction is still considered expensive. Data in HBase is generally not perfectly evenly distributed across a table, or if it is, you are not taking full advantage of HBase's sorted architecture. Huge regions therefore have hot and cold stripes/sub-regions. If you compact a 30GB region, you are wasting a ton of time on the cold stripes.

          Overall, I think the sub-region approach cuts the rope in the tug-of-war. It lets you have a smaller number of regions at the same time as having the most efficient compactions.

          Jimmy Xiang added a comment -

          Overall, I think the sub-region approach cuts the rope in the tug-of-war. It lets you have a smaller number of regions at the same time as having the most efficient compactions.

          I agree. I was wondering if it will take longer to open a region if there are many hfiles.

          stack added a comment -

          Matt Corgan Nice write up. So you are in favor of Sergey's project?

          Jimmy Xiang If I were to guess, it will take longer than the case where we have monolithic files that cover the total region namespace as we currently have (because there will be more files). If rather than striping inside a region, we instead had a region per stripe, my guess is that striping will take less time to open, since there are fewer region machinations going on (fewer .regioninfos to open, less looking in the fs for stuff to clean up since the last file open, less listing of storefiles under families).

          Show
          stack added a comment - Matt Corgan Nice write up. So you are in favor of Sergey's project? Jimmy Xiang If I were to guess, it will take longer than the case where we have monolithic files that cover the total region namespace as we currently have (Because there will be more files). If rather than strip'ing inside a region, we instead had a region per stripe, my guess is that stripe'ing will take less time to open since less region machinations going on (less .regioninfos to open, less looking in fs for stuff to clean up since last file open, less listing of storefiles under families).
          Hide
          Matt Corgan added a comment -

          So you are in favor of Sergey's project?

          Oh yes, if you could not tell. Thinking of an HBASE-7667 tattoo =)

          One of the few major things hbase is missing in my opinion is the ability to load time-series through the normal api, rather than having to go off and write some separate bulk load code. HBase currently takes a dump when you do that. Main culprits are HBASE-5479 and my comment in HBASE-3484 (https://issues.apache.org/jira/browse/HBASE-3484?focusedCommentId=13410934&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13410934). Even during normal operation as opposed to a one-off import of data, the inefficiencies are still happening, just at a less obvious pace.

          It may be a follow-on to this jira, but having the "striper" dynamically add stripes at the end of the region would allow all the stripes before the last one to "go cold", which is critical for avoiding hugely wasteful compactions of non-changing data. Ideally, it would be able to allocate small stripes as new data comes in (each flush?) and then later go on to merge older stripes to reduce hfile count (at major compaction time?). With this in place on an N-node cluster, you could partition your data with N or 2N regions using a hash prefix and basically let the regions grow infinitely large. Currently I have to limit region size to ~2GB, which results in hundreds of regions per node - a bit of a management hassle because it's beyond human readable, and a bit wasteful with all the empty memstores, among other things.

          I do wonder if there's a more accurate name than stripe. Stripes make me think of RAID stripes which is a different concept than sub-regions. Sub-region is not a good name either though.

          It would be cool if you could set a column family attribute like layout=TIME_SERIES which HBase could use to automatically pick the compaction strategy, split-point strategy, balancer strategy, and allow future niceties like using stronger compression on old data.

          Lars Hofhansl added a comment -

          Stripes can have overlapping keyranges with other stripes, correct? I.e. if two L0 files are compacted with an L0 compaction, each L0 file is striped, but since the L0 files could overlap, the resulting stripes could also overlap with stripes from different L0 files.

          A nice property in LevelDB is that only L0 files have overlapping keyspaces with other files; all levels > L0 have no overlapping keys within a level, and no file at level L overlaps more than 10 files at L+1.

          Right now only major compactions can remove delete markers, because only by looking at all the data can you guarantee that you will see each KV that might be affected by the delete marker.
          It's not clear to me how we get around this, unless we introduce a formal notion of levels and know which L+1 files overlap with a file at level L.

          Would love to discuss more at the PowWow on Feb 19th.

          Ted Yu added a comment -

          Sub-region is not a good name either though.

          Other names we can consider: arena, range, realm, section, sector, zone.

          Show
          Ted Yu added a comment - Sub-region is not a good name either though. Other names we can consider: arena, range, realm, section, sector, zone.
          Hide
          Jimmy Xiang added a comment -

          Stripes don't have overlapping keyranges with other stripes, so each stripe is just like a sub-region. L0 files are special and could overlap with multiple stripes.

          Lars Hofhansl added a comment -

          I see, then a major compaction needs to at least include all current L0 files (if any), right?

          Jimmy Xiang added a comment - edited

          That's right. To major compact a stripe, all L0 files, if any, can be compacted and split into stripes, then merge/compact all files belonging to the stripe.
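
          A toy model of that two-step procedure, modeling files as sorted key-value maps and ignoring seqNums and deletes for brevity (all names hypothetical, not the real HBase code):

          ```java
          import java.util.*;
          import java.util.function.ToIntFunction;

          /**
           * Illustrative sketch: to major compact one stripe, route the keys
           * of any L0 files into stripes, keep the target stripe's share, and
           * merge it with all files already belonging to that stripe.
           */
          public class StripeMajorCompactSketch {
              static Map<String, String> majorCompactStripe(
                      List<Map<String, String>> l0Files,
                      List<Map<String, String>> stripeFiles,
                      ToIntFunction<String> stripeOf,
                      int targetStripe) {
                  // step 1: start from the stripe's existing files
                  TreeMap<String, String> merged = new TreeMap<>();
                  for (Map<String, String> f : stripeFiles) merged.putAll(f);
                  // step 2: split L0 by stripe; L0 entries are newer, so they win
                  for (Map<String, String> f : l0Files)
                      for (Map.Entry<String, String> e : f.entrySet())
                          if (stripeOf.applyAsInt(e.getKey()) == targetStripe)
                              merged.put(e.getKey(), e.getValue());
                  return merged; // one merged file for the stripe
              }
          }
          ```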

          Nicolas Spiegelberg added a comment -

          Some thoughts I had about this:

          Overall, I think it's a good idea. Seems like it's not crazy to add and would have multiple benefits. Logical striping across the L1 boundary is a simple solution to both proactively handle splits and reduce compaction times.

          Thoughts on this feature
          1. Fixed configs : in the same way that we got a lot of stability by limiting the regions/server to a fixed number, we might want to similarly limit the number of stripes per region to 10 (or X) instead of "every Y bytes". This will help us understand the benefit we get from striping and it's easy to double the striping and chart the difference.
          2. NameNode pressure : Obviously, a 10x striping factor will cause 10x scaling of the FS. Can we offset this by increasing the HDFS block size, since addBlock dominates at scale? Really, unlike Hadoop, you have all of the HFile or none of it. Missing a portion of the HFile currently invalidates the whole file. You really need 1 HDFS block == 1 HFile. However, we could probably just toy with increasing it by the striping factor right now and seeing if that balances things.
          3. Open Times : I think this will be an issue, specifically on server start. Need to be careful here.
          4. Major compaction : you can perform a major compaction (remove deletes) as long as you have [i,end) contiguous. I don't think you'd need to involve L0 files in an MC at all. Save the complexity. Furthermore, part of the reason why we created the tiered compaction is to prevent small/new files from participating in MC because of cache thrashing, poor minor compactions, and a handful of other reasons.

          So, some thoughts on related pain points we seem to have that tie into this feature:
          1. Reduce Cache thrashing : region moves kill us a lot of the time because we have a cold cache. There is a worry that more aggressive compactions mean more thrashing. I think it will actually even this out better, since right now an MC causes a lot of churn. We should just be thinking about this if perf after the feature isn't what we desire.
          2. Unnecessary IOPS : outside of this algorithm, we should just completely get rid of the requirement to compact after a split. We have the block cache, so given a [start,end) in the file, we can easily tell our mid point for future splits. There's little reason to aggressively churn in this way after splitting.
          3. Poor locality : for grid topology setups, we should eventually make the striping algorithm a little more intelligent about picking our replicas. If all stripes go to the same secondary & tertiary node, then splits have a very restricted set of servers to choose from for datanode locality.

          Jimmy Xiang added a comment -

          I don't think you'd need to involve L0 files in an MC at all.

          Agree. The major compaction is not the traditional one any more.

          We can limit the number of stripes per region. But we don't have to start with that many stripes at the beginning.

          One thing I'd like to mention is that auto-splitting a stripe will be much simpler than auto-splitting a region; auto-merging a couple of stripes will be much simpler than auto-merging a couple of regions too. I'd say we should prefer more stripes over more regions for a given table.

          Sergey Shelukhin added a comment -

          It may be a follow on to this jira, but having "striper" dynamically add stripes at the end of the region would let allow all the stripes before the last one "go cold" which is critical for avoiding hugely wasteful compactions of non-changing data

          Actually, it can be added as part of the main work, HBASE-7679 (file management) code includes such capabilities.
          I wonder how region management works for such a scenario, regardless of compactions. Wouldn't all the load always be on the last region if you have TS keys?
          Or, if you have artificial partitioning but query by TS, wouldn't all queries go to all servers?

          To major compact a stripe, all L0 files, if any, can be split into stripes, then merge all files belonging to the stripe.

          Can you explain more about the delete marker limitation?
          Suppose in current compaction selection, I choose a set of files starting at the oldest file but not including all files.
          Wouldn't that be enough to process delete markers that delete the updates within those files? Granted, I might not process all delete markers, but I don't have to see all files. E.g. if I only have 3 files with one entry for K each, "K=V", "delete K", "K=V2", and I compact the first two, I can remove entries for K from them, right?
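
          That three-file example can be simulated with a small sketch: within a selected set of files, a delete marker with a higher seqNum masks older puts of the same key, and the marker itself is retained unless the selection includes all files. Types and names are hypothetical, not HBase's actual compaction code:

          ```java
          import java.util.*;

          /** Toy model of delete-marker handling within a compacted file set. */
          public class DeleteMaskSketch {
              record Cell(String key, long seqNum, boolean isDelete, String value) {}

              static List<Cell> compact(List<Cell> cells, boolean allFilesIncluded) {
                  // newest first, so deletes are seen before the puts they mask
                  cells.sort(Comparator.comparingLong(Cell::seqNum).reversed());
                  Map<String, Long> deleteSeq = new HashMap<>();
                  List<Cell> out = new ArrayList<>();
                  for (Cell c : cells) {
                      if (c.isDelete()) {
                          deleteSeq.put(c.key(), c.seqNum());
                          // keep the marker unless every file is in the selection
                          if (!allFilesIncluded) out.add(c);
                      } else if (!(deleteSeq.containsKey(c.key())
                                   && deleteSeq.get(c.key()) > c.seqNum())) {
                          out.add(c); // put not masked by a newer delete
                      }
                  }
                  return out;
              }
          }
          ```

          Compacting just the first two files ("K=V", "delete K") drops the put; the marker survives because files outside the selection may still contain older puts for K.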

          1. Fixed configs : in the same way that we got a lot of stability by limiting the regions/server to a fixed number, we might want to similarly limit the number of stripes per region to 10 (or X) instead of "every Y bytes". This will help us understand the benefit we get from striping and it's easy to double the striping and chart the difference.

          That is the original idea.

          Thanks for other comments

          Show
          Sergey Shelukhin added a comment - It may be a follow on to this jira, but having "striper" dynamically add stripes at the end of the region would let allow all the stripes before the last one "go cold" which is critical for avoiding hugely wasteful compactions of non-changing data Actually, it can be added as part of the main work, HBASE-7679 (file management) code includes such capabilities. I wonder how, no matter the compactions, does region management work for such scenario. Wouldn't all the load always be on last region if you have TS keys? Or, if you have artificial partitioning but query by TS, wouldn't all queries go to all servers? To major compact a stripe, all L0 files, if any, can be split into stripes, then merge all files belonging to the stripe. Can you explain more about the delete marker limitation? Suppose in current compaction selection, I choose a set of files starting at the oldest file but not including all files. Wouldn't that be enough to process delete markers that delete the updates within those files? Granted, I might not process all delete markers, but I don't have to see all files. E.g. if I only have 3 files with one entry for K each, "K=V", "delete K", "K=V2", and I compact the first two, I can remove entries for K from them, right? 1. Fixed configs : in the same way that we got a lot of stability by limiting the regions/server to a fixed number, we might want to similarly limit the number of stripes per region to 10 (or X) instead of "every Y bytes". This will help us understand the benefit we get from striping and it's easy to double the striping and chart the difference. That is the original idea. Thanks for other comments
          Hide
          stack added a comment -

          In a striped region, if stripes >= 2 and the key distribution is basically even, a split could be done w/o references, half-hfiles, and rewriting from parent to daughter; a split could just rename the parent files into the daughter regions. It could make for split simplification and possibly some i/o savings.

          There is a load of great stuff in this issue. Best read I've had in a long time.

          Matt Corgan added a comment -

          Open times: I think this will be an issue, specifically on server start. We need to be careful here.

          Hopefully this could be mitigated by making regions larger, like doubling region size and setting max 2 stripes/region. Theoretically we should be able to have the same overall number of files as normal regions, or are there other factors at play?

          Wouldn't all the load always be on last region if you have TS keys? Or, if you have artificial partitioning but query by TS, wouldn't all queries go to all servers?

          An easy strategy for P partitions is to

          • prepend a single byte to each key where prefix=hash(row)%P
          • pre-split the table into P regions
          • tweak the balancer to evenly spread the tail partitions for each region
          • writes get sprayed evenly to all tail partitions
          • a single Get query will only hit one region since you know hash(row)%P beforehand
          • you scan all P partitions using a P-way collating iterator
            • so yes, scans go to all servers but presumably they are huge and would hit lots of data anyway
            • because they are huge, a client that scans the partitions concurrently will be faster
          • a big multi-Get will spray to the exact servers necessary, possibly all of them, but like scans may be faster because done in parallel

          I'm not sure what most people are doing with time series data, but this seems like a good approach to me. You basically just choose an arbitrarily large P. An MD5 prefix is essentially P=2^128 (I wouldn't recommend pre-splitting at that granularity).
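The salting scheme above can be sketched in code. This is a toy illustration, not HBase API: `SaltedScan`, `partitionOf`, and `collatingScan` are hypothetical names, and real partition scans would be region scanners rather than in-memory iterators.

```java
import java.util.*;

// Sketch of the salted-key strategy: each row hashes to one of P
// partitions via a prefix byte, and a full scan merges the P sorted
// per-partition scans with a P-way collating iterator.
public class SaltedScan {
    static final int P = 4; // number of partitions (chosen arbitrarily)

    // Prefix value that routes a row to its partition: hash(row) % P.
    static int partitionOf(String row) {
        return Math.floorMod(row.hashCode(), P);
    }

    // Spray rows into P sorted partitions, as writes would be.
    static List<Iterator<String>> partition(Collection<String> rows) {
        List<TreeSet<String>> parts = new ArrayList<>();
        for (int i = 0; i < P; i++) parts.add(new TreeSet<>());
        for (String r : rows) parts.get(partitionOf(r)).add(r);
        List<Iterator<String>> its = new ArrayList<>();
        for (TreeSet<String> p : parts) its.add(p.iterator());
        return its;
    }

    // P-way collating iterator: repeatedly emit the smallest head
    // among the P sorted partition scans, yielding a globally sorted
    // stream of the original (unsalted) keys.
    static List<String> collatingScan(List<Iterator<String>> scans) {
        PriorityQueue<Object[]> heap =
            new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
        for (Iterator<String> it : scans) {
            if (it.hasNext()) heap.add(new Object[] { it.next(), it });
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] head = heap.poll();
            out.add((String) head[0]);
            @SuppressWarnings("unchecked")
            Iterator<String> it = (Iterator<String>) head[1];
            if (it.hasNext()) heap.add(new Object[] { it.next(), it });
        }
        return out;
    }
}
```

A concurrent client would drive the P iterators in parallel, which is where the speedup Matt mentions comes from.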

          Sergey Shelukhin added a comment -

          a split could just rename the parent files into the daughter regions

          I was told this is impossible due to snapshots relying on files not moving between regions (or on references during splits?).
          We just discussed this here; some improvements for splits should definitely be possible.

          so yes, scans go to all servers but presumably they are huge and would hit lots of data anyway

          Yeah, I meant the scans, was assuming scans for TS data mostly. I see.

          Jimmy Xiang added a comment -

          Wouldn't that be enough to process delete markers that delete the updates within those files?

          I think so, as long as the files not included are newer (like in L0).
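Sergey's three-file example from earlier ("K=V", "delete K", "K=V2", compacting the first two) can be modeled in a toy sketch. Everything here is simplified and hypothetical, not HBase classes: `Cell` collapses versions into a row plus seqNum, and delete handling is reduced to "a marker masks older puts for the same row". The point it illustrates is Jimmy's: when the selection starts at the oldest files, masked puts can be removed and the marker itself can be dropped, because no excluded file can hold data older than the selection.

```java
import java.util.*;

// Toy model of delete-marker processing in a partial compaction whose
// selection includes the oldest files.
public class PartialCompaction {
    record Cell(String row, long seq, boolean delete) {}

    static List<Cell> compact(List<Cell> selected) {
        // newest delete seqNum per row within the selection
        Map<String, Long> newestDelete = new HashMap<>();
        for (Cell c : selected) {
            if (c.delete()) newestDelete.merge(c.row(), c.seq(), Math::max);
        }
        List<Cell> out = new ArrayList<>();
        for (Cell c : selected) {
            long del = newestDelete.getOrDefault(c.row(), Long.MIN_VALUE);
            if (c.delete()) continue;    // marker: droppable, nothing older exists outside
            if (c.seq() < del) continue; // put masked by a newer delete in the selection
            out.add(c);                  // survives, e.g. K=V2 written after the delete
        }
        return out;
    }
}
```

Compacting just {K=V seq 1, delete K seq 2} yields nothing, while the newer K=V2 in an excluded file is untouched, matching the example in the description.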

          Ted Yu added a comment -

          I think we need to find the best way of storing stripe boundaries.

          Currently Sergey puts the boundaries in the metadata of store files. This is not flexible. Once this feature is released, users will request ways to configure stripes differently.

          Sergey Shelukhin added a comment -

          The stripe boundaries are supposed to be an internal detail not visible to the user; the user only configures the scheme (# of stripes, or size-based stripes as described for time series data).
          What kind of stripe configuration do you have in mind?
          After talking about splits yesterday I am planning to make it a little more flexible for splits, but still internal.

          stack added a comment -

          Currently Sergey puts the boundaries in metadata of store files.

          What is wrong w/ this? Is it the bounds of the keys the file contains? (Is this not there already, as the first and last key?)

          What are the thoughts on figuring region stripe boundaries? Will it be done by looking at the memstore content just before flush and dividing it into N files? Or will it be done after the first flush, on subsequent compactions, by looking at the content of L0?

          Sergey Shelukhin added a comment -

          In the "N stripes" scheme, boundaries will be determined at the first L0 compaction. I intend to add a config parameter to make the first L0 compaction wait for more files, to get better boundaries. Then the stripes can be rebalanced, but the hope is that that happens infrequently.
          In the "growing key range" scheme (for time series above) stripe boundaries don't matter much, and are basically determined based on size. If nothing intervenes, by EOW or next week I hope to have preliminary compaction policy/compactor patch.
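A minimal sketch of how boundaries could be determined at the first L0 compaction, assuming a simple quantile approach: merge and sort all observed L0 keys, then place the N-1 boundaries at evenly spaced positions. The class and method names are hypothetical, not the actual patch, which may weight by size rather than key count.

```java
import java.util.*;

// Hypothetical boundary selection for the "N stripes" scheme: pick
// stripes-1 boundaries at evenly spaced quantiles of the merged,
// sorted L0 keys, so each stripe starts out with roughly equal data.
public class StripeBoundaries {
    static List<String> pickBoundaries(List<String> mergedL0Keys, int stripes) {
        List<String> keys = new ArrayList<>(mergedL0Keys);
        Collections.sort(keys);
        List<String> bounds = new ArrayList<>();
        for (int i = 1; i < stripes; i++) {
            // boundary i sits at the i/stripes quantile of observed keys
            bounds.add(keys.get(i * keys.size() / stripes));
        }
        return bounds;
    }
}
```

Waiting for more L0 files before this first compaction, as described above, simply gives this sampling more keys to quantile over, hence better boundaries.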

          Ted Yu added a comment -

          It is the bounds of the keys the file contains? (Is this not there already, the first and last key?)

          Currently HFile provides the following methods:

              byte[] getFirstRowKey();
              byte[] getLastRowKey();

          Stripe boundaries should be different from the above.

          stack added a comment -

          Ted Yu You raise a nebulous 'concern' to shoot down a suggested implementation, and then, when asked why, you do not answer the question asked.

          Sergey Shelukhin Would it be easier looking at the memstore than at the content of L0? We would have to do appropriate weighting... in that there will be hot spots in the keyspace, and these should get more stripes. Rejiggering the stripes, joining up the infrequently used ones and allotting more stripes to the hot keyspace area, sounds right. Good stuff.

          Maybe it's worth writing up a doc at this stage. This issue is full of goodness. A doc could distill it out and make it easier to digest and think about.

          Sergey Shelukhin added a comment -

          Discussed with Ted; what was meant was configuring it like region splitting, which is not planned at this point.
          Doc - probably needed... I will get to it eventually.

          Ted Yu added a comment -

          Here is my understanding of the difference between the existing store file metadata and the new stripe boundary metadata.
          Existing store file metadata (first row key, last row key, etc.) is intrinsic to the store file, i.e. its meaning doesn't change when the store file gets copied to another cluster.
          However, the stripe boundary doesn't seem to carry the same characteristic.
          Let's consider table A1 in cluster C1 and table A2 in cluster C2. They have the same schema and region boundaries. When store file F1 is copied from C1 to C2, the user may switch to a different striping strategy; the reason could be that clusters C1 and C2 serve different patterns of load.
          Embedding the stripe boundary in the store file may potentially cause confusion in cluster C2.

          Another factor is that StoreFileManager needs to scan all store files to establish / validate stripe boundaries when region opens.

          Please correct me if I am wrong.

          Sergey Shelukhin added a comment -

          If the user copies the files, a few things may happen. If he uses non-stripe compaction, the files will just get compacted into fewer files eventually. If all files are copied, stripe compactions are used, and the stripe config is different, the SFM will load the old stripes and eventually re-stripe the data to conform to the new config, if necessary; that will depend on the compaction policy implementation. If only part of the files are copied (whatever sense that makes), either the above will still happen, or the SFM will give up on the metadata and put all files into L0, re-striping them according to the new config after some time.

          The SFM needs metadata from all files, but as far as I can see from HStore, that is already loaded, because other code makes use of the metadata without special considerations. One thing is that with more files there will be more to open...
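The open-time fallback described above can be sketched as follows. `StripeFile`, its fields, and `assign` are hypothetical stand-ins, not the real StoreFileManager: a file whose recorded range lines up with a configured stripe is kept there, and anything else (bulk loads, files copied in with foreign stripe metadata) is dumped into L0 for later re-striping.

```java
import java.util.*;

// Hypothetical open-time assignment: boundaries define stripes
// [b0,b1), [b1,b2), ...; a file goes to the stripe whose boundaries
// exactly match its recorded range, or to L0 (-1) otherwise.
public class StripeAssignment {
    record StripeFile(String name, String start, String end) {}

    static int assign(StripeFile f, List<String> boundaries) {
        for (int i = 0; i + 1 < boundaries.size(); i++) {
            if (boundaries.get(i).equals(f.start())
                && boundaries.get(i + 1).equals(f.end())) {
                return i;
            }
        }
        return -1; // metadata doesn't line up with any stripe: goes to L0
    }
}
```

This keeps cross-cluster copies safe in the way Sergey describes: mismatched metadata degrades to L0 behavior instead of corrupting stripe invariants.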

          Matt Corgan added a comment -

          Sergey - I'm curious, what's the reasoning behind flushing the memstore to a single L0 file rather than splitting the memstore into the stripes during each flush? To keep flushes faster, fewer files, etc.?

          Sergey Shelukhin added a comment -

          My reasoning was: too many really tiny files, plus scope creep into the memstore.

          stack added a comment -

          Would it simplify the filesystem implementation if you did the split in memory (caveat the scope creep up into memstore) so no special L0 tier? Regards files being too small, that is a different issue (e.g. a Toddea – Todd+idea – that I tripped over recently in an issue here is that rather than flush immediately, that we'd instead do a purge and compaction in memory before flushing to ensure the content large enough to make a file... but yeah, that'd be something else).

          Matt Corgan added a comment -

          Gotcha. Agree about limiting scope. If the special L0 tier turns out to be more difficult to implement than originally intended for whatever reason, might be worth evaluating splitting during flush. Seems like the same number of files might get created anyway when you split the L0 file? Or do you plan on doing some "logical striping across the L1 boundary" as Nicolas says above where the L0 files are never truly split?

          Like Stack mentions, longer term I think we'll need to split memstore while in use, and those splits should probably have some alignment with these stripe boundaries. For another day...

          Sergey Shelukhin added a comment -

          L0 actually should be relatively easy to implement; it's the special cases around all the possible stripe boundaries that cause all the complexity.
          L0 files will hopefully not be split individually, but together, as soon as we reach some number of files, similarly to the LevelDB algorithm.
          But yeah, with an insta-stripe solution we could get rid of L0 and also get fewer files for reads. Could be a future improvement.

          stack added a comment -

          I'd like to get rid of L0 so can do splitting w/o resorting to file References and half hfiles (smile).... but yes, could be future as you say Sergey.

          Jimmy Xiang added a comment -

          Getting rid of L0 means memstore flushing will take longer and hold updates/reads longer?

          stack added a comment -

          Getting rid of L0 means memstore flushing will take longer and hold update/read longer?

          Yes, but on the flip side: fewer compactions (no need to compact on region open, since there are no half files/references) and, for many workloads, fewer compactions overall, because only a subset of files would otherwise be picked up from the L0 tier.

          Can do work too to minimize how much we hold up flushes. Could be dumb and, on flush, inline, examine the key spread so we can make a ruling on where to insert boundaries (would figuring boundaries be for the first flush only? Or I suppose boundary making would be ongoing over the life of a region... four boundaries per region seems like a nice number to work with....)... so this would be a scan of the 64MB of memstore keys.... Or we could keep a running tally as we insert into the memstore so that at flush time we knew where the boundaries were... could do stuff like // request to NN to open the files to flush too.
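The "running tally" idea above can be sketched as a streaming sample: keep every Nth inserted key, and at flush time read boundary candidates off the sample instead of scanning the whole memstore. Entirely hypothetical, `FlushTally` and its methods are not real classes, and the real memstore is a concurrent skip list rather than this toy.

```java
import java.util.*;

// Hypothetical running tally kept alongside memstore inserts, so
// flush-time boundaries cost a walk over a small sample instead of a
// scan of all 64MB of keys.
public class FlushTally {
    private final TreeSet<String> sample = new TreeSet<>();
    private final int every;
    private long inserts = 0;

    FlushTally(int every) { this.every = every; }

    // Called on each memstore insert; keeps every Nth row key.
    void onInsert(String row) {
        if (inserts++ % every == 0) sample.add(row);
    }

    // At flush: place stripes-1 boundaries at quantiles of the sample.
    List<String> boundaries(int stripes) {
        List<String> s = new ArrayList<>(sample);
        List<String> out = new ArrayList<>();
        for (int i = 1; i < stripes; i++) {
            out.add(s.get(i * s.size() / stripes));
        }
        return out;
    }
}
```

The sampling rate trades accuracy against per-insert overhead; hot keyspace areas naturally contribute more samples, which gives the weighting toward hot spots mentioned earlier.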

          In general, we want to add smarts around flushing (stripes, in-memory compacting, etc.) so eventually we are going to have this friction on flush (IMO).

          Sergey Shelukhin added a comment -

          Ok, here's the proper functional-spec-like document. The future improvements section is incomplete...
          Should make CRs easier.

          Ted Yu added a comment -

          Nice document, Sergey.

          This can obviously be improved since most (or all) stripes, see future improvements.

          The first half of the sentence seems to be incomplete.

          Before starting a new writer, compactor ensures that all the KVs for the last row in the previous writer go to the previous writer.

          I think I understand what you mean - such KVs would be written by the previous writer.

          Sergey Shelukhin added a comment -

          Hmm... I am starting to think that we might want to consider getting rid of L0 after all. Somehow it escaped me that L0 gives you x2 write amplification right there, as all data has to be re-striped. Small striped files actually wouldn't increase the number of files for gets/non-overlapping scans, because each L0 file already counts against each stripe.

          M. C. Srivas added a comment -

          There is of course one major caveat with this approach. If data insertion is uniformly spread (ie, key is uniform random), this proposal performs much worse than the existing scheme.

          Sergey Shelukhin added a comment -

          There are two approaches discussed: one similar to having many small regions, and one for sequential data. Which one do you mean? I am testing the first one with uniformly distributed keys now, and it's somewhat slower than the default case on average (on writes mostly) but has no big compaction-associated latency spikes... I suspect that if there were not so much compaction due to L0, the write slowness could also be alleviated.
          I haven't tested the 2nd one yet (requires a more custom test, next week), but it's very specialized, for sequential data, so yes, it is not good for the common case.

          stack added a comment -

          There is of course one major caveat with this approach. If data insertion is uniformly spread (ie, key is uniform random), this proposal performs much worse than the existing scheme.

          M. C. Srivas Why? Won't it do same total i/o?

          stack added a comment -
          On attached doc, it is lovely.
          
          Missing author, date, and JIRA pointer?
          
          An interesting comment by LarsH recently was that maybe we should ship w/ major compactions off; most folks don't delete
          
          Missing is at least a pointer to how it currently works (could just point at the src file, I'd say, with its description of 'sigma' compactions) and a sentence on what's wrong w/ it,
          or the problems it leads to when left to run amok (you say it for major compactions, but even w/o major compactions enabled, an i/o tsunami can hit and wipe us out)
          
          What does this mean: "and old boundaries rarely, if ever, moving."?  Give doc an edit?
          
          I think you need to say stripe == sub-range of the region key range.  You almost do.  Just do it explicitly.
          
          I see your extra justification for l0, the need to be able to bulk load.  It is kinda important that we continue to support that.  Good one.
          
          Later I suppose we could have a combination of count-based and size-based.... if an edge stripe is N times bigger than any other, add a new stripe?
          
          I was wondering if you could make use of liang xie's bit of code for making keys for the block cache where he chooses a byte sequence that falls between
          the last key in the former block and the first in the next block but the key is shorter than either..... but it doesn't make sense here I believe;
          your boundaries have to be hard actual keys given inserts are always coming in.... so nevermind this suggestion.
          
          You write the stripe info to the storefile.  I suppose it is up to the hosting region whether or not it chooses to respect those boundaries.  It
          could ignore them and just respect the seqnum and we'd have the old-style storefile handling, right?  (Oh, I see you allow for this -- good)
          
          Say in doc that you mean storefile metadata else it is ambiguous.
          
          Thinking on L0 again, as has been discussed, we could have flushes skip L0 and flush instead to stripes (one flush turns into N files, one per stripe)
          but even if we had this optimization, it looks like we'd still want the L0 option if only for bulk loaded files or for files whose metadata makes
          no sense to the current region context.
          
          "• The	aggregate	range	of	files	going	in	must	be	contiguous..." Not sure I follow.  Hmm... could do with ".... going into a compaction"
          
          "If	the	stripe	boundaries	are	changed	by	compaction,	the	entire	stripes	with	old	boundaries	must	be	replaced" ...What would bring this on?
          And then how would old boundaries get redone?  This one is a bit confusing.
          
          Get key before is a PITA
          
          Not sure I follow here: "This compaction is performed when the number of L0 files exceeds some threshold and produces the number of files equivalent to the number of stripes, with enforced existing boundaries."
          
          I was going to suggest an optimization for later for the case that an L0 fits fully inside a stripe, I was thinking you could just 'move' it into
          its respective stripe... but I suppose you can't do that because you need to write the metadata to put a file into a stripe...
          
          Would it help naming files for the stripe they belong to?  In other words, do NOT write stripe data to the storefiles and just
          let the region in memory figure out which stripe a file belongs to.  When we write, we write with, say, an L0 suffix.  When compacting we add S1, S2,
          etc. suffixes for stripe1, etc.  To figure out what the boundaries of an S0 are, it'd be something the region knew.  On open of the store files, it could
          use the start and end keys that are currently in the file metadata to figure out which stripe they fit in.
          
          Would be a bit looser.  Would allow moving a file between stripes with a rename only.
          
          The delete dropping section looks right.  I like the major compaction along a stripe only option.
          
          "For	empty	 ranges,	empty	files	are	created."  Is this necessary?  Would be good to avoid doing this.
          
          
          
          Matt Corgan added a comment -

          If data insertion is uniformly spread (ie, key is uniform random), this proposal performs much worse than the existing scheme.

          I think the goal for uniformly random keys is to have the same amount of total work done but to stagger that work. Instead of doing 1 big 24 GB compaction per day, it could do a 1 GB compaction each hour.

          The savings/efficiency become more pronounced with less random keys, with the biggest savings for sequential keys.
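A back-of-envelope sketch of that staggering arithmetic (the 24 GB/day figure is from the example above; everything else is an illustrative assumption, not anything measured):

```java
// Toy arithmetic for the staggering argument: striping does not change the total
// daily compaction volume, but it shrinks the largest single burst roughly by the
// number of stripes. All figures are illustrative assumptions.
class CompactionStagger {
    // bytes written by the single biggest compaction, assuming even striping
    static long peakBurstBytes(long dailyCompactionBytes, int stripes) {
        return dailyCompactionBytes / stripes;
    }

    // total bytes written per day is unchanged by striping
    static long dailyBytes(long dailyCompactionBytes, int stripes) {
        return peakBurstBytes(dailyCompactionBytes, stripes) * stripes;
    }
}
```

With 24 stripes and 24 GB/day, the peak burst drops from one 24 GB compaction to a 1 GB compaction per hour, while the daily total stays the same.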

          M. C. Srivas added a comment -

          @mcorgan and @stack:

          The total i/o in terms of i/o bandwidth consumed is the same. But the disk iops are much, much worse. And disk iops are at a premium, and "bg activity" like compactions should consume as few as possible.

          Let's say we split a region into a 100 sub-regions, such that each sub-region is in the few 10's of MB. If the data is written uniformly randomly, each sub-region will write out a store at approx the same time. That is, a RS will write 100x more files into HDFS (100x more random i/o on the local file-system). Next, all sub-regions will do a compaction at almost the same time, which is again 100x more read iops to read the old stores for merging.

          One can try to stagger the compactions to avoid the sudden burst by incorporating, say, a queue of to-be-compacted-subregions. But while the sub-regions at the head of the queue will compact "in time", the ones at the end of the queue will have many more store files to merge, and will use much more than their "fair-share" of iops (not to mention that the read-amplification in these sub-regions will be higher too). The iops profile will be worse than just 100x.
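Srivas's queue concern can be made concrete with a toy model (all names and parameters below are illustrative assumptions, not anything from HBase):

```java
// Toy model of the compaction-queue concern: N sub-regions become compactable at
// once and a single thread drains them in order, while flushes keep adding files
// spread evenly across sub-regions. Sub-regions near the tail of the queue
// accumulate more files before their turn, so they merge more than their share.
class CompactionQueueModel {
    // files merged by the sub-region at 'position' in the queue when it
    // finally reaches the head
    static int filesMerged(int position, int filesAtStart,
                           int flushesPerCompaction, int subRegions) {
        // 'position' compactions run first; in that time,
        // position * flushesPerCompaction flushes land across all sub-regions
        return filesAtStart + (position * flushesPerCompaction) / subRegions;
    }
}
```

The head of the queue merges filesAtStart files; the tail merges strictly more, which is the "worse than fair-share" effect described above.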

          stack added a comment -

          M. C. Srivas

          But the disk iops are much, much worse.

          See Sergey's writeup. We flush the same as we always did, writing a single file to the L0 tier. It is later, at compaction time – i.e. NOT random i/o – that we'd write a file per "sub-region/stripe". If the write is evenly distributed, we'd do the same overall i/o, except with stripe compacting it would be done in smaller bite-size pieces (Sergey Shelukhin Would the compaction of stripes run in parallel? Hopefully, for the case Srivas describes, we'd progress serially through the stripes/sub-regions, or at least it would be an option; and then later, ergonomically, we'd recognize the even-loading case and add compaction accordingly)

          You have a point that we will be making more files in the fs.

          Matt Corgan added a comment -

          a RS will write 100x more files into HDFS (100x more random i/o on the local file-system)

          I think this is a point of confusion. A typical HBase file could be 4MB to 40GB, where those files are a series of 4KB (very small) underlying disk blocks. Ignoring complexities of multiple tasks running simultaneously on the regionserver, only the first 4KB block of each file is a random write, while the following blocks are sequential writes. The few extra random writes should be lost in the noise of all the other random IO requests happening.

          Sergey Shelukhin added a comment -

          The first half of the sentence seems to be incomplete.

          I think I understand what you mean - such KVs would be written by previous writer.

          Missing author, date, and JIRA pointer?

          I think you need to say stripe == sub-range of the region key range. You almost do. Just do it explicitly.

          What does this mean "and old boundaries rarely, if ever, moving."? Give doc an edit?

          Say in doc that you mean storefile metadata else it is ambiguous.

          Not sure I follow here: "This compaction is performed when the number of L0 files exceeds some threshold and produces the number of files equivalent to the number of stripes, with enforced existing boundaries."

          Fixed these.

          An interesting comment by LarsH recently was that maybe we should ship w /major compactions off; most folks don't delete

          Hmm... in general I agree but we'll have to insert really good warnings everywhere. Can we detect if they delete?

          Missing is one a pointer at least to how it currently works (could just point at src file I'd say with its description of 'sigma' compactions) and a sentence on whats wrong w/ it

          Later I suppose we could have a combination of count-based and size-based.... if an edge stripe is N time bigger than any other, add a new stripe?

          Yeah, it's mentioned in code comment somewhere.

          I was wondering if you could make use of liang xie's bit of code for making keys for the block cache where he chooses a byte sequence that falls between the last key in the former block and the first in the next block but the key is shorter than either..... but it doesn't make sense here I believe; your boundaries have to be hard actual keys given inserts are always coming in.... so nevermind this suggestion.

          For boundary determination it does make sense; can you point at the code? After cursory look I cannot find it.

          You write the stripe info to the storefile. I suppose it is up to the hosting region whether or not it chooses to respect those boundaries. It could ignore them and just respect the seqnum and we'd have the old-style storefile handling, right? (Oh, I see you allow for this – good)

          Yes.

          Thinking on L0 again, as has been discussed, we could have flushes skip L0 and flush instead to stripes (one flush turns into N files, one per stripe) but even if we had this optimization, it looks like we'd still want the L0 option if only for bulk loaded files or for files whose metadata makes no sense to the current region context. "• The aggregate range of files going in must be contiguous..." Not sure I follow. Hmm... could do with ".... going into a compaction"

          Yes, that was my thinking too.

          "If the stripe boundaries are changed by compaction, the entire stripes with old boundaries must be replaced" ...What would bring this on? And then how would old boundaries get redone? This one is a bit confusing.

          Clarified. Basically one cannot have 3 files in (-inf, 3) and 3 in [3, inf), then take 3 and 2 respectively, and rewrite them with boundary 4, because then there will be a file with [3, inf) remaining that overlaps.
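This clarification amounts to an invariant: striped files must tile the key space with no gaps or overlaps, so a boundary change requires replacing every file of the affected stripes. A minimal sketch of that check, using integer keys as a stand-in for HBase's byte[] row keys (all names hypothetical):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical integer-key model of the invariant described above: the stripe
// ranges of a store's files must tile the key space, with no gaps, overlaps, or
// empty ranges. Real HBase row keys are byte[]; integers keep the sketch short.
class StripeInvariant {
    // a stripe covers [start, end); MIN_VALUE/MAX_VALUE stand in for -inf/+inf
    record Range(int start, int end) {}

    static boolean tilesKeySpace(List<Range> stripes) {
        List<Range> sorted = new ArrayList<>(stripes);
        sorted.sort(Comparator.comparingInt(Range::start));
        int expectedStart = Integer.MIN_VALUE;
        for (Range r : sorted) {
            if (r.start() != expectedStart || r.end() <= r.start()) {
                return false; // gap, overlap, or empty range
            }
            expectedStart = r.end();
        }
        return expectedStart == Integer.MAX_VALUE; // key space fully covered
    }
}
```

With boundaries (-inf, 3) and [3, inf), rewriting only some files with the new boundary 4 while an old [3, inf) file remains fails this check; replacing both stripes wholesale with (-inf, 4) and [4, inf) passes.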

          I was going to suggest an optimization for later for the case that an L0 fits fully inside a stripe, I was thinking you could just 'move' it into its respective stripe... but I suppose you can't do that because you need to write the metadata to put a file into a stripe...

          Yeah. Also wouldn't expect it to be a common case.

          Would it help naming files for the stripe they belong too? Would that help? In other words do NOT write stripe data to the storefiles and just let the region in memory figure which stripe a file belongs too. When we write, we write with say a L0 suffix. When compacting we add S1, S2, etc suffix for stripe1, etc. To figure what the boundaries of an S0 are, it'd be something the region knew. On open of the store files, it could use the start and end keys that are currently in the file metadata to figure which stripe they fit in.

          Would be a bit looser. Would allow moving a file between stripes with a rename only. The delete dropping section looks right. I like the major compaction along a stripe only option.

          This could be done as a future improvement. The implications of a change of naming scheme for other parts of the system would need to be determined.
          Also, for all I know it might break snapshots (moving files does). And code to figure out stripes on the fly would be more complex.
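For reference, the naming-scheme alternative boils down to deriving stripe membership from the first/last keys already present in file metadata rather than from explicit stripe metadata. A hedged sketch with integer keys (the method and constants are hypothetical; real keys are byte[]):

```java
// Sketch of inferring a file's stripe from its first/last keys instead of
// explicit stripe metadata, per the suggestion above. Integer keys are an
// illustrative stand-in for HBase's byte[] row keys.
class StripeLookup {
    static final int L0 = -1;

    // boundaries[i] is the inclusive start key of stripe i+1; stripe 0 starts
    // at -inf. A file belongs to a stripe only if it fits entirely inside that
    // stripe's range; otherwise it must stay in L0.
    static int stripeFor(int firstKey, int lastKey, int[] boundaries) {
        int stripe = 0;
        while (stripe < boundaries.length && firstKey >= boundaries[stripe]) {
            stripe++;
        }
        int stripeEnd = (stripe < boundaries.length)
            ? boundaries[stripe] : Integer.MAX_VALUE;
        return (lastKey < stripeEnd) ? stripe : L0;
    }
}
```

A file that straddles a boundary cannot be assigned to any stripe and would have to stay in L0, which lines up with the bulk-load justification for keeping L0 around.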

          "For empty ranges, empty files are created." Is this necessary? Would be good to avoid doing this.

          Let me think about this...

          The total i/o in terms of i/o bandwidth consumed is the same. But the disk iops are much, much worse. And disk iops are at a premium, and "bg activity" like compactions should consume as few as possible.

          Let's say we split a region into a 100 sub-regions, such that each sub-region is in the few 10's of MB. If the data is written uniformly randomly, each sub-region will write out a store at approx the same time. That is, a RS will write 100x more files into HDFS (100x more random i/o on the local file-system). Next, all sub-regions will do a compaction at almost the same time, which is again 100x more read iops to read the old stores for merging.

          Memstore for region is preserved as unified... it may be written out to multiple files indeed in future.

          One can try to stagger the compactions to avoid the sudden burst by incorporating, say, a queue of to-be-compacted-subregions. But while the sub-regions at the head of the queue will compact "in time", the ones at the end of the queue will have many more store files to merge, and will use much more than their "fair-share" of iops (not to mention that the read-amplification in these sub-regions will be higher too). The iops profile will be worse than just 100x.

          In the current implementation the region is limited to one compaction at a time, mostly for simplicity's sake. Yes, if all stripes compact at the same time for the uniform scheme, all improvement will disappear; this will have to be controlled if the ability to do so is added.

          You have a point that we will be making more files in the fs.

          Yeah, that is inevitable.
          I hear from someone from Accumulo that they have tons of files open without any problems... it may make sense to investigate whether we'd have problems.

          stack added a comment -

          " HBASE-7845 optimize hfile index key" is the key/"boundary determination" work I was referring to (I don't think it applies here but adding the reference since you asked for it)

          Sergey Shelukhin added a comment -

          updated doc

          Lars Hofhansl added a comment -

          An interesting comment by LarsH recently was that maybe we should ship w /major compactions off; most folks don't delete

          Hmm... I don't doubt that I said this, but I'm not sure that I agree. Many people do delete, and just not removing the delete markers would be unexpected.

          Sergey Shelukhin added a comment -

          perf doc... size test is not finished yet.

          Andrew Purtell added a comment -

          I'd recommend retesting with c1.xlarge instance types, this will get you a lot closer to real hardware IMHO. The IO capability of the c1.xlarge is "high" vs. only "moderate" for m1.large and the c1.xlarge has 8 vcores as opposed to 2 only for the m1.large. The c1.xlarge will have 4 locally attached instance-store volumes while the m1.large has only 2 IIRC. Also, I didn't see it mentioned in the perf doc but you should use only the locally attached instance store volumes as datanode storage volumes to avoid variance introduced by EBS.

          Sergey Shelukhin added a comment -

          I will update the doc, although in this case m1.large's moderate IO capacity works even better for the test, making it easier to simulate an IO-constrained cluster with such a small number of nodes/time frame.

          Sergey Shelukhin added a comment -

          Updating both docs. Size-based logic test result, as well as design improvement based on that.

          Sergey Shelukhin added a comment -

          Btw, the 3 next child JIRAs are ready for review. Please feel free to +1 them; I will only commit all 3 together and with the integration test included.

          Sergey Shelukhin added a comment -

          I did a c1.xlarge test (with default, 3, 10 and 25 stripes, 2 times each). The results for the different stripe configurations are very consistent across both runs.
          Compared to the m1.large test, the positive effect of increasing the number of stripes on write speed is smaller.

          For this load, the sweet spot appears to be around 10-12 stripes based on two tests. 3 stripes have large compactions similar to default (well, not as large); 25 stripes do too many small compactions, so the select-compact loop cannot keep up with the number of files produced - on the "Iteration 2" test described in the doc, at least some stripes in the 25-stripe case always have 6-8 small files (as they get compacted, other stripes get more files). This appears to be the limiting factor on increasing the number of stripes.
          I think the main point is that, for the count scheme, there's perf parity (writes are generally slightly slower, reads slightly faster), despite existing and fixable write amplification; and there's a reduction of variability, which was the goal. I will try to devise a more realistic read workload, but I don't think it should change much given the above.
          For sequential data, with size-based stripe scheme there's reduction in compactions, as expected, despite even L0.

          Next steps:
          1) On the existing data I want to correlate read/write perf with compactions. It is interesting that the stripe scheme has slower writes in general, as Jimmy has noted - it touches the read path but nothing at all on the write path, so it is probably I/O related, or stresses some interaction between the existing write and compaction paths.
          2) Run tests for more realistic read workloads (and parallel read/writes), by not using LoadTestTool? Optional-ish.
          3) Clean up integration test patch in HBASE-8000.
          4) Review and commit?

          5) Get rid of L0?

          Ted Yu added a comment -

          5) Get rid of L0?

          Can we do this first?

          Matt Corgan added a comment -

          Sergey - I'm curious how compactions of the stripes are being scheduled/queued. Does a region still make a single region-wide compaction request, and the compactor picks a single stripe? Or can multiple stripes be in the compaction queue at once?

          Given that regions could be allowed to grow much larger with stripe compaction enabled it would probably be good to allow multiple stripes to compact in parallel. Just a thought for another next step... you've probably considered it already.

          Sergey Shelukhin added a comment -

          Currently only one compaction per store is allowed. The need to compact several stripes in parallel can probably be alleviated by just having fewer stripes? It is possible to add as a future improvement.

          Sergey Shelukhin added a comment -

          Actually, judging by the logs, what can be done is triggering the compaction thread whenever a store can compact. In the 25-stripe case I see unnecessary gaps between compactions, since compaction only triggers on flush despite plenty of tiny stripe compactions being possible.

          Sergey Shelukhin added a comment -

          Updating the perf evaluation; I think I'm done with that for now. Looking for CRs.
          I will not have time the next few days, but I will get to the noted optimizations (L0) after that.

          Sergey Shelukhin added a comment -

          First draft of the user-level doc. After trying to describe the size-based scheme, I think it should be improved; I will do that. Meanwhile there's a design doc and a user doc, so I'd like to get some reviews.
          I will rebase and update all patches between now and Monday. stack Matteo Bertozzi what do you guys think?

          Sergey Shelukhin added a comment -

          Updating design and user doc for latest changes. Now all the changes for the first cut are definitely in

          Raymond added a comment -

          Great: more regions lead to better load balance and a good compaction effect, while fewer regions lead to easy management and fast failover, so I think stripes (or sub-regions) are a good trade-off.
          And I think stripe compaction is similar to level compaction with L0+L1 only.
          Another difficulty is configuration: in a big HBase cluster there are so many applications that building a suitable configuration for each one will be a huge challenge.

          Sergey Shelukhin added a comment -

          At a recent HBase meeting Jonathan Hsieh asked me to provide an easier-to-understand chart of perf.
          I haven't run new experiments since then, and setting up new ones will take some time (because I want to get good ones to use for con slides). For now I'm attaching a primitive chart made from old data, for reads using loadtesttool against a default-compacted and a stripe-compacted table. 500 data points for each.

          The experiment setup is described in the perf doc and is the one on c1.xlarge instances. A fixed 10-stripe scheme vs. the default scheme was used, with 3 relatively large (growing to several gigs) regions, with interleaving batches of writes and reads.

          Elliott Clark added a comment -

          Thanks for the doc. Reading this tonight.

          Jonathan Hsieh added a comment -

          Sergey Shelukhin, thanks for the graph. Seeing that, I think an avg/std deviation could be an even simpler way of showing where this compaction approach demonstrates a win. It looks like the variance of times will be significantly higher with default, while the avg time seems about the same.

          stack added a comment -

          Rereading the design doc and how-to-use. They are very nice. Can go into the book.

          High-level, and I think you have suggested this yourself elsewhere, it'd be coolio if user didn't have to choose between size and count – if it'd just figure itself based off incoming load.

          I've seen a case where a compaction produces a zero-length file (all deletes), so would that mess w/ this invariant: "Compaction must produce at least one file (see HBASE-6059)." or "...No stripe can ever be left with 0 files..."?

          I almost asked a few questions you'd already answered above in my previous read of the doc (smile).

          How would region merge work? We'd just drop all files into L0? Sounds like we'd have to drop references if we are not to break snapshotting.

          You think this true? "....stripe scheme uses larger number of files than default to ensure all compactions are small, which can affect very wide scans." Any measure of how much?

          Should stripe be on by default? Or have it as experimental for now until we get more data?

          How to use doc is excellent (though too many configs). Will review patch again next.

          Sergey Shelukhin added a comment -

          I'm looking at how to merge the policies. Unfortunately splitting uses more I/O than not splitting (who would have thought...), resulting in worse perf. Also, the system cannot really predict future data patterns, any more than region splitting can (at least not without a lot of added complexity), so a hint flag for how to split would need to be provided.

          W.r.t. producing files to contain metadata, that is unfortunately necessary. These files shouldn't have an effect. Stripes with only expired files can be merged. I've taken a stab at auto-detecting stripes from file metadata; in the general case it's very complex, and in the simplified realistic case it's just complex.

          Merge will drop everything into L0, yes. This could be improved, but it has to be done this way for now anyway due to references (same as with split), so no need to change it now.

          On-by-default would require smart default settings.

          Let me comment tomorrow on HBASE-7680, if I can make a size-count hybrid quickly I will post final patch without a lot of logic changes there, and hopefully we can commit 3 initial patches and build on top of that.

          stack added a comment -

          splitting uses more I/O than not splitting

          Sorry. You mean stripes use more i/o because we write L0 first and then rewrite into stripes?

          Sergey Shelukhin added a comment -

          The size-based scheme works by splitting stripes when they grow big. This splitting is good for sequential sharded keys, because the lower part of the split is written as one file and doesn't receive new data (or doesn't receive a lot of it anyway), so it doesn't have to participate in compactions. If you have uniform data, splitting results in more rewriting, and both stripes keep growing after the split.
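
          The split behavior described above can be sketched roughly as follows (editorial illustration; `SizeBasedSplit`, `Stripe`, `maybeSplit`, and the 1 GB threshold are all invented, not the actual stripe compaction policy code):

          ```java
          // Hedged sketch of the size-based idea described above; all names and
          // thresholds are made up for illustration, not actual HBase code.
          final class SizeBasedSplit {
            // A stripe with its current size and whether it still receives writes.
            static final class Stripe {
              final long sizeBytes;
              final boolean receivesWrites;
              Stripe(long sizeBytes, boolean receivesWrites) {
                this.sizeBytes = sizeBytes;
                this.receivesWrites = receivesWrites;
              }
            }

            static final long SPLIT_SIZE = 1L << 30; // split stripes past ~1 GB (illustrative)

            // Split an oversized stripe in two. With sequential keys, the lower
            // half is written out once and goes cold (no further compactions);
            // with uniform keys both halves stay hot and keep growing, hence the
            // extra rewriting.
            static java.util.List<Stripe> maybeSplit(Stripe s, boolean sequentialKeys) {
              java.util.List<Stripe> out = new java.util.ArrayList<>();
              if (s.sizeBytes < SPLIT_SIZE) {
                out.add(s); // small enough; leave it alone
                return out;
              }
              long lower = s.sizeBytes / 2;
              out.add(new Stripe(lower, !sequentialKeys));           // lower half
              out.add(new Stripe(s.sizeBytes - lower, true));        // upper half
              return out;
            }
          }
          ```

          The asymmetry is in the lower half's `receivesWrites` flag: for sequential keys it goes cold after the split, which is why the scheme is cheap there and costly for uniform data.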

          Sergey Shelukhin added a comment -

          So it's easier to configure (just say "I want 500Mb-1Gb-... stripes"), but on net it results in more I/O during initial data population, before the region reaches its stable size.

          Sergey Shelukhin added a comment -

          I have run the maven tests and rebased all patches (no changes except a tiny one in the comparator)... I will commit HBASE-7679, HBASE-7680, HBASE-7967, HBASE-8000 to trunk this afternoon if there are no objections by then.

          stack added a comment -

          So, stripe compactions do more i/o unless it is the time-series use case? I cannot turn this on by default? Where do I go to read about the benefits of this new addition? Thanks Sergey Shelukhin

          Sergey Shelukhin added a comment -

          Yes, unless there's non-uniform key access there will be more, but smaller, compactions.
          HBASE-8541 makes for less I/O amplification; Enis is reviewing it, so it will probably follow quickly.

          You probably don't want to turn it on by default, as it makes sense either for non-uniform data or for large regions.

          The documents, one of them targeted at users (and all of them out of date), are attached to this very JIRA.
          The configuration was simplified quite a bit compared to the state of the doc.
          Let me file a JIRA to document things in e.g. the book.
          For the first release it will be positioned as an "experimental" feature...

          stack added a comment -

          Yes, unless there's non-uniform key access there will be more, but smaller, compactions.

          As per M. C. Srivas' supposition above, I suppose.

          You probably don't want to turn it on by default, as it makes sense either for non-uniform data or for large regions.

          So for what cases should we turn it on?

          The documents, one of them targeted at users (ans all of them out of date), are attached to this very JIRA

          Pardon me. I've reviewed a bunch of this feature – docs and code – and am just having trouble quantifying the benefit this slew of new code brings in. Sorry if I am being thick.

          If I read the attached user doc, will it be clear (though it is out of date?) Let me try it.

          stack added a comment -

          If I read the user doc., it says:

          "This improves read performance in common scenarios and greatly reduces variability, by avoiding large and/or inefficient compactions"

          If I read further, there is no clear message on when enabling stripe compactions makes sense. Doc is missing a section on when NOT to use stripe compactions.

          The doc has detail, but it seems a bunch of it no longer applies after recent reworkings (as you say above).

          Do you have some stats on how it can improve life under certain workloads and what those workloads are?

          My concern is that a bunch of code will go into hbase and it will sit there unused. We have enough of that already. I'd like to have some clear messaging around this feature, both how it can benefit, and also how a user could enable it and see the effects of its workings.

          Sergey Shelukhin added a comment -

          stack I understand your concern. I will get to documenting it this or next week, so it should be able to get at least into 98.
          As far as I know, there are some people who wanted to try it out, and there's also the timestamp compaction JIRA, which might be obviated... let's see if it can get adoption as an experimental feature. It's pretty well isolated now, so it should be easy to remove later, or to move into a separate module out of the way.

          Sergey Shelukhin added a comment -

          All the pertinent patches have been committed for some time (before 98 was branched).

          Otis Gospodnetic added a comment -

          Btw. is this going to get into any 0.96.x releases by any chance? Thanks.

          Sergey Shelukhin added a comment -

          With 98 coming so soon, probably not. stack wdyt?

          stack added a comment -

          Yeah. Too late for 0.96. We are trying to get back on to a "bugs-only" in point releases praxis – unless there a citizen revolt. Also need reason for folks to upgrade to 0.98!

          Andrew Purtell added a comment -

          Any doc updates available? 0.98.0RC1 is open.

          Sergey Shelukhin added a comment -

          The docs were committed to the book as part of HBASE-9854; sorry for the late reply.

          Enis Soztutar added a comment -

          Closing this issue after 0.99.0 release.


            People

            • Assignee: Sergey Shelukhin
            • Reporter: Sergey Shelukhin
            • Votes: 0
            • Watchers: 42
