HBase / HBASE-15181

A simple implementation of date based tiered compaction

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0, 0.98.18, 2.0.0
    • Component/s: Compaction
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Date tiered compaction policy is a date-aware store file layout that is beneficial for time-range scans for time-series data.

      When it performs well:

          reads for limited time ranges, especially scans of recent data

      When it doesn't perform as well:

          random gets without a time range
          frequent deletes and updates
          out of order data writes, especially writes with timestamps in the future
          bulk loads of historical data

      Recommended configuration:
      To turn on date tiered compaction, set the property below. Enabling it for the whole cluster is not recommended, because that also applies it to the meta table and slows down random gets on the meta table:
      hbase.hstore.compaction.compaction.policy: org.apache.hadoop.hbase.regionserver.compactions.DateTieredCompactionPolicy

      Parameters for Date Tiered Compaction:
      hbase.hstore.compaction.date.tiered.max.storefile.age.millis: files whose max timestamp is older than this age are no longer compacted. Default: Long.MAX_VALUE.
      hbase.hstore.compaction.date.tiered.base.window.millis: base window size in milliseconds. Default: 6 hours.
      hbase.hstore.compaction.date.tiered.windows.per.tier: number of windows per tier. Default: 4.
      hbase.hstore.compaction.date.tiered.incoming.window.min: minimum number of files to compact in the incoming window. Set it to the expected number of files in the window to avoid wasteful compaction. Default: 6.
      hbase.hstore.compaction.date.tiered.window.policy.class: the policy used to select store files within the same time window, which avoids wasteful compaction. It does not apply to the incoming window. Default: exploring compaction.
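
      To illustrate how the base window and windows-per-tier settings interact, here is a minimal sketch. It assumes that each tier's window is windows-per-tier times as long as the window of the tier below it; that growth rule is an assumption of this sketch, not something the release note states, and the class and method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

public class TierWindows {
    /**
     * Window length in milliseconds at the given tier (tier 0 is the base window),
     * ASSUMING each tier's window is windowsPerTier times the previous tier's.
     */
    static long windowMillis(long baseWindowMillis, int windowsPerTier, int tier) {
        long size = baseWindowMillis;
        for (int i = 0; i < tier; i++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            size = Math.multiplyExact(size, (long) windowsPerTier);
        }
        return size;
    }

    public static void main(String[] args) {
        long base = TimeUnit.HOURS.toMillis(6); // default base window from the release note
        int windowsPerTier = 4;                 // default windows per tier from the release note
        for (int tier = 0; tier < 4; tier++) {
            long hours = TimeUnit.MILLISECONDS.toHours(windowMillis(base, windowsPerTier, tier));
            System.out.println("tier " + tier + ": " + hours + " hours per window");
        }
    }
}
```

      Under that assumption and the default settings, the windows of the first four tiers would span 6, 24, 96 and 384 hours.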

      With tiered compaction, all servers in the cluster promote windows to a higher tier at the same time, so using a compaction throttle is recommended:
      hbase.regionserver.throughput.controller: org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController

      Because there will most likely be more store files, adjust the configuration so that flushes are not blocked and compaction is properly throttled:
      hbase.hstore.blockingStoreFiles: change to 50 if using all default parameters when turning on date tiered compaction. If changing the parameters, use 1.5 to 2 x the projected file count, where projected file count = windows per tier x tier count + incoming window min + files older than max age.
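
      The sizing rule above can be worked through with a small calculation. This is only a sketch: the class and method names are hypothetical, and the tier count and the number of files older than max age are example inputs that depend on the workload.

```java
public class BlockingStoreFiles {
    /** Projected file count, using the formula from the release note. */
    static int projectedFileCount(int windowsPerTier, int tierCount,
                                  int incomingWindowMin, int filesOlderThanMaxAge) {
        return windowsPerTier * tierCount + incomingWindowMin + filesOlderThanMaxAge;
    }

    /** Returns {low, high}: 1.5x (rounded up) and 2x the projected file count. */
    static int[] suggestedBlockingStoreFiles(int projected) {
        int low = (int) Math.ceil(projected * 1.5);
        int high = projected * 2;
        return new int[] {low, high};
    }

    public static void main(String[] args) {
        // Example inputs only: 4 windows per tier, 4 tiers, incoming window min of 6,
        // and 2 files older than max age.
        int projected = projectedFileCount(4, 4, 6, 2);
        int[] range = suggestedBlockingStoreFiles(projected);
        System.out.println("projected file count: " + projected);
        System.out.println("hbase.hstore.blockingStoreFiles between " + range[0] + " and " + range[1]);
    }
}
```

      With these example inputs the projected file count is 24, which suggests a blockingStoreFiles value between 36 and 48.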

      For more details, please refer to the design spec at https://docs.google.com/document/d/1_AmlNb2N8Us1xICsTeGDLKIqL6T-oHoRLZ323MG_uy8/edit#

      Description

      This is a simple implementation of date-based tiered compaction, similar to Cassandra's, with the following benefits:
      1. Improves date-range-based scans by structuring store files in a date-based tiered layout.
      2. Reduces compaction overhead.
      3. Improves TTL efficiency.

      It is a perfect fit for use cases that:
      1. have mostly date-based data writes and scans, with a focus on the most recent data.
      2. never or rarely delete data.

      Out-of-order writes are handled gracefully. Time range overlapping among store files is tolerated and the performance impact is minimized.

      Configuration can be set in hbase-site.xml or overridden at the per-table or per-column-family level through the hbase shell.
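
      As an illustration of a per-column-family override, the sketch below renders an hbase shell `alter` statement from a map of properties. The helper class, the table name `metrics` and the family name `d` are hypothetical; the property name and value are copied from the release note and should be checked against the HBase version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ShellOverride {
    /** Renders an hbase shell alter statement that sets CONFIGURATION on one column family. */
    static String alterColumnFamily(String table, String family, Map<String, String> conf) {
        StringJoiner pairs = new StringJoiner(", ");
        for (Map.Entry<String, String> e : conf.entrySet()) {
            pairs.add("'" + e.getKey() + "' => '" + e.getValue() + "'");
        }
        return "alter '" + table + "', {NAME => '" + family
                + "', CONFIGURATION => {" + pairs + "}}";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> conf = new LinkedHashMap<>();
        // Property name and value as given in the release note.
        conf.put("hbase.hstore.compaction.compaction.policy",
                "org.apache.hadoop.hbase.regionserver.compactions.DateTieredCompactionPolicy");
        // Hypothetical table and column family names.
        System.out.println(alterColumnFamily("metrics", "d", conf));
    }
}
```

      Scoping the override to a single column family keeps the policy off the meta table and off tables that are not time-series.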

      The design spec is at https://docs.google.com/document/d/1_AmlNb2N8Us1xICsTeGDLKIqL6T-oHoRLZ323MG_uy8/edit?usp=sharing
      Results from our production deployment are at https://docs.google.com/document/d/1GqRtQZMMkTEWOijZc8UCTqhACNmdxBSjtAQSYIWsmGU/edit#

        Attachments

        1. HBASE-15181-v1.patch
          40 kB
          Clara Xiong
        2. HBASE-15181-v2.patch
          46 kB
          Clara Xiong
        3. HBASE-15181-master-v1.patch
          51 kB
          Clara Xiong
        4. HBASE-15181-master-v2.patch
          51 kB
          Clara Xiong
        5. HBASE-15181-master-v3.patch
          51 kB
          Clara Xiong
        6. HBASE-15181-master-v4.patch
          51 kB
          Clara Xiong
        7. HBASE-15181-branch-1.patch
          51 kB
          Clara Xiong
        8. HBASE-15181-98.patch
          50 kB
          Clara Xiong
        9. HBASE-15181-0.98.v4.patch
          50 kB
          Ted Yu
        10. HBASE-15181-0.98.patch
          51 kB
          Clara Xiong
        11. HBASE-15181-0.98-ADD.patch
          4 kB
          Clara Xiong
        12. HBASE-15181-ADD.patch
          3 kB
          Clara Xiong

              People

              • Assignee: Clara Xiong (claraxiong)
              • Reporter: Clara Xiong (claraxiong)
              • Votes: 0
              • Watchers: 30
