  Chukwa / CHUKWA-667

Optimize the HBase schema for Ganglia queries

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Data Processors
    • Labels:
      None

      Description

      The Chukwa HBase table schema is designed for HICC and cannot be fully adapted to the Ganglia web frontend, for several reasons:
      (1) It cannot quickly retrieve all the cluster names and their associated host names.
      (2) System metrics carry no attributes such as type or unit, so it is hard to interpret the collected metrics programmatically.
      (3) There is no data consolidation function: selecting a metric over a large time range (such as 30 days) fetches all of the raw data to draw the graph, which hurts performance badly.

      We will redesign the table schema so that it is better adapted to Ganglia web frontend queries.

        Issue Links

          Activity

          eyang Eric Yang added a comment -

          Resuming progress on this issue. We have learnt a few lessons about metrics schema design over the last couple of years. A monotonically increasing row key is bad for HBase region servers. We have built an alternate HBase schema like this:

          [group name].[date].[metric]:[primary key], with column family "m" and cell name "m". This provides a way to split regions by prefix, but it still isn't great. The advantage is having one table for all types of metrics, but we observed that some groups' metrics grow faster than others', so region splits still need a lot of manual maintenance to keep HBase from blowing up.

          A new proposal is to change the metrics schema to partition by day of the month. More often than not, a time series database has two requirements: fast lookup by time and fast lookup of the same metric. This means the row key needs to carry hints for partitioning both by day and by primary key. An improved schema can be generated as:

          Table: [group name]
          Row Key: [day:primary_key]
          Column Family: [subgroup name]
          Column: [metric name]
          Timestamp: [actual timestamp]

          An example Hadoop table would look like:

          Table: Hadoop
          Row Key: 13:host1.example.com
          Column Family: HDFS
          Column: datanode_bytes_read
          Timestamp: 1234567890
          Value: 123
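
          For illustration, a minimal sketch of writing one such sample (assuming the HBase 1.x Java client API; the class name and the string encoding of the value are illustrative assumptions, not Chukwa code):

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.TableName;
          import org.apache.hadoop.hbase.client.Connection;
          import org.apache.hadoop.hbase.client.ConnectionFactory;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.client.Table;
          import org.apache.hadoop.hbase.util.Bytes;

          public class MetricWriterSketch {
            public static void main(String[] args) throws IOException {
              Configuration conf = HBaseConfiguration.create();
              try (Connection conn = ConnectionFactory.createConnection(conf);
                   Table table = conn.getTable(TableName.valueOf("Hadoop"))) {
                long ts = 1234567890L;                                    // actual metric timestamp
                Put put = new Put(Bytes.toBytes("13:host1.example.com")); // row key = [day:primary_key]
                put.addColumn(Bytes.toBytes("HDFS"),                      // column family = subgroup
                              Bytes.toBytes("datanode_bytes_read"),       // column = metric name
                              ts,                                         // cell timestamp
                              Bytes.toBytes("123"));                      // metric value
                table.put(put);
              }
            }
          }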

          Units and metric types can be stored in a secondary table for rendering and metadata lookup, to reduce storage space. Thoughts?

          eyang Eric Yang added a comment -

          The above schema is optimized for a single-pass scan with a date prefix that fits in one day. For historical trending, scanning such a table would not be good. The solution is to summarize the raw data into monthly and yearly tables by running a daily MapReduce job that populates hourly and daily averages into secondary tables. Hence, the row key "day" prefix may be adjusted to "01-12" for the monthly table and "1-9" for the yearly table. Based on the time range selected for the query, the query API can switch between the monthly and yearly tables. This has a similar effect to rrdtool, where data from the more distant past is kept as lower-resolution aggregates, but uses HBase-provided timestamp versions and compaction to achieve data retention.
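
          Continuing the write sketch above, the single-day scan might look like this (a hedged sketch assuming the same HBase client API, the raw Hadoop table, and the day prefix "13:"):

          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;
          import org.apache.hadoop.hbase.filter.PrefixFilter;

          // Single-pass scan of every row whose key starts with "13:" (day 13 of the month).
          Scan scan = new Scan();
          scan.setFilter(new PrefixFilter(Bytes.toBytes("13:")));
          scan.addFamily(Bytes.toBytes("HDFS"));            // stay within one column family
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
              byte[] v = row.getValue(Bytes.toBytes("HDFS"), Bytes.toBytes("datanode_bytes_read"));
              // fold v into the hourly/daily averages destined for the monthly/yearly tables ...
            }
          }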

          We may want to apply some hashing algorithm to primary_key to ensure a fixed-length primary key and an even distribution.

          sree2k Sreepathi Prasanna added a comment -

          Hello Eric,

          I have read in the HBase guide that it is not optimized for many column families.

          Using the schema below would result in many column families.
          Row Key: 13:host1.example.com
          Column Family: HDFS
          Column: datanode_bytes_read

          Also, all queries, both reads and writes, might hit the same row at the same time.

          How about adding the metric name at the end or the start of the row key? For example:
          Row Key: 13|host1.example.com|datanode_bytes_read

          Further, we can save some space by assigning a number to each metric. For example, if datanode_bytes_read is assigned a number, let's say 100, then the row key would become:
          Row Key: 13|host1.example.com|100

          The mapping between metric names and numbers would be maintained in a separate table. This saves some space in HBase.

          eyang Eric Yang added a comment -

          Hi Sreepathi,

          In general, each column family is partitioned into its own directory in HDFS, and the common access pattern is by column rather than by row. Therefore, using more than one column family is fine as long as a given scan stays within the same column family; there is no performance penalty. However, using a secondary table to store the metric name has its own problems. The ID needs to be padded: otherwise a query composed with "|100" can also match "|1000" if the keys are not padded to the same length. When storing a large number of key types, padding takes only slightly less storage than storing the metric name directly in the cell name. And while a shorter row key is faster for locating the region and linearly deserializing data from the same data blocks, it costs two extra requests to decode the ID and then a linear scan of row keys to find the closest row key to start from. A linear row key scan is slower than grabbing a block of data from a column family for a row.
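
          For what it's worth, a tiny sketch of the padding concern (the ID width and values here are purely illustrative):

          // Without padding, a prefix composed with metric ID 100 also matches ID 1000:
          //   "13|host1.example.com|100" is a prefix of "13|host1.example.com|1000".
          // Zero-padding every ID to a fixed width avoids the collision.
          int metricId = 100;                                  // hypothetical ID from the mapping table
          String paddedId = String.format("%06d", metricId);   // "000100"
          String rowKey = "13|host1.example.com|" + paddedId;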

          eyang Eric Yang added a comment -

          Revising the schema one more time. Keeping time only in the HBase timestamp is difficult to work with from high-level languages such as Hive or Pig, because their load and store functions don't work with the HBase timestamp value.

          Table: chukwa
          Row Key: [day:primary_key]
          Column Family: t
          Column: [timestamp]
          Value: [value]
          Column Family: a
          Column: [timestamp]
          Value: [tags]
          Timestamp: [timestamp]

          This schema allows annotations and tags to be attached to a timestamp, and the user can choose whether or not to query the tags. We still update the HBase timestamp field so that TTL works.

          The benefit of having the day at the front of the row key is that regions are easier to split, which provides a more uniform data distribution among regions. Using a separate column family for annotations provides the ability to annotate only a specific timestamp.
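
          A minimal write sketch under this revised layout, reusing the connection from the earlier sketch (assuming the timestamp is stored as its decimal string form in the column qualifier and that tags are a plain string; both encodings are my assumptions):

          long ts = System.currentTimeMillis();
          byte[] tsQualifier = Bytes.toBytes(Long.toString(ts));    // column name is the timestamp

          Table chukwaTable = conn.getTable(TableName.valueOf("chukwa"));
          Put put = new Put(Bytes.toBytes("13:host1.example.com"));
          // family "t": the metric value, keyed by timestamp
          put.addColumn(Bytes.toBytes("t"), tsQualifier, ts, Bytes.toBytes("123"));
          // family "a": optional tags/annotation for the same timestamp
          put.addColumn(Bytes.toBytes("a"), tsQualifier, ts, Bytes.toBytes("rack=r1"));
          chukwaTable.put(put);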

          eyang Eric Yang added a comment -

          A further enhancement to the primary key section:

          The first 6 digits are the md5 prefix of the hashed group name, followed by 6 digits of the hash of the primary key name.

          Example of this would be:

          Hadoop.dfs.datanode.byteRead:host1.example.com

          Here Hadoop.dfs.datanode.byteRead and host1.example.com are collocated primary keys, but when aggregating metrics by host, Hadoop.dfs.datanode.byteRead is the part used to compute the aggregate, i.e. it is the more significant part of the key. Hence, we generate the primary key as:

          Hadoop.dfs.datanode.byteRead = 21da46
          host1.example.com = a026db

          For day 269 of the year, the row key would appear as:

          26921da46a026db

          This enables the programmer to customize a rowFilter to match either the more significant part or the least significant part of the primary key. Thoughts?
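
          A sketch of composing such a row key, building on the earlier snippets (using commons-codec for the md5; the hash prefixes computed here will not match the example values above, which are taken from the comment, and zero-padding the day to 3 digits is my assumption):

          import org.apache.commons.codec.digest.DigestUtils;

          String metric = "Hadoop.dfs.datanode.byteRead";
          String source = "host1.example.com";
          String day = String.format("%03d", 269);               // day of year, fixed width
          String rowKey = day
              + DigestUtils.md5Hex(metric).substring(0, 6)       // more significant part
              + DigestUtils.md5Hex(source).substring(0, 6);      // less significant part

          // A prefix scan on day + metric hash returns that metric for every source:
          Scan scan = new Scan();
          scan.setFilter(new PrefixFilter(
              Bytes.toBytes(day + DigestUtils.md5Hex(metric).substring(0, 6))));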

          eyang Eric Yang added a comment -

          For managing metadata about the primary key composition, a tertiary table would be a good way to keep track of that metadata. The structure of the table could look like this:

          Table: chukwa_meta
          Row Key: [primary group]
          Column Family: k
          Column: [primary key]
          Value:

          { "unit": "kb", "type": "gauge", "sig": [first 6 digit of md5 signature] }
          eyang Eric Yang added a comment -

          Example of this:

          Table: chukwa_meta
          Row Key: Hadoop.dfs.datanode.byteRead
          Column Family: k
          Column: host1.example.com
          Value:

          { "unit": "kb", "type": "gauge", "sig": "a025db" }
          eyang Eric Yang added a comment -

          Updated the Chukwa HBase schema to this format:

          Table: CHUKWA_META
          RowKey: metricGroup
          Column Family: k
          Column Name: metric name or primary key
          Cell Value:

          { "sig": "md5 signature", "type": "source|metric|cluster" }

          Table: CHUKWA
          RowKey: [DAY][METRIC_KEY_MD5][PRIMARY_KEY_MD5]
          Column Family: t
          Column Name: [timestamp]
          Cell Value: [STRING]
          Column Family: a
          Column Name: [timestamp]
          Cell Value: [TAGS]

          This provides round-robin partitioning and good hashed lookup of time-range values. Please keep in mind that this storage format works for metrics data, not logs.
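
          To sketch a time-range read against this final layout, building on the earlier snippets (assuming, as above, decimal-string timestamps as column qualifiers and string cell values; a server-side ColumnRangeFilter could replace the client-side check):

          import java.util.Map;
          import java.util.NavigableMap;

          long startTime = 1234560000L, endTime = 1234570000L;       // illustrative range within one day
          Table chukwa = conn.getTable(TableName.valueOf("CHUKWA"));
          Get get = new Get(Bytes.toBytes(rowKey));                  // [DAY][METRIC_KEY_MD5][PRIMARY_KEY_MD5]
          get.addFamily(Bytes.toBytes("t"));
          Result r = chukwa.get(get);
          NavigableMap<byte[], byte[]> cells = r.getFamilyMap(Bytes.toBytes("t"));
          for (Map.Entry<byte[], byte[]> cell : cells.entrySet()) {
            long ts = Long.parseLong(Bytes.toString(cell.getKey())); // column name = timestamp
            if (ts >= startTime && ts <= endTime) {
              String value = Bytes.toString(cell.getValue());
              // hand (ts, value) to the frontend as one data point ...
            }
          }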

          sree2k Sreepathi Prasanna added a comment -

          I like the idea of separating the metadata and the metrics data itself into two different tables. This saves a lot of space.

          Regarding row key:

          day+6 digits of md5(metricgroup+metric)+6 digits of md5(host)

          I kind of agree with this design, but I had a question. Does that mean all metrics collected per minute on the same day would hit the same row? Is that performant? Also, if you are aggregating the data every 15 minutes, wouldn't that cause load on the same rows where writes are happening?

          eyang Eric Yang added a comment -

          Hi Sreepathi,

          Metrics for the whole day will update the same row. However, a row is just a reference pointer to the actual data blocks, so this reduces the number of lookups into the data blocks. Cells are appended to the new data in memory or in the WAL and spilled to disk during compaction. This design removes the stress point of a monotonically increasing index. It reaches optimally balanced regions after one year of running because we partition by day. Partitioning by a numeric prefix is better than by a metric group prefix, because a metric group prefix can generate regions of uneven size (some metric groups contain more metrics than others). For this reason, the design added the day as the prefix of the row key.

          eyang Eric Yang added a comment -

          I just committed this.


            People

            • Assignee: eyang Eric Yang
            • Reporter: jerryshao Saisai Shao
            • Votes: 0
            • Watchers: 3
