[YARN-9395] Short Names for repeated HBase Column names - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0
Fix Version/s: None
Component/s: ATSv2
Labels: None

Description

Currently, the ATS HBase tables store each config name / metric name as a column name, and these names are long. They are repeated in every row and consume a lot of storage space. We have seen customers' HBase tables already consume more than 1.5 TB within a few days.

Example Configs:
c:yarn.timeline-service.webapp.rest-csrf.methods-to-ignore
c:yarn.timeline-service.entity-group-fs-store.active-dir
c:yarn.scheduler.configuration.zk-store.parent-path

Example Metrics:
m:REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS
m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:COMBINE_INPUT_RECORDS
m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES
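
To get a rough sense of the overhead, the sketch below measures the qualifier bytes in the example column names above and compares them with a hypothetical fixed-width short code. HBase stores the column qualifier with every cell, so this cost recurs per cell. The 2-byte code width, the helper names, and the assumption that everything after the first `:` is the qualifier are illustrative only; the figures ignore HBase block encoding and compression.

```python
# Example column names copied from this issue (family prefix, then qualifier).
EXAMPLE_COLUMNS = [
    "c:yarn.timeline-service.webapp.rest-csrf.methods-to-ignore",
    "c:yarn.timeline-service.entity-group-fs-store.active-dir",
    "c:yarn.scheduler.configuration.zk-store.parent-path",
    "m:REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS",
    "m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:COMBINE_INPUT_RECORDS",
    "m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES",
]

# Assumption: a short code of fixed width replaces the long qualifier.
SHORT_CODE_BYTES = 2


def qualifier_bytes(column: str) -> int:
    """UTF-8 length of the part after the first ':' (assumed to be the qualifier)."""
    _family, qualifier = column.split(":", 1)
    return len(qualifier.encode("utf-8"))


def bytes_saved_per_cell(column: str) -> int:
    """Qualifier bytes saved in one cell if the long name became a short code."""
    return qualifier_bytes(column) - SHORT_CODE_BYTES


if __name__ == "__main__":
    for col in EXAMPLE_COLUMNS:
        print(f"{qualifier_bytes(col):3d} bytes -> {SHORT_CODE_BYTES} bytes  {col}")
```

Multiplying the per-cell saving by the number of cells written per application gives a first-order estimate of how much qualifier data a mapping could remove.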

We need to use short column names, as per HBase best practice (http://moi.vonos.net/bigdata/avro-hbase-colnames/). The challenge is that ATS does not know the column names until the rows are inserted. We can provide a mapping file, which customers can configure upfront, that maps the repeated configs / metrics / info from different applications to unique numbers and so saves storage space. This is similar to what Phoenix does:

https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data
https://phoenix.apache.org/columnencoding.html
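
The mapping-file idea could be sketched roughly as follows. This is a minimal illustration, not ATSv2 or Phoenix code: the `ColumnNameMapper` class, the `name=number` file format, and the fallback to the full name for unmapped columns are all assumptions made for the example.

```python
from typing import Dict, Iterable


class ColumnNameMapper:
    """Hypothetical two-way mapping between long column names and short numeric codes."""

    def __init__(self, name_to_id: Dict[str, int]):
        if len(set(name_to_id.values())) != len(name_to_id):
            raise ValueError("each long name must map to a unique number")
        self._name_to_id = dict(name_to_id)
        self._id_to_name = {v: k for k, v in name_to_id.items()}

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> "ColumnNameMapper":
        """Parse 'long.name=number' lines (assumed format); '#' starts a comment line."""
        mapping: Dict[str, int] = {}
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, sep, number = line.rpartition("=")
            if not sep:
                raise ValueError(f"expected 'name=number', got: {line!r}")
            mapping[name.strip()] = int(number)
        return cls(mapping)

    def encode(self, long_name: str) -> str:
        """Return the short code if configured, otherwise the original name."""
        short_id = self._name_to_id.get(long_name)
        return long_name if short_id is None else str(short_id)

    def decode(self, stored: str) -> str:
        """Translate a stored qualifier back to its long name when it is a known code."""
        if stored.isascii() and stored.isdigit():
            return self._id_to_name.get(int(stored), stored)
        return stored


if __name__ == "__main__":
    mapper = ColumnNameMapper.from_lines([
        "# configs",
        "yarn.scheduler.configuration.zk-store.parent-path=1",
        "# metrics",
        "REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES=2",
    ])
    print(mapper.encode("yarn.scheduler.configuration.zk-store.parent-path"))
    print(mapper.decode("2"))
```

Falling back to the full name keeps writes working for columns the customer did not list upfront, at the cost of no savings for those columns. A real design would also need to handle long names that are themselves purely numeric, and to keep the mapping stable once data has been written.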

People

Assignee: Prabhu Joseph

Reporter: Prabhu Joseph

Votes: 0

Watchers: 4

Dates

Created: 18/Mar/19 06:45

Updated: 22/Mar/19 06:00