HBase / HBASE-6040

Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.0, 0.95.2
    • Fix Version/s: 0.94.1
    • Component/s: mapreduce
    • Environment: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Added a new config param "hbase.mapreduce.hfileoutputformat.datablock.encoding" with which we can specify the encoding scheme to be used on disk. Data will be written into HFiles using this encoding scheme during bulk load. The value can be NONE, PREFIX, DIFF or FAST_DIFF, as these are the DataBlockEncoding types supported now. [When any new types are added later, the corresponding names will also become valid.]
      The checksum type and the number of bytes per checksum can be configured using the config params hbase.hstore.checksum.algorithm and hbase.hstore.bytes.per.checksum respectively.
    • Labels: bulkload
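As a sketch of how a bulk-load driver might set the parameters the release note documents (the three key names come from this issue; the chosen values are illustrative, and `java.util.Properties` stands in for Hadoop's `Configuration` so the snippet is self-contained):

```java
import java.util.Properties;

// Illustrative sketch only: Properties stands in for Hadoop's Configuration.
// The three keys below are the ones documented in this issue; the values
// (FAST_DIFF, CRC32, 16384) are assumed example settings, not recommendations.
public class BulkLoadConfSketch {
    public static Properties bulkLoadConf() {
        Properties conf = new Properties();
        // On-disk data block encoding used while writing HFiles in bulk load.
        conf.setProperty("hbase.mapreduce.hfileoutputformat.datablock.encoding",
                "FAST_DIFF");
        // HBase-handled checksums: algorithm and bytes covered per checksum.
        conf.setProperty("hbase.hstore.checksum.algorithm", "CRC32");
        conf.setProperty("hbase.hstore.bytes.per.checksum", "16384");
        return conf;
    }

    public static void main(String[] args) {
        bulkLoadConf().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```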

Description

      When data is bulk loaded using HFileOutputFormat, we are not using the block encoding and the HBase-handled checksum features. When the writer is created for making the HFile, I am not seeing any such info passed to the WriterBuilder.
      In HFileOutputFormat.getNewWriter(byte[] family, Configuration conf), we don't have this info and do not pass it to the writer either, so those HFiles will not have these optimizations.

      Later, in LoadIncrementalHFiles.copyHFileHalf(), where we physically split one HFile (created by the MR job) if it cannot belong to just one region, I can see we pass the data block encoding details and checksum details to the new HFile writer. But this step won't happen in the normal case, I think.
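A minimal sketch of the idea behind the fix: resolve the encoding name from the new config param when getNewWriter() builds the HFile writer. The enum here is a local stand-in for HBase's DataBlockEncoding so the example compiles on its own; only the name-to-enum lookup is shown, not the actual WriterBuilder wiring.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the enum mirrors the DataBlockEncoding names listed in the
// release note, standing in for HBase's real enum. The config key is the one
// this issue adds.
public class EncodingResolveSketch {
    enum DataBlockEncoding { NONE, PREFIX, DIFF, FAST_DIFF }

    static final String KEY =
            "hbase.mapreduce.hfileoutputformat.datablock.encoding";

    // Defaults to NONE when the param is unset; an unknown name fails fast
    // with an IllegalArgumentException from valueOf.
    static DataBlockEncoding resolve(Map<String, String> conf) {
        return DataBlockEncoding.valueOf(conf.getOrDefault(KEY, "NONE"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, "FAST_DIFF");
        System.out.println(resolve(conf)); // FAST_DIFF
    }
}
```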

      Attachments

        1. HBASE-6040_Trunk.patch
          3 kB
          Anoop Sam John
        2. HBASE-6040_94.patch
          3 kB
          Anoop Sam John


People

              Assignee: Anoop Sam John
              Reporter: Anoop Sam John
              Votes: 0
              Watchers: 10
