[HBASE-8073] HFileOutputFormat support for offline operation - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Abandoned
Affects Version/s: None
Fix Version/s: None
Component/s: mapreduce
Labels: None

Description

When HFileOutputFormat is used to generate HFiles, it inspects the region topology of the target table. The split points from that table are used to guide the TotalOrderPartitioner. If the target table does not exist, it is first created. This imposes an unnecessary dependence on an online HBase cluster and an existing table.

If the table exists, it can be used. However, the job can be smarter. For example, if there's far more data going into the HFiles than the table currently contains, the table regions aren't very useful as data split points. Instead, the input data can be sampled to produce split points more meaningful to the dataset. LoadIncrementalHFiles is already capable of handling divergence between HFile boundaries and table regions, so this should not pose any additional burden at load time.

The proper method of sampling the data likely requires a custom input format and an additional map-reduce job to perform the sampling. See a relevant implementation: https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
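
To make the sampling idea concrete, here is a minimal, single-process sketch of reservoir sampling followed by split-point selection. It is not taken from any attached patch or from the linked implementation. The class `SplitPointSampler` and its methods are hypothetical names, and row keys are modeled as `String` for brevity, whereas real HBase row keys are byte arrays with their own ordering. A real job would run the sampling inside an input format or a separate map-reduce pass, as described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of HBase): samples row keys and derives split points. */
public class SplitPointSampler {

    /** Reservoir sampling (Algorithm R): keeps a uniform sample of at most sampleSize keys. */
    public static List<String> sample(Iterator<String> keys, int sampleSize, Random rng) {
        List<String> reservoir = new ArrayList<>(Math.max(sampleSize, 0));
        long seen = 0;
        while (keys.hasNext()) {
            String key = keys.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill phase: the first sampleSize keys are kept unconditionally.
                reservoir.add(key);
            } else {
                // Replacement phase: keep the new key with probability sampleSize / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }

    /** Picks up to numPartitions - 1 evenly spaced, strictly increasing keys from the sorted sample. */
    public static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            String candidate = sorted.get(idx);
            // Skip duplicates so the split points stay strictly increasing.
            if (splits.isEmpty() || candidate.compareTo(splits.get(splits.size() - 1)) > 0) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

The resulting split points could then be written to a partition file for the TotalOrderPartitioner instead of being read from the target table's region boundaries. That wiring is omitted here because it depends on the job setup.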

Attachments

HBASE-8073-trunk-v1.patch (02/Jun/14 05:47, 16 kB, Jerry He)
HBASE-8073-trunk-v0.patch (31/May/14 01:28, 15 kB, Jerry He)

Issue Links

relates to: HBASE-11170 Provide option for WALPlayer not to rely on live HBase table (Closed)

People

Assignee: Unassigned

Reporter: Nick Dimiduk

Votes: 0

Watchers: 12

Dates

Created: 12/Mar/13 00:37

Updated: 16/Jun/22 16:38

Resolved: 16/Jun/22 16:38