[HBASE-8768] Improve bulk load performance by moving key value construction from map phase to reduce phase. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.98.0, 0.95.2
Component/s: mapreduce, Performance
Labels: None
Hadoop Flags: Reviewed

Description

The ImportTSV bulk loading approach uses the MapReduce framework. The existing mapper and reducer classes used by ImportTSV are TsvImporterMapper.java and PutSortReducer.java. The ImportTSV tool parses the tab-separated (by default) values from the input files, and the mapper generates a Put object for each row from the KeyValues created from the parsed text. PutSortReducer then works on partitions based on the regions and sorts the Put objects for each region.

Overheads we can see in the above approach:
==========================================
1) A KeyValue is constructed for each parsed value in a line, adding extra data such as the row key, column family and qualifier. This increases the data to be shuffled to the reduce phase by around 5x.
The size of the data to be shuffled can be calculated as below:

 Data to be shuffled = nl*nt*(rl+cfl+cql+vall+tsl+30)

If we move KeyValue construction to the reduce phase, the size of the data to be shuffled will be as below, which is much smaller than the above:

 Data to be shuffled = nl*nt*vall

nl - Number of lines in the raw file
nt - Number of tabs or columns including row key.
rl - row length which will be different for each line.
cfl - column family length which will be different for each family
cql - qualifier length
tsl - timestamp length.
vall - length of each parsed value.
30 - bytes for KeyValue size, number of families, etc.
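
As a rough illustration of the two formulas above, the following sketch evaluates them for one set of made-up inputs. The function names and all the numbers are assumptions chosen for the example, and the per-line lengths (rl, cfl, cql, vall) are treated as averages, which the formulas only approximate since the real lengths vary per line.

```python
def map_side_shuffle_bytes(nl, nt, rl, cfl, cql, vall, tsl, overhead=30):
    """Shuffle size when KeyValues are built in the map phase:
    nl*nt*(rl+cfl+cql+vall+tsl+30)."""
    return nl * nt * (rl + cfl + cql + vall + tsl + overhead)


def reduce_side_shuffle_bytes(nl, nt, vall):
    """Shuffle size when only the parsed values are shuffled: nl*nt*vall."""
    return nl * nt * vall


# Hypothetical input: 1,000,000 lines, 10 columns, short keys and values.
before = map_side_shuffle_bytes(nl=1_000_000, nt=10, rl=10, cfl=2, cql=5, vall=10, tsl=8)
after = reduce_side_shuffle_bytes(nl=1_000_000, nt=10, vall=10)

print(before)          # 650000000
print(after)           # 100000000
print(before / after)  # 6.5
```

The ratio depends entirely on how long the row key, family and qualifier are relative to the values, so the figure for a real data set will differ.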

2) On the mapper side we create Put objects by adding all the KeyValues constructed for each line, and in the reducer we again collect the KeyValues from each Put and sort them.
Instead, we can create and sort the KeyValues directly in the reducer.

Solution:
========
We can improve bulk load performance by moving KeyValue construction from the mapper to the reducer, so that the mapper just sends the raw text for each row to the reducer. The reducer then parses the records for each row, and creates and sorts the KeyValues before writing them to HFiles.

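The flow described above can be sketched in plain Python as a simulation. This is not the attached patch and not HBase code: `Cell`, `map_line`, `reduce_row` and `run_job` are hypothetical names, the shuffle is imitated with a dictionary, and plain tuple ordering stands in for the real KeyValue comparator (which, for example, orders timestamps newest first). The column spec follows the `HBASE_ROW_KEY,family:qualifier` form used by the import tool.

```python
from collections import defaultdict
from typing import NamedTuple

ROW_KEY_MARKER = "HBASE_ROW_KEY"


class Cell(NamedTuple):
    """Hypothetical stand-in for a KeyValue."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    value: bytes


def map_line(line, columns, sep="\t"):
    """Map phase: emit (row key, raw line) without building any cells."""
    fields = line.split(sep)
    row_key = fields[columns.index(ROW_KEY_MARKER)]
    return row_key.encode(), line


def reduce_row(row_key, lines, columns, timestamp, sep="\t"):
    """Reduce phase: parse the raw lines of one row, then build and sort cells."""
    cells = set()
    for line in lines:
        for spec, field in zip(columns, line.split(sep)):
            if spec == ROW_KEY_MARKER:
                continue  # the row key is not stored as a column
            family, _, qualifier = spec.partition(":")
            cells.add(Cell(row_key, family.encode(), qualifier.encode(),
                           timestamp, field.encode()))
    return sorted(cells)


def run_job(lines, columns, timestamp):
    """Imitate map -> shuffle (group by row key) -> reduce in one process."""
    shuffled = defaultdict(list)
    for line in lines:
        key, raw = map_line(line, columns)
        shuffled[key].append(raw)
    output = []
    for row_key in sorted(shuffled):
        output.extend(reduce_row(row_key, shuffled[row_key], columns, timestamp))
    return output


columns = [ROW_KEY_MARKER, "d:c1", "d:c2"]
for cell in run_job(["r2\ta\tb", "r1\tx\ty"], columns, timestamp=1):
    print(cell)
```

Only the raw line crosses the simulated shuffle here; the row key, family, qualifier and timestamp are attached to each value after grouping, which is the saving the description is aiming at.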
Conclusion:
===========
The above changes will improve map phase performance by avoiding KeyValue construction, and reduce phase performance by avoiding excess data in the shuffle.

Attachments

HBASE-8768_v4.patch, 31/Jul/13 09:07, 21 kB, rajeshbabu
HBASE-8768_v3.patch, 29/Jul/13 06:47, 21 kB, rajeshbabu
HBASE-8768_v2.patch, 26/Jul/13 12:29, 21 kB, rajeshbabu
HBase_Bulkload_Performance_Improvement.pdf, 26/Jul/13 12:36, 659 kB, rajeshbabu

People

Assignee: rajeshbabu

Reporter: rajeshbabu

Votes: 0

Watchers: 14

Dates

Created: 19/Jun/13 05:47

Updated: 23/Sep/13 19:22

Resolved: 01/Aug/13 16:22