[HBASE-28440] Add support for using mapreduce sort in HFileOutputFormat2 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: backup&restore
Labels:
None

Description

Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all of the cells of a row in memory using a TreeSet. There is a warning in the javadoc "If lots of columns per row, it will use lots of memory sorting." This can be problematic for WALPlayer, which uses HFileOutputFormat2. You could have reasonably sized row which just gets lots of edits in the time period of WALs being replayed, and that would cause an OOM. We are seeing this in some cases with incremental backups.

MapReduce has built-in sorting capabilities which are not limited to sorting in memory. It can spill to disk as necessary to sort very large datasets. We can get this capability in HFileOutputFormat2 with a couple changes:

Add support for a KeyOnlyCellComparable type as the map output key
When configured, use job.setSortComparatorClass(CellWritableComparator.class) and job.setReducerClass(PreSortedCellsReducer.class)
Update WALPlayer to have a mode which can output this new comparable instead of ImmutableBytesWritable

CellWritableComparator exists already for the Import job, so there is some prior art.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Bryan Beaudreault

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 13/Mar/24 15:49

Updated:: 13/Mar/24 15:49