[HBASE-15171] Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.2, 0.98.17, 2.0.0
Fix Version/s: 1.3.0, 0.98.18, 2.0.0
Component/s: None
Labels: None

Hadoop Flags: Reviewed

Description

One of our online users once wrote a huge number of duplicated KVs during a bulk load, and we found that this generated lots of small HFiles and slowed down the whole process.

After debugging, we found that PutSortReducer#reduce already tries to handle the pathological case by setting a threshold on single-row size and by using a sorted set (the variable named map below) to avoid writing out duplicated KVs, but it forgets to exclude duplicated KVs from the accumulated size, as the code segment below shows:

while (iter.hasNext() && curSize < threshold) {
  Put p = iter.next();
  for (List<Cell> cells: p.getFamilyCellMap().values()) {
    for (Cell cell: cells) {
      KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
      map.add(kv);
      curSize += kv.heapSize();
    }
  }
}

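Because every duplicate still adds to curSize, the threshold is reached while map holds only a few distinct KVs, so each pass writes out very little data. We should change the accounting so that curSize reflects only the KVs actually held in map, for example by moving the curSize += kv.heapSize(); line out of the outer for loop.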
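
One possible way to keep duplicates out of the accumulated size is sketched below. It is a stdlib-only stand-in, not the attached patch: plain strings take the place of KeyValue, String.length() takes the place of heapSize(), and the class and method names (DedupSizeSketch, accumulate) are hypothetical. The only fact it relies on is that Set.add returns false when the element is already present, so the size can be increased only for KVs that were really added.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the accumulation loop in PutSortReducer#reduce.
class DedupSizeSketch {

    // Drains "puts" into the sorted set until the threshold is reached and
    // returns the accumulated size of the distinct entries only.
    static long accumulate(Iterator<List<String>> iter, long threshold, TreeSet<String> map) {
        long curSize = 0;
        while (iter.hasNext() && curSize < threshold) {
            List<String> cells = iter.next();   // stands in for one Put's cells
            for (String kv : cells) {
                // add() returns false for an element the set already holds,
                // so a duplicate does not increase curSize.
                if (map.add(kv)) {
                    curSize += kv.length();     // stands in for kv.heapSize()
                }
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        List<List<String>> puts = Arrays.asList(
                Arrays.asList("row1/cf:a", "row1/cf:a"),
                Arrays.asList("row1/cf:a", "row1/cf:b"));
        TreeSet<String> map = new TreeSet<>();
        long size = accumulate(puts.iterator(), 1000L, map);
        // Two distinct 9-character entries are counted; the naive loop would count all four.
        System.out.println(size + " bytes for " + map.size() + " distinct entries");
    }
}
```

With this accounting the threshold is compared against what the set actually holds, so a row made up mostly of duplicates no longer triggers an early write.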

Attachments

- HBASE-15171.patch, 26/Jan/16 14:21, 2 kB, Yu Li
- HBASE-15171.patch, 27/Jan/16 03:05, 2 kB, Yu Li
- HBASE-15171.patch, 27/Jan/16 14:15, 2 kB, Michael Stack
- HBASE-15171.addendum.patch, 28/Jan/16 03:57, 1.0 kB, Yu Li

People

Assignee: Yu Li

Reporter: Yu Li

Votes: 0

Watchers: 5

Dates

Created: 26/Jan/16 14:14

Updated: 01/Jul/22 20:30

Resolved: 28/Jan/16 15:08