HBase / HBASE-15161 Umbrella: Miscellaneous improvements from production usage / HBASE-15171

Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.2, 0.98.17, 2.0.0
    • Fix Version/s: 1.3.0, 0.98.18, 2.0.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      One of our online users once wrote a huge number of duplicate KVs during a bulk load. We found that this generated lots of small HFiles and slowed down the whole process.

      After debugging, we found that PutSortReducer#reduce already tries to handle this pathological case by setting a threshold on single-row size and by using a TreeSet (the variable named map) to avoid writing out duplicate KVs, but it forgets to exclude duplicate KVs from the accumulated size. curSize therefore reaches the threshold early even when only a few distinct KVs have been collected, as shown in the code segment below:

      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<Cell> cells: p.getFamilyCellMap().values()) {
          for (Cell cell: cells) {
            KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
            map.add(kv);
            curSize += kv.heapSize();
          }
        }
      }
      

      We should stop counting duplicate KVs in curSize, i.e. only execute the curSize += kv.heapSize(); line when the kv is not already present in map.
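
      A minimal, self-contained sketch of one way to exclude duplicates from the accumulated size is given below. It is an illustration under assumptions, not the attached patch: plain strings stand in for KeyValue, and the class DedupSizeSketch and the helpers sizeOf and fillBatch are hypothetical names. It relies only on the fact that java.util.TreeSet#add returns false when an equal element is already in the set.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {
    // Hypothetical stand-in for kv.heapSize(); not the HBase implementation.
    static long sizeOf(String kv) {
        return kv.length();
    }

    /**
     * Drains entries from iter into sorted until the accumulated size of the
     * distinct entries reaches threshold, and returns that accumulated size.
     */
    static long fillBatch(Iterator<String> iter, TreeSet<String> sorted, long threshold) {
        long curSize = 0;
        while (iter.hasNext() && curSize < threshold) {
            String kv = iter.next();
            // TreeSet.add returns false when an equal element is already present,
            // so duplicates do not contribute to curSize.
            if (sorted.add(kv)) {
                curSize += sizeOf(kv);
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three copies of the same entry plus one distinct entry.
        List<String> input = Arrays.asList("row1/cf:a", "row1/cf:a", "row1/cf:a", "row1/cf:b");
        TreeSet<String> sorted = new TreeSet<>();
        long size = fillBatch(input.iterator(), sorted, 1024);
        System.out.println(sorted.size() + " distinct entries, accumulated size " + size);
    }
}
```

      With this guard, repeated copies of the same entry no longer push curSize toward the threshold, so a batch is cut only when the distinct entries are actually large.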

        Attachments

        1. HBASE-15171.patch (2 kB, Yu Li)
        2. HBASE-15171.patch (2 kB, Yu Li)
        3. HBASE-15171.patch (2 kB, Michael Stack)
        4. HBASE-15171.addendum.patch (1.0 kB, Yu Li)


            People

            • Assignee: Yu Li (liyu)
            • Reporter: Yu Li (liyu)
