HBase / HBASE-28440

Add support for using mapreduce sort in HFileOutputFormat2

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: backup&restore
    • Labels: None

    Description

      Currently HFileOutputFormat2 uses CellSortReducer, which sorts all of the cells of a row in memory using a TreeSet. Its javadoc warns: "If lots of columns per row, it will use lots of memory sorting." This is a problem for WALPlayer, which uses HFileOutputFormat2: a reasonably sized row that receives many edits during the time period covered by the replayed WALs can cause an OOM. We are seeing this in some cases with incremental backups.

      MapReduce has built-in sorting that is not limited to memory: it can spill to disk as necessary to sort very large datasets. We can get this capability in HFileOutputFormat2 with the following changes:

      1. Add support for a KeyOnlyCellComparable type as the map output key
      2. When configured, use job.setSortComparatorClass(CellWritableComparator.class) and job.setReducerClass(PreSortedCellsReducer.class)
      3. Update WALPlayer to have a mode which can output this new comparable instead of ImmutableBytesWritable
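
The first two changes can be illustrated with a minimal, self-contained sketch in plain Java. It does not use the HBase or Hadoop APIs: `SimpleCellKey` and `PreSortedSink` are hypothetical stand-ins for the proposed KeyOnlyCellComparable and PreSortedCellsReducer, and the ordering shown (row, family, qualifier, then newest timestamp first) is a simplification that leaves out the cell type. The point is that once the framework delivers keys in cell order, the reducer only has to remember the previous key instead of buffering a whole row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical, simplified cell key: coordinates only, no value and no cell type. */
final class SimpleCellKey implements Comparable<SimpleCellKey> {
  final byte[] row;
  final byte[] family;
  final byte[] qualifier;
  final long timestamp;

  SimpleCellKey(String row, String family, String qualifier, long timestamp) {
    this.row = row.getBytes(StandardCharsets.UTF_8);
    this.family = family.getBytes(StandardCharsets.UTF_8);
    this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
    this.timestamp = timestamp;
  }

  /** Row, family, qualifier ascending as unsigned bytes; timestamp descending. */
  @Override
  public int compareTo(SimpleCellKey other) {
    int c = Arrays.compareUnsigned(row, other.row);
    if (c != 0) return c;
    c = Arrays.compareUnsigned(family, other.family);
    if (c != 0) return c;
    c = Arrays.compareUnsigned(qualifier, other.qualifier);
    if (c != 0) return c;
    return Long.compare(other.timestamp, timestamp); // newer cells sort first
  }
}

/** Hypothetical consumer of already-sorted keys; keeps O(1) state per stream. */
final class PreSortedSink {
  private SimpleCellKey previous;
  private int written;

  /** Accepts the next key, failing fast if the upstream sort order was violated. */
  void accept(SimpleCellKey key) {
    if (previous != null && previous.compareTo(key) > 0) {
      throw new IllegalStateException("input is not sorted");
    }
    previous = key;
    written++; // a real reducer would write the cell to the output here
  }

  int written() {
    return written;
  }
}
```

In this model, sorting a list of `SimpleCellKey` with `Collections.sort` plays the role of the framework's sort phase, and feeding the result to `PreSortedSink.accept` plays the role of the reduce phase.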

      CellWritableComparator exists already for the Import job, so there is some prior art. 
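
For context on why delegating the sort to the framework avoids the OOM, here is a rough sketch of a spill-to-disk sort in plain Java. It is an illustration of the general external merge sort technique under simplifying assumptions (string records without embedded newlines, a record-count buffer limit), not Hadoop's actual implementation; `SpillingSorter` and its methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

/** Hypothetical external sorter: bounded in-memory buffer, sorted runs on disk. */
final class SpillingSorter {
  private final int maxInMemory; // buffer limit in records (assumed knob)
  private final List<String> buffer = new ArrayList<>();
  private final List<Path> runs = new ArrayList<>();

  SpillingSorter(int maxInMemory) {
    this.maxInMemory = maxInMemory;
  }

  /** Buffers a record and spills a sorted run once the buffer is full. */
  void add(String record) {
    buffer.add(record);
    if (buffer.size() >= maxInMemory) spill();
  }

  int spillCount() {
    return runs.size();
  }

  private void spill() {
    try {
      Collections.sort(buffer);
      Path run = Files.createTempFile("spill-", ".txt");
      run.toFile().deleteOnExit();
      Files.write(run, buffer); // one record per line, UTF-8
      runs.add(run);
      buffer.clear();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Streams all records to the sink in sorted order via a k-way merge. */
  void drainTo(Consumer<String> sink) {
    if (!buffer.isEmpty()) spill();
    try {
      // The heap holds one cursor per run, ordered by each run's current line.
      PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
      for (Path run : runs) {
        Cursor cursor = new Cursor(Files.newBufferedReader(run));
        if (cursor.advance()) heap.add(cursor);
      }
      while (!heap.isEmpty()) {
        Cursor smallest = heap.poll();
        sink.accept(smallest.head);
        if (smallest.advance()) heap.add(smallest);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads one sorted run line by line. */
  private static final class Cursor {
    final BufferedReader reader;
    String head;

    Cursor(BufferedReader reader) {
      this.reader = reader;
    }

    boolean advance() throws IOException {
      head = reader.readLine();
      if (head == null) {
        reader.close();
        return false;
      }
      return true;
    }
  }
}
```

Memory use during the merge is bounded by the buffer limit plus one line per run, regardless of how many records share a row, which is the property the in-memory TreeSet approach lacks.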

          People

            Assignee: Unassigned
            Reporter: Bryan Beaudreault (bbeaudreault)
            Votes: 0
            Watchers: 3
