  1. HBase
  2. HBASE-12590

A solution for data skew in HBase-Mapreduce Job


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: mapreduce
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      1, Motivation
      Data skew is very common in production environments. An HBase table may contain many small regions and a few large ones. Small regions waste computing resources: a job that scans a table with 3000 small regions needs 3000 mappers. Large regions hold up the whole job: in a 100-region table where one region is far larger than the other 99, a job that takes the table as input finishes 99 mappers very quickly and then waits a long time for the last one.

      2, Configuration
      Add three new configuration properties:
      hbase.mapreduce.input.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
      hbase.mapreduce.input.autobalance.maxskewratio = 3 (the default). A region larger than 3x the average region size is treated as “proportionately too large”.
      hbase.table.row.textkey = true means the row key is text; false means the row key is binary. It is used to find the middle row key of a large region. The default value is true.
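To make the three properties and their defaults concrete, here is a minimal sketch that reads them from a plain java.util.Properties object. The class AutoBalanceSettings is hypothetical and is not part of HBase; Properties merely stands in for the Hadoop job configuration, and parsing the ratio as a double is an assumption, since this issue does not state the type used in the patch.

```java
import java.util.Properties;

/** Hypothetical holder for the three settings; not part of HBase. */
final class AutoBalanceSettings {
    final boolean enabled;      // hbase.mapreduce.input.autobalance
    final double maxSkewRatio;  // hbase.mapreduce.input.autobalance.maxskewratio
    final boolean textKey;      // hbase.table.row.textkey

    AutoBalanceSettings(Properties props) {
        // Defaults follow the description: disabled, ratio 3, text row keys.
        enabled = Boolean.parseBoolean(
            props.getProperty("hbase.mapreduce.input.autobalance", "false"));
        // Assumption: the ratio is parsed as a double for illustration only.
        maxSkewRatio = Double.parseDouble(
            props.getProperty("hbase.mapreduce.input.autobalance.maxskewratio", "3"));
        textKey = Boolean.parseBoolean(
            props.getProperty("hbase.table.row.textkey", "true"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Only auto balance is switched on; the other two keep their defaults.
        props.setProperty("hbase.mapreduce.input.autobalance", "true");
        AutoBalanceSettings s = new AutoBalanceSettings(props);
        System.out.println(s.enabled + " " + s.maxSkewRatio + " " + s.textKey);
    }
}
```

In a real job the same keys would be set on the job configuration or in the site configuration file rather than on a Properties object.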
      With auto balance enabled, input splits are derived from region sizes as follows:
      If (region size >= average size * ratio): cut the region into two MR input splits.
      If (average size <= region size < average size * ratio): use the region as one MR input split.
      If (total size of several consecutive regions < average size): combine these regions into one MR input split.
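The three rules can be sketched as a single pass over the region sizes in table order. This is an illustration under stated assumptions, not the code from the attached patch: the class AutoBalanceSplitSketch and its plan method are hypothetical, regions are represented only by their sizes, and small regions are grouped greedily (a group keeps growing while its total stays below the average), which is one possible reading of the third rule. Finding the actual middle row key of a large region is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; class and method names are hypothetical, not HBase API. */
final class AutoBalanceSplitSketch {

    /**
     * Plans input splits from per-region sizes given in table order.
     * Labels: "r3" = region 3 as one split, "r3a"/"r3b" = the two halves of a
     * large region, "r0+r1" = consecutive small regions combined into one split.
     */
    static List<String> plan(long[] regionSizes, double maxSkewRatio) {
        List<String> splits = new ArrayList<>();
        long total = 0;
        for (long size : regionSizes) {
            total += size;
        }
        if (total == 0) {
            // Nothing to balance: keep one split per region.
            for (int i = 0; i < regionSizes.length; i++) {
                splits.add("r" + i);
            }
            return splits;
        }
        double average = (double) total / regionSizes.length;
        double skewThreshold = average * maxSkewRatio;

        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (size >= skewThreshold) {
                // Rule 1: a proportionately too large region becomes two splits.
                splits.add("r" + i + "a");
                splits.add("r" + i + "b");
                i++;
            } else if (size >= average) {
                // Rule 2: an average-or-larger region maps to exactly one split.
                splits.add("r" + i);
                i++;
            } else {
                // Rule 3 (greedy assumption): absorb following regions while
                // the combined size stays below the average.
                StringBuilder label = new StringBuilder("r" + i);
                long combined = size;
                i++;
                while (i < regionSizes.length && combined + regionSizes[i] < average) {
                    combined += regionSizes[i];
                    label.append("+r").append(i);
                    i++;
                }
                splits.add(label.toString());
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Eight regions, average size 10, skew threshold 30 with ratio 3.
        long[] sizes = {2, 2, 2, 12, 40, 2, 2, 18};
        System.out.println(plan(sizes, 3.0));
        // prints [r0+r1+r2, r3, r4a, r4b, r5+r6, r7]
    }
}
```

In this example eight regions become six splits: the three leading small regions share one mapper, the region of size 40 is shared by two mappers, and the regions of size 12 and 18 keep one mapper each.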

      Example:
      See the attached PDF (A Solution for Data Skew in HBase-MapReduce Job).

      Reviews are welcome on the Review Board:
      https://reviews.apache.org/r/28494/diff/#

      Attachments

        1. HBASE-12590-v4.patch
          18 kB
          Weichen Ye
        2. A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf
          460 kB
          Weichen Ye
        3. HBASE-12590-v3.patch
          16 kB
          Weichen Ye
        4. HBase-12590-v2.patch
          11 kB
          Weichen Ye
        5. A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf
          458 kB
          Weichen Ye
        6. HBase-12590-v1.patch
          10 kB
          Weichen Ye

            People

              Assignee: Weichen Ye (yeweichen)
              Reporter: Weichen Ye (yeweichen)
              Votes: 0
              Watchers: 6
