HBase / HBASE-1901

"General" partitioner for "hbase-48" bulk (behind the api, write hfiles direct) uploader

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None

      Description

      For users to bulk upload by writing HFiles directly to the filesystem, they currently need to write a partitioner that is intimate with how their key schema works. This issue is about providing a general partitioner, one that could never be as fair as a custom-written partitioner but that might just work for many cases. The idea is that a user would supply the first and last keys in their dataset to upload. We'd then do BigDecimal arithmetic on the range between the start and end row ids, dividing it by the number of reducers to come up with key ranges per reducer.

      (I thought jgray had done some BigDecimal work dividing keys already but I can't find it)
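The range-dividing idea in the description can be sketched as below. This is an illustrative standalone sketch, not the patch's code: the class and method names are hypothetical, and it uses BigInteger (as the later comments do) on keys treated as unsigned, right-zero-padded byte strings.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Sketch of the "general" partitioning idea: treat the first and last row
// keys as big unsigned integers and divide the range evenly among the
// reducers. Names here are illustrative, not from the actual patch.
public class RangeSplitter {

    // Right-pad a key with zero bytes so start and end have equal width,
    // then interpret it as an unsigned BigInteger.
    static BigInteger toBigInteger(byte[] key, int width) {
        byte[] padded = Arrays.copyOf(key, width);
        return new BigInteger(1, padded);
    }

    // Compute the lower boundary of each reducer's key range.
    // endKey is exclusive, matching the convention discussed below.
    static BigInteger[] splitKeyRange(byte[] startKey, byte[] endKey, int numReducers) {
        int width = Math.max(startKey.length, endKey.length);
        BigInteger start = toBigInteger(startKey, width);
        BigInteger end = toBigInteger(endKey, width);
        BigInteger step = end.subtract(start).divide(BigInteger.valueOf(numReducers));
        BigInteger[] boundaries = new BigInteger[numReducers];
        for (int i = 0; i < numReducers; i++) {
            boundaries[i] = start.add(step.multiply(BigInteger.valueOf(i)));
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Split the range ["a", "e") across 4 reducers: one byte value each.
        BigInteger[] b = splitKeyRange(new byte[]{'a'}, new byte[]{'e'}, 4);
        for (BigInteger x : b) {
            System.out.println(x);  // prints 97, 98, 99, 100
        }
    }
}
```

As the description notes, such an even split can never be as fair as a custom partitioner when keys are unevenly distributed in the range.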

        Activity

        anty.rao added a comment -

        Maybe we could sample the data first to determine the key range.

        stack added a comment -

        Sampling would make for better partitioning. That's what the TotalOrderPartitioner up in Hadoop does. We could do a sampling partitioner in a different issue?

        Jonathan Gray added a comment -

        The BigInteger range dividing stuff I did is over in HBASE-1183.

        stack added a comment -

        Here is a first cut at a simple order-preserving partitioner. I changed the HFile test output format to use it, though it's not really suitable. I updated the package doc to discuss this new class.
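The attachment itself is not in this issue, but the getPartition() arithmetic of an order-preserving partitioner over a start/end key range can be sketched like this. The real class would extend Hadoop's Partitioner; this simplified standalone version, with hypothetical names, shows only the interpolation:

```java
import java.math.BigInteger;
import java.util.Arrays;

// Standalone sketch of an order-preserving getPartition(): interpolate the
// key's position between configured start and end keys, so keys stay in
// order across reducers. Illustrative only; not the attached patch's code.
public class OrderPreservingPartition {

    static BigInteger unsigned(byte[] key, int width) {
        return new BigInteger(1, Arrays.copyOf(key, width));
    }

    // Map a row key to a reducer index in [0, numReducers).
    static int getPartition(byte[] key, byte[] startKey, byte[] endKey, int numReducers) {
        int width = Math.max(key.length, Math.max(startKey.length, endKey.length));
        BigInteger start = unsigned(startKey, width);
        BigInteger end = unsigned(endKey, width);  // exclusive end key
        BigInteger k = unsigned(key, width);
        if (k.compareTo(start) < 0) return 0;                  // clamp below range
        if (k.compareTo(end) >= 0) return numReducers - 1;     // clamp above range
        // (key - start) * numReducers / (end - start)
        return k.subtract(start)
                .multiply(BigInteger.valueOf(numReducers))
                .divide(end.subtract(start))
                .intValue();
    }

    public static void main(String[] args) {
        byte[] start = {'a'}, end = {'e'};
        System.out.println(getPartition(new byte[]{'a'}, start, end, 4));  // prints 0
        System.out.println(getPartition(new byte[]{'d'}, start, end, 4));  // prints 3
    }
}
```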

        anty.rao added a comment -

        Hi stack,
        I have done some testing and found we should change the code of TestHFileOutputFormat a little, or the test won't work.
        "int rows = this.conf.getInt("mapred.map.tasks", 1) * ROWSPERSPLIT;"
        should be
        "int rows = this.conf.getInt("mapred.map.tasks", 1) * ROWSPERSPLIT + 2;"
        Just as you said, the end key needs to be exclusive; i.e. one larger than the biggest key in your key space.
        However, the key range of TestHFileOutputFormat is 1 to conf.getInt("mapred.map.tasks", 1) * ROWSPERSPLIT + 1, so we should add 1 more to rows (the end key).
        Except for that, everything looks right: the STARTKEY and ENDKEY of each region are correct.
        The precondition is that we know the startKey and endKey. Now that you have written the partitioner, can we write a MR job to calculate the startKey and endKey?

        stack added a comment -

        @anty Thanks for finding issue in patch. Will post new one soon.

        stack added a comment -

        Committed to TRUNK.


          People

          • Assignee: stack
          • Reporter: stack
          • Votes: 1
          • Watchers: 2
