HBase / HBASE-5140

TableInputFormat subclass to allow N number of splits per region during MR jobs

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 0.90.4
    • Fix Version/s: None
    • Component/s: mapreduce
    • Labels:
    • Release Note:
      Used the 0.90 branch for the patch but code looks compatible in trunk as well (with one deprecated method)
    • Tags:
      mapreduce splits tableinputformat

      Description

      With regard to HBASE-5138, I am working on a patch for the TableInputFormat class that overrides getSplits to generate N splits per region and/or N splits per job. The idea is to convert each region's startKey and endKey from byte[] to BigDecimal, take the difference, divide it by N, convert the results back to byte[], and generate splits on the resulting values. Assuming your keys are evenly distributed across the key space, this should generate splits with nearly the same number of rows each. Any suggestions on this issue are welcome.
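
      The key arithmetic described above can be sketched roughly as follows. This is only an illustration, not the attached patch: the class and method names (KeyRangeSplitter, splitPoints) are hypothetical, it uses BigInteger rather than BigDecimal because the key bytes are treated as unsigned integers, and it assumes both keys are non-empty with startKey sorting before endKey. A region with an empty start or end key would need a substitute boundary chosen by the caller.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: computes evenly spaced boundary keys.
class KeyRangeSplitter {

    /** Right-pads a key with zero bytes so both keys have the same length. */
    static byte[] pad(byte[] key, int length) {
        return Arrays.copyOf(key, length);
    }

    /** Converts a non-negative value back to exactly `length` big-endian bytes. */
    static byte[] toFixedLength(BigInteger value, int length) {
        // toByteArray() may add a leading sign byte or return fewer bytes.
        byte[] raw = value.toByteArray();
        byte[] out = new byte[length];
        int copy = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - copy, out, length - copy, copy);
        return out;
    }

    /**
     * Returns n + 1 boundary keys: the (padded) start key, n - 1 interior
     * points, and the (padded) end key. Consecutive pairs form the n splits.
     */
    static List<byte[]> splitPoints(byte[] startKey, byte[] endKey, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int length = Math.max(startKey.length, endKey.length);
        // Signum 1 makes BigInteger read the bytes as an unsigned magnitude.
        BigInteger start = new BigInteger(1, pad(startKey, length));
        BigInteger end = new BigInteger(1, pad(endKey, length));
        if (start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("startKey must sort before endKey");
        }
        BigInteger range = end.subtract(start);
        List<byte[]> points = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            // Multiply before dividing so the rounding error does not accumulate.
            BigInteger offset = range.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(n));
            points.add(toFixedLength(start.add(offset), length));
        }
        return points;
    }
}
```

      For example, splitPoints(new byte[]{0x00}, new byte[]{0x10}, 4) yields the boundaries 0x00, 0x04, 0x08, 0x0C and 0x10. The splits hold similar row counts only if the keys really are spread evenly over the byte range; keys drawn from a narrow alphabet (ASCII text, for instance) would cluster in a few of the splits.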

          Activity

          Josh Wymer created issue -
          Ted Yu made changes -
          Field | Original Value | New Value
          Link | (none) | This issue relates to HBASE-4063 [ HBASE-4063 ]
          Josh Wymer made changes -
          Status | Open [ 1 ] | Patch Available [ 10002 ]
          Release Note | (none) | Used the 0.90 branch for the patch but code looks compatible in trunk as well (with one deprecated method)
          Affects Version/s | (none) | 0.90.4 [ 12316406 ]
          Labels | (none) | mapreduce split
          Fix Version/s | (none) | 0.90.4 [ 12316406 ]
          Tags | (none) | mapreduce splits tableinputformat
          Josh Wymer made changes -
          Description | "... I am working on a subclass for the TableInputFormat class that overrides getSplits ..." | "... I am working on a patch for the TableInputFormat class that overrides getSplits ..." (the rest of the description is unchanged)
          Andrew Purtell made changes -
          Status | Patch Available [ 10002 ] | Resolved [ 5 ]
          Fix Version/s | 0.90.4 [ 12316406 ] | (none)
          Resolution | (none) | Won't Fix [ 2 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Josh Wymer
            • Votes:
              1
              Watchers:
              4

                Time Tracking

                Estimated: 72h
                Remaining: 72h
                Logged: Not Specified
