HBASE-3936: Incremental bulk load support for Increments


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix

    Description

      From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster. Using bulk load will use less CPU and network than going via the HBase API."

      I have been working with a specific implementation of, and can envision a broader class of, applications that reduce data into a large collection of counters, perhaps building projections of the data across many dimensions in the process. One can use Hadoop MapReduce as the engine to accomplish this for a given data set and use LoadIncrementalHFiles to move the result into place for live serving. MR is a natural fit for summation over very large counter sets: emit counter increments for the data set and its projections in mappers, use combiners for partial aggregation, and use reducers to do the final summation into HFiles.
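
      A minimal sketch of the mapper/combiner/reducer shape of such a job follows. The class names, column family, and qualifier are hypothetical, and the driver wiring is omitted:

      {code:java}
      import java.io.IOException;

      import org.apache.hadoop.hbase.KeyValue;
      import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
      import org.apache.hadoop.hbase.util.Bytes;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      /** Hypothetical skeleton of a counter-summing bulk load job. */
      public class CounterSummationSketch {
        static final byte[] FAMILY = Bytes.toBytes("c");      // hypothetical column family
        static final byte[] QUALIFIER = Bytes.toBytes("sum"); // hypothetical qualifier

        /** Emit one increment per counter key (dimension/projection) derived from the record. */
        public static class IncrementMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, LongWritable> {
          private static final LongWritable ONE = new LongWritable(1);
          @Override
          protected void map(LongWritable offset, Text record, Context ctx)
              throws IOException, InterruptedException {
            for (String counterKey : record.toString().split("\\s+")) {
              ctx.write(new ImmutableBytesWritable(Bytes.toBytes(counterKey)), ONE);
            }
          }
        }

        /** Partial aggregation, used as the combiner. */
        public static class SumCombiner
            extends Reducer<ImmutableBytesWritable, LongWritable, ImmutableBytesWritable, LongWritable> {
          @Override
          protected void reduce(ImmutableBytesWritable key, Iterable<LongWritable> vals, Context ctx)
              throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vals) sum += v.get();
            ctx.write(key, new LongWritable(sum));
          }
        }

        /** Final summation; emits one KeyValue per counter for HFileOutputFormat. */
        public static class SumReducer
            extends Reducer<ImmutableBytesWritable, LongWritable, ImmutableBytesWritable, KeyValue> {
          @Override
          protected void reduce(ImmutableBytesWritable key, Iterable<LongWritable> vals, Context ctx)
              throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vals) sum += v.get();
            ctx.write(key, new KeyValue(key.get(), FAMILY, QUALIFIER, Bytes.toBytes(sum)));
          }
        }
      }
      {code}

      The driver would typically call HFileOutputFormat.configureIncrementalLoad() against the target table so reducer output is partitioned and ordered per region, then hand the output directory to LoadIncrementalHFiles to move it into place.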

      However, it is not possible to then merge in a set of updates to an existing table built in the manner above without either 1) joining the table data and the update set into a large MR temporary set, followed by a complete rewrite of the table; or 2) posting all of the updates as Increments via the HBase API, impacting any other concurrent users of the HBase service, and perhaps taking 10-100 times longer than if updates could be computed directly into HFiles like the original import. Both of these alternatives are expensive in terms of CPU and time; one is also expensive in terms of disk.
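
      For comparison, alternative 2 reduces to the following per-counter RPC pattern against the live cluster (table and column names are hypothetical); each updated counter costs a round trip and a write through the region server:

      {code:java}
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.util.Bytes;

      public class ApiIncrementSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "counters");  // hypothetical table
          try {
            // One RPC (and one memstore/WAL write) per updated counter.
            table.incrementColumnValue(Bytes.toBytes("dim1|2011-05"),
                Bytes.toBytes("c"), Bytes.toBytes("sum"), 42L);
          } finally {
            table.close();
          }
        }
      }
      {code}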

      I propose adding incremental bulk load support for Increments. Here is a sketch of a possible implementation:

      • Add a KV type for Increment
      • Modify HFile's main (the HFile tool), LoadIncrementalHFiles, and other code that works with HFiles directly to handle the new KV type
      • The bulk load API can move the files to be merged into the Stores as before.
      • Implement an alternate compaction algorithm, or modify the existing one, to identify Increments and apply them to the existing most recent version of a value, or create the value if none exists.
        • Use KeyValueHeap as is to merge value-sets by row as before.
        • For each row, use a KV-keyed Map for in-memory updates of values.
        • If there is an existing value and it is not a serialized long, ignore the Increment and log at INFO level.
        • Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with an appropriate memory limit, so work for overlarge rows will spill to disk. Can be local disk, not HDFS.
      • Never return an Increment KV to a client doing a Get or Scan.
        • Before the merge is complete, if we find an Increment KV when searching Store files for a value, continue searching back through the Store files until we find a Put KV for the value, summing Increments as they are encountered and then applying the sum to the Put value; if the search ends without finding a Put, treat the accumulated Increments as a Put (see the sketch after this list).
        • If there is an existing value and it is not a serialized long, ignore the Increment and log at INFO level.
      • As a beneficial side effect, with Increments as just another KV type we can unify Put and Increment handling.
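
      Below is a minimal sketch of the read-path rule described above, assuming the proposed Increment KV type exists; isIncrementType() stands in for the new type check:

      {code:java}
      import java.util.List;

      import org.apache.hadoop.hbase.KeyValue;
      import org.apache.hadoop.hbase.util.Bytes;

      public class IncrementResolutionSketch {
        /** versionsNewestFirst holds every KV for one row/family/qualifier, newest first. */
        static byte[] resolveCell(List<KeyValue> versionsNewestFirst) {
          long delta = 0;
          for (KeyValue kv : versionsNewestFirst) {
            if (isIncrementType(kv)) {
              delta += Bytes.toLong(kv.getValue());  // keep accumulating increments
            } else {
              // Most recent Put found: apply the accumulated increments to it, unless
              // the existing value is not a serialized long (then ignore and log at INFO).
              byte[] existing = kv.getValue();
              if (existing.length != Bytes.SIZEOF_LONG) {
                return existing;
              }
              return Bytes.toBytes(Bytes.toLong(existing) + delta);
            }
          }
          // Search ended without a Put: treat the accumulated Increments as a Put of the sum.
          return Bytes.toBytes(delta);
        }

        /** Placeholder for the proposed Increment KV type check; does not exist today. */
        static boolean isIncrementType(KeyValue kv) {
          throw new UnsupportedOperationException("proposed new KV type");
        }
      }
      {code}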

      Because this is a core concern, I'd prefer to discuss this as a possible enhancement to core rather than as a Coprocessor-based extension. However, it may be possible to implement everything except the KV changes within the Coprocessor framework.

          People

            Assignee: Unassigned
            Reporter: Andrew Kyle Purtell (apurtell)
            Votes: 0
            Watchers: 6
