Hive / HIVE-7805

Support running multiple scans in hbase-handler

Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: HBase Handler
    • Labels: None

    Description

      Currently, the HiveHBaseTableInputFormat only supports running a single scan. This can be less efficient than running multiple disjoint scans in certain cases, particularly when using a composite row key. For instance, given a row key schema of:

      struct<bucket int, time timestamp>
      

      if one wants to push down the predicate:

      bucket IN (1, 10, 100) AND time >= 1408333927 AND time < 1408506670
      

      it's much more efficient to run one scan per bucket over the time range (particularly if there's a large amount of data per day). With a single scan, the MR job has to process all data, for all time, for every bucket between 1 and 100, including the buckets the predicate does not select.

      Hive should allow HBaseKeyFactory implementations to decompose a predicate into one or more scans in order to take advantage of this.
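
      As a rough illustration of the idea, the decomposition for the example predicate could look something like the sketch below. This is not the code from the attached patches: the class `MultiScanSketch`, the `ScanRange` holder, and the key encoding (a 4-byte big-endian bucket followed by an 8-byte big-endian time) are all assumptions made for the example, and no HBase or Hive classes are used. It relies only on the fact that an HBase scan's start row is inclusive and its stop row exclusive, which matches the `>=` and `<` bounds in the predicate.

      ```java
      import java.nio.ByteBuffer;
      import java.util.ArrayList;
      import java.util.List;

      /** Hypothetical sketch: turn "bucket IN (...) AND time in [start, stop)" into one row range per bucket. */
      public class MultiScanSketch {

          /** A half-open row-key range [startRow, stopRow), analogous to a scan's start and stop rows. */
          public static final class ScanRange {
              public final byte[] startRow;
              public final byte[] stopRow;

              ScanRange(byte[] startRow, byte[] stopRow) {
                  this.startRow = startRow;
                  this.stopRow = stopRow;
              }
          }

          /**
           * Assumed composite key layout: 4-byte big-endian bucket, then 8-byte big-endian time.
           * For non-negative values this byte order sorts the same way as the numeric values.
           */
          static byte[] encodeKey(int bucket, long time) {
              return ByteBuffer.allocate(12).putInt(bucket).putLong(time).array();
          }

          /** Builds one disjoint range per bucket instead of a single range spanning all buckets. */
          public static List<ScanRange> decompose(int[] buckets, long timeStartInclusive, long timeStopExclusive) {
              List<ScanRange> ranges = new ArrayList<>();
              for (int bucket : buckets) {
                  ranges.add(new ScanRange(
                          encodeKey(bucket, timeStartInclusive),
                          encodeKey(bucket, timeStopExclusive)));
              }
              return ranges;
          }

          public static void main(String[] args) {
              List<ScanRange> ranges = decompose(new int[] {1, 10, 100}, 1408333927L, 1408506670L);
              System.out.println(ranges.size() + " scan ranges");
          }
      }
      ```

      Each range would then back its own scan, so rows for buckets outside the IN list, and rows outside the time window within a selected bucket, are never read.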

      Attachments

        1. HIVE-7805.patch
          38 kB
          Andrew Mains
        2. HIVE-7805.1.patch
          69 kB
          Andrew Mains
        3. HIVE-7805.2.patch
          70 kB
          Andrew Mains

          People

            Assignee: Andrew Mains (amains12)
            Reporter: Andrew Mains (amains12)
            Votes: 0
            Watchers: 3