[HIVE-7805] Support running multiple scans in hbase-handler - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.14.0
Fix Version/s: None
Component/s: HBase Handler
Labels: None

Description

Currently, the HiveHBaseTableInputFormat only supports running a single scan. This can be less efficient than running multiple disjoint scans in certain cases, particularly when using a composite row key. For instance, given a row key schema of:

struct<bucket int, time timestamp>

if one wants to push down the predicate:

bucket IN (1, 10, 100) AND time >= 1408333927 AND time < 1408506670

it's much more efficient to run one scan per bucket over the time range (particularly if there's a large amount of data per day). A single scan has to span the row keys from bucket 1 through bucket 100, so the MR job processes the data for all time for every bucket in between.

Hive should allow HBaseKeyFactory implementations to decompose a predicate into one or more scans in order to take advantage of this.
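
As a rough illustration of the decomposition described above, here is a minimal sketch in plain Java. It does not use the Hive or HBase APIs and is not taken from the attached patches. The class name MultiScanSketch, the ScanRange type, and the key encoding (4-byte big-endian bucket followed by 8-byte big-endian time, non-negative values only) are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class MultiScanSketch {

    /** Hypothetical holder for a half-open row-key range [startRow, stopRow). */
    static final class ScanRange {
        final byte[] startRow;
        final byte[] stopRow;

        ScanRange(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /**
     * Assumed composite key layout: 4-byte bucket, then 8-byte time, both
     * big-endian (the ByteBuffer default). For non-negative values this
     * sorts byte-wise in the same order as (bucket, time) sorts numerically.
     */
    static byte[] encodeKey(int bucket, long time) {
        return ByteBuffer.allocate(12).putInt(bucket).putLong(time).array();
    }

    /**
     * Turns "bucket IN (...) AND time >= timeStart AND time < timeStop"
     * into one disjoint key range per bucket.
     */
    static List<ScanRange> decompose(int[] buckets, long timeStart, long timeStop) {
        List<ScanRange> ranges = new ArrayList<>();
        for (int bucket : buckets) {
            ranges.add(new ScanRange(encodeKey(bucket, timeStart),
                                     encodeKey(bucket, timeStop)));
        }
        return ranges;
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The predicate from the description: three buckets, one time window.
        List<ScanRange> ranges =
            decompose(new int[] {1, 10, 100}, 1408333927L, 1408506670L);
        for (ScanRange r : ranges) {
            System.out.println(toHex(r.startRow) + " .. " + toHex(r.stopRow));
        }
    }
}
```

Each resulting range covers only the requested time window within one bucket, whereas a single scan would run from the first range's start key to the last range's stop key and read everything in between.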

Attachments

HIVE-7805.patch, 20/Aug/14 19:59, 38 kB, Andrew Mains
HIVE-7805.1.patch, 22/Aug/14 20:29, 69 kB, Andrew Mains
HIVE-7805.2.patch, 25/Apr/15 23:44, 70 kB, Andrew Mains

People

Assignee: Andrew Mains

Reporter: Andrew Mains

Votes: 0

Watchers: 3

Dates

Created: 20/Aug/14 19:54

Updated: 26/Apr/15 02:01