[MAPREDUCE-3086] Supporting range scan using TFile, TotalOrderPartitioner and partition index - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.20.205.0
Fix Version/s: None
Component/s: None
Labels: None
Target Version/s: 0.20.205.0

Description

Hive and HBase already offer similar or more powerful functionality, but using them is overkill or inconvenient in some cases. It should therefore be reasonable to add a few lightweight utility classes that support range scans only. The utility classes include:

- InputFormat supporting range scan: Indexed(Text|Binary)InputFormat
  - The input directory for IndexedInputFormat should contain one partition index and many TFiles. Each TFile stores a certain range of keys that does not overlap with the other TFiles; the boundaries are stored in the partition index.
  - Add 4 jobconfs, mapred.indexed(text|binary)inputformat.key.(start|end), which specify the range scan parameters.
  - For a MapReduce job using IndexedInputFormat, IndexedInputFormat.getSplits uses the partition index to filter out TFiles that fall outside the scan range.
  - IndexedInputFormat does not support multiple input directories or splitting within a single file; these can be added in the future.
- Tool to convert data in other formats into the IndexedInputFormat layout: TotalOrderIndexBuilder
  - If the input data is already total-order partitioned and in TFile format, just add a partition index to the input directory.
  - Otherwise, run InputSampler to generate the partition index, then run a MapReduce job with TotalOrderPartitioner to generate TFile-backed data, and finally move the partition index to the output directory.
- Client tool to scan/search an indexed data directory
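
The split filtering described for IndexedInputFormat.getSplits can be sketched as follows. This is a minimal illustration and is not taken from the attached patch: the class and method names are hypothetical, keys are plain Java strings rather than Text or BytesWritable, the end key is assumed to be inclusive, and the partition index is assumed to hold N-1 sorted split points for N files, with a key equal to a split point belonging to the file on its right. A null start or end stands for an unset key.(start|end) jobconf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the patch: selects which partition
// files overlap a requested key range, given sorted split points.
class RangeSplitFilter {

    // Index of the partition that holds `key`: the number of split
    // points that are <= key (assumed convention, see lead-in).
    static int partitionOf(String[] splitPoints, String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found at pos: key equals a split point, so it goes to the right.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Indices of all partitions that may contain keys in [start, end].
    // A null start or end means the range is unbounded on that side.
    static List<Integer> partitionsInRange(String[] splitPoints,
                                           String start, String end) {
        int first = (start == null) ? 0 : partitionOf(splitPoints, start);
        int last = (end == null) ? splitPoints.length
                                 : partitionOf(splitPoints, end);
        List<Integer> selected = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            selected.add(i);
        }
        return selected; // empty when start sorts after end
    }

    public static void main(String[] args) {
        // Three split points describe four files:
        // [.., g), [g, n), [n, t), [t, ..)
        String[] splitPoints = {"g", "n", "t"};
        System.out.println(partitionsInRange(splitPoints, "h", "p"));
    }
}
```

Only the files whose indices are returned would need to be turned into input splits; all other files in the directory can be skipped without being opened.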

Attachments

MAPREDUCE-3086.v1.patch
11/Oct/11 14:03
55 kB
Binglin Chang

People

Assignee: Binglin Chang
Reporter: Binglin Chang
Votes: 0
Watchers: 7

Dates

Created: 25/Sep/11 08:17
Updated: 06/Feb/15 23:19