[HBASE-8369] MapReduce over snapshot files - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.98.0
Component/s: mapreduce, snapshots
Labels: None
Hadoop Flags: Reviewed
Release Note:

Added TableSnapshotInputFormat and TableSnapshotScanner for scanning HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter performs a single client-side scan over the snapshot files. Both can also be used while HBase is offline, against either in-place or exported snapshot files.

WARNING: This feature bypasses HBase-level security completely, because the files are read directly from HDFS. The user running the scan or job must have read permissions on the data files and snapshot files.

Description

The idea is to add an InputFormat that can run a MapReduce job directly over snapshot files, bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat: it takes a Scan object from the user, but instead of reading from an online table, it reads from a table snapshot. We create one split per region in the snapshot and open an HRegion inside the RecordReader. A RegionScanner is used internally to perform the scan without any HRegionServer involvement.
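
The "one split per region" idea can be illustrated with a small, HBase-free sketch. Everything below is hypothetical and for illustration only: Region, Split, overlaps, and splitsFor are stand-in names, not HBase classes or the code in the attached patches. Skipping regions that fall outside the scan's row range is an assumption added here, not something the description above states. An empty string stands for an unbounded key, and row keys are compared as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Conceptual sketch only: none of these types are HBase classes. */
final class SnapshotSplitSketch {

    /** Hypothetical stand-in for a region's key range [startKey, endKey); "" means unbounded. */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Hypothetical stand-in for an input split: exactly one region of the snapshot. */
    static final class Split {
        final String regionName;

        Split(String regionName) {
            this.regionName = regionName;
        }
    }

    /** True if the scan range [scanStart, scanStop) intersects the region's key range. */
    static boolean overlaps(Region r, String scanStart, String scanStop) {
        boolean startsBeforeRegionEnd = r.endKey.isEmpty() || scanStart.compareTo(r.endKey) < 0;
        boolean stopsAfterRegionStart = scanStop.isEmpty() || scanStop.compareTo(r.startKey) > 0;
        return startsBeforeRegionEnd && stopsAfterRegionStart;
    }

    /** Emits one split per region, skipping regions the scan range cannot touch (assumed step). */
    static List<Split> splitsFor(List<Region> regions, String scanStart, String scanStop) {
        List<Split> splits = new ArrayList<>();
        for (Region r : regions) {
            if (overlaps(r, scanStart, scanStop)) {
                splits.add(new Split(r.name));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region("r1", "", "g"),
                new Region("r2", "g", "p"),
                new Region("r3", "p", ""));
        // A full-table scan yields one split per region; a narrow scan yields fewer.
        System.out.println("full scan splits: " + splitsFor(regions, "", "").size());
        System.out.println("scan [h, k) splits: " + splitsFor(regions, "h", "k").size());
    }
}
```

In a real job each such split would be handed to a RecordReader that opens the corresponding region's files, as described above; the sketch stops at split planning.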

Users have been asking for ways to run MR jobs that read directly from HFiles, so this enables new use cases where reading stale data is acceptable:

- Take snapshots periodically, and run MR jobs only on the snapshots.
- Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster without an HBase cluster.
- (Future use case) Combine snapshot data with online HBase data: scan from yesterday's snapshot, but read today's data from the online HBase cluster.

Attachments

File | Date | Size | Author
hbase-8369_v0.patch | 17/Apr/13 22:32 | 73 kB | Enis Soztutar
hbase-8369_v11.patch | 19/Nov/13 04:33 | 152 kB | Enis Soztutar
hbase-8369_v5.patch | 30/Oct/13 01:53 | 160 kB | Enis Soztutar
hbase-8369_v6.patch | 30/Oct/13 20:59 | 148 kB | Enis Soztutar
hbase-8369_v7.patch | 05/Nov/13 03:13 | 151 kB | Enis Soztutar
hbase-8369_v8.patch | 13/Nov/13 02:42 | 151 kB | Enis Soztutar
hbase-8369_v9.patch | 13/Nov/13 22:35 | 150 kB | Enis Soztutar
HBASE-8369-0.94_v2.patch | 01/Jul/13 22:45 | 24 kB | Bryan Keller
HBASE-8369-0.94_v3.patch | 01/Jul/13 23:06 | 24 kB | Bryan Keller
HBASE-8369-0.94_v4.patch | 02/Jul/13 15:54 | 24 kB | Bryan Keller
HBASE-8369-0.94_v5.patch | 03/Jul/13 19:49 | 24 kB | Bryan Keller
HBASE-8369-0.94.patch | 01/Jul/13 21:16 | 23 kB | Bryan Keller
HBASE-8369-trunk_v1.patch | 02/Jul/13 00:03 | 24 kB | Bryan Keller
HBASE-8369-trunk_v2.patch | 02/Jul/13 15:54 | 24 kB | Bryan Keller
HBASE-8369-trunk_v3.patch | 03/Jul/13 19:49 | 24 kB | Bryan Keller

Issue Links

relates to: HBASE-10076 Backport MapReduce over snapshot files [0.94] (Closed)

Sub-Tasks

Backport parent issue to 0.96 (Closed, Michael Stack)

People

Assignee: Enis Soztutar

Reporter: Enis Soztutar

Votes: 2

Watchers: 39

Dates

Created: 17/Apr/13 22:28

Updated: 20/Nov/15 11:52

Resolved: 18/Nov/13 22:20