[HBASE-14150] Add BulkLoad functionality to HBase-Spark Module - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, connector-1.0.0
Component/s: hbase-connectors, spark
Labels: None
Hadoop Flags: Reviewed

Description

Build on the work done in HBASE-13992 by adding the ability to bulk load from a given RDD.

This will do the following:
1. Determine the number of regions, then sort and partition the data so that it is written out to HFiles correctly.
2. Unlike the MR bulk load, sort the columns in the shuffle stage rather than in the reducer's memory. This lets the design support very wide records without running out of memory.
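
The two steps above can be illustrated with a minimal sketch in plain Python. It is not the code from the attached patches: the names `region_index` and `partition_and_sort`, the sample region start keys, and the sample cells are all hypothetical. The in-memory sort only stands in for the shuffle-stage sort; in Spark, an operation such as `repartitionAndSortWithinPartitions` with a custom partitioner would do the equivalent work during the shuffle. The point it shows is that the sort key covers row, column family, and qualifier, so each record is a single cell and a wide row never has to be held and sorted as one unit.

```python
from bisect import bisect_right


def region_index(start_keys, row_key):
    """Return the index of the region whose key range contains row_key.

    start_keys must be sorted, and the first region starts at the empty key.
    """
    return bisect_right(start_keys, row_key) - 1


def partition_and_sort(cells, start_keys):
    """Group (row, family, qualifier, value) cells by region.

    Cells within each region are sorted by (row, family, qualifier),
    the order in which they would need to be written to an HFile.
    """
    partitions = [[] for _ in start_keys]
    for cell in cells:
        partitions[region_index(start_keys, cell[0])].append(cell)
    for part in partitions:
        # Stand-in for the shuffle-stage sort on the composite key.
        part.sort(key=lambda c: (c[0], c[1], c[2]))
    return partitions


# Hypothetical table with three regions: ["", "g"), ["g", "p"), ["p", end).
start_keys = [b"", b"g", b"p"]
cells = [
    (b"kiwi", b"cf", b"b", b"1"),
    (b"apple", b"cf", b"z", b"2"),
    (b"apple", b"cf", b"a", b"3"),
    (b"zebra", b"cf", b"a", b"4"),
    (b"g", b"cf", b"a", b"5"),
]
parts = partition_and_sort(cells, start_keys)
for i, part in enumerate(parts):
    print(i, part)
```

Each resulting partition corresponds to one region, so each could be written out as its own sorted HFile.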

Attachments

- HBASE-14150.1.patch (03/Aug/15 19:58, 43 kB, Theodore michael Malaska)
- HBASE-14150.2.patch (04/Aug/15 13:42, 42 kB, Theodore michael Malaska)
- HBASE-14150.3.patch (06/Aug/15 21:16, 47 kB, Theodore michael Malaska)
- HBASE-14150.4.patch (06/Aug/15 22:05, 47 kB, Theodore michael Malaska)
- HBASE-14150.5.patch (11/Aug/15 22:18, 47 kB, Theodore michael Malaska)

Issue Links

depends upon

- HBASE-13992 Integrate SparkOnHBase into HBase (Closed)

is depended upon by

- HBASE-14340 Add second bulk load option to Spark Bulk Load to send puts as the value (Closed)
- HBASE-14158 Add documentation for Initial Release for HBase-Spark Module integration (Closed)
- HBASE-14216 Consolidate MR and Spark BulkLoad shared functions and string consts (Closed)
- HBASE-14217 Add Java access to Spark bulk load functionality (Closed)

links to

Review Board


People

Assignee: Theodore michael Malaska

Reporter: Theodore michael Malaska

Votes: 0

Watchers: 8

Dates

Created: 23/Jul/15 00:38

Updated: 24/Jun/22 19:30

Resolved: 12/Aug/15 15:29