Description
A core concept of Apache Spark is the resilient distributed dataset (RDD), a "fault-tolerant collection of elements that can be operated on in parallel". One can create RDDs referencing a dataset in any external storage system that offers a Hadoop InputFormat, such as HBase's TableInputFormat and TableSnapshotInputFormat.
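For reference, such an RDD can already be created through Spark's generic newAPIHadoopRDD API. A minimal sketch in Scala (the table name "mytable" is a placeholder, and a running HBase cluster is assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-rdd"))

    // Point the InputFormat at the table to read.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")

    // Each RDD element is a (row key, Result) pair, one per HBase row;
    // Spark creates one partition per table region.
    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count())
    sc.stop()
  }
}
```

The integration proposed here would wrap this boilerplate in a friendlier, HBase-aware API.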
Ensure the integration is reasonable and provides good performance.
Add the ability to save RDDs back to HBase with a saveAsHBaseTable action, implicitly creating the necessary schema on demand.
Add support for filter transformations that push predicates down to the server as HBase filters.
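Filter pushdown of this kind can already be approximated by hand today: an HBase filter attached to a Scan is evaluated on the region servers, so non-matching rows never reach Spark. A hedged sketch, assuming TableMapReduceUtil.convertScanToString is accessible in the HBase version in use (the table, column, and value names are placeholders):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object FilterPushdownExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-filter-pushdown"))

    // Build a Scan carrying a server-side filter; only rows whose
    // cf:col equals "value" are returned by the region servers.
    val scan = new Scan()
    scan.setFilter(new SingleColumnValueFilter(
      Bytes.toBytes("cf"), Bytes.toBytes("col"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("value")))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")
    // Serialize the Scan into the property TableInputFormat reads.
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    val filtered = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(filtered.count())
    sc.stop()
  }
}
```

The proposed filter transformation would generate such a Scan automatically from Spark-side predicates.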
Consider supporting conversions between Scala and Java types and HBase data using the HBase types library.
Consider an option to lazily and automatically produce a snapshot only when needed, in a coordinated way. (Concurrently executing workers may want to materialize a table snapshot RDD at the same time.)
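The manual version of what that option would automate looks roughly like the following sketch, which reads an existing snapshot via TableSnapshotInputFormat (the snapshot name and restore directory are placeholders; the snapshot must have been taken beforehand):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object SnapshotRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-snapshot-rdd"))

    // Configure the InputFormat to read the snapshot's HFiles directly
    // from the filesystem, bypassing the region servers entirely.
    val job = Job.getInstance(HBaseConfiguration.create())
    TableSnapshotInputFormat.setInput(
      job, "mysnapshot", new Path("/tmp/snapshot-restore"))

    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count())
    sc.stop()
  }
}
```

The coordination problem mentioned above is exactly the snapshot-creation step this sketch omits: when several workers need the same snapshot RDD concurrently, only one of them should trigger the snapshot.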
Issue Links
- is duplicated by
  - HBASE-13992 Integrate SparkOnHBase into HBase (Closed)
- is related to
  - PHOENIX-1071 Provide integration for exposing Phoenix tables as Spark RDDs (Closed)
  - SPARK-2447 Add common solution for sending upsert actions to HBase (put, deletes, and increment) (Closed)