Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
A core concept of Apache Spark is the resilient distributed dataset (RDD), a "fault-tolerant collection of elements that can be operated on in parallel". One can create an RDD referencing a dataset in any external storage system offering a Hadoop InputFormat, such as PhoenixInputFormat (with PhoenixOutputFormat for writes). Beyond that baseline, there are opportunities for additional, deeper integration.
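For context, a minimal sketch of the InputFormat route as it stands, assuming the PhoenixMapReduceUtil and PhoenixInputFormat classes from Phoenix's MapReduce integration (package locations and setInput overloads vary by release); the COFFEES table and the CoffeeWritable class are hypothetical:

import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.db.DBWritable
import org.apache.phoenix.mapreduce.PhoenixInputFormat
import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical writable mapping one row of a COFFEES(VARIETY, ORIGIN) table.
class CoffeeWritable extends DBWritable {
  var variety: String = _
  var origin: String = _
  override def readFields(rs: ResultSet): Unit = {
    variety = rs.getString("VARIETY")
    origin = rs.getString("ORIGIN")
  }
  override def write(ps: PreparedStatement): Unit = {
    ps.setString(1, variety)
    ps.setString(2, origin)
  }
}

object PhoenixRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phoenix-rdd-sketch"))
    // Cluster connection details come from hbase-site.xml on the classpath.
    val job = Job.getInstance(HBaseConfiguration.create())
    PhoenixMapReduceUtil.setInput(job, classOf[CoffeeWritable], "COFFEES",
      "SELECT VARIETY, ORIGIN FROM COFFEES")
    // PhoenixInputFormat emits (NullWritable, T <: DBWritable) pairs.
    val rows = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[PhoenixInputFormat[CoffeeWritable]],
      classOf[NullWritable],
      classOf[CoffeeWritable])
    println(s"rows: ${rows.count()}")
    sc.stop()
  }
}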
Add the ability to save RDDs back to Phoenix with a saveAsPhoenixTable action, implicitly creating the necessary schema on demand, as in the sketch below.
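A hypothetical sketch of what that action could look like as an RDD extension; the saveAsPhoenixTable name comes from this proposal, while the key/value RDD signature, the jdbcUrl parameter, and the hard-coded two-column schema are illustrative assumptions. Real schema derivation would inspect the RDD's element type instead:

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

object SaveAsPhoenixTableSketch {
  // Extension-method style wrapper applying the proposed action to a
  // simple key/value RDD of strings.
  implicit class PhoenixRDDOps(rdd: RDD[(String, String)]) {
    def saveAsPhoenixTable(table: String, jdbcUrl: String): Unit = {
      rdd.foreachPartition { rows =>
        val conn = DriverManager.getConnection(jdbcUrl)
        try {
          // Implicitly create the schema on demand, per the proposal.
          conn.createStatement().execute(
            s"CREATE TABLE IF NOT EXISTS $table (K VARCHAR PRIMARY KEY, V VARCHAR)")
          val ps = conn.prepareStatement(s"UPSERT INTO $table VALUES (?, ?)")
          rows.foreach { case (k, v) =>
            ps.setString(1, k)
            ps.setString(2, v)
            ps.executeUpdate()
          }
          conn.commit() // Phoenix connections do not auto-commit by default
        } finally conn.close()
      }
    }
  }
}

With the implicit in scope, rdd.saveAsPhoenixTable("COFFEES_COPY", "jdbc:phoenix:localhost") would upsert each pair.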
Add support for filter transformations that push their predicates down to the server, as sketched below.
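Sketched here is the general shape of such a pushdown; the PhoenixTable wrapper and its query plumbing are hypothetical. Each filter call folds a predicate into the SQL handed to the server instead of evaluating it per row in Spark:

// Hypothetical wrapper: each filter narrows the server-side scan by
// growing the WHERE clause, rather than adding a per-row Spark stage.
case class PhoenixTable(table: String, predicates: Seq[String] = Nil) {
  def filter(predicate: String): PhoenixTable =
    copy(predicates = predicates :+ predicate)

  // The SELECT that would be handed to PhoenixInputFormat; Phoenix evaluates
  // the predicates on the region servers and can use keys/indexes to do so.
  def toSql: String = {
    val where =
      if (predicates.isEmpty) "" else predicates.mkString(" WHERE ", " AND ", "")
    s"SELECT * FROM $table$where"
  }
}

// PhoenixTable("COFFEES").filter("ORIGIN = 'GT'").toSql
//   ==> SELECT * FROM COFFEES WHERE ORIGIN = 'GT'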
Add a new select transformation supporting a LINQ-like DSL, for example:
// Count the number of different coffee varieties offered by each
// supplier from Guatemala
phoenixTable("coffees")
  .select(c => where(c.origin == "GT"))
  .countByKey()
  .foreach(r => println(r._1 + "=" + r._2))
Support conversions between Scala and Java types and Phoenix table data, for example:
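A sketch of the JDBC-level conversion involved, covering a few representative Phoenix types; the rowToMap helper is illustrative, the java.sql accessors are standard JDBC:

import java.sql.{ResultSet, Types}

object PhoenixTypeConversions {
  // Convert the current ResultSet row into Scala-friendly values keyed by
  // column label; only a few representative JDBC/Phoenix types are shown.
  def rowToMap(rs: ResultSet): Map[String, Any] = {
    val md = rs.getMetaData
    (1 to md.getColumnCount).map { i =>
      val value = md.getColumnType(i) match {
        case Types.INTEGER => rs.getInt(i)    // Phoenix INTEGER -> Int
        case Types.BIGINT  => rs.getLong(i)   // Phoenix BIGINT  -> Long
        case Types.VARCHAR => rs.getString(i) // Phoenix VARCHAR -> String
        case Types.DATE    => rs.getDate(i)   // Phoenix DATE    -> java.sql.Date
        case _             => rs.getObject(i) // fall back to the raw JDBC object
      }
      md.getColumnLabel(i) -> value
    }.toMap
  }
}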
Issue Links
- relates to:
  - PHOENIX-1811 Provide Java Wrappers to the Scala api in phoenix-spark module (Resolved)
  - HBASE-11482 Optimize HBase TableInput/OutputFormats for exposing tables and snapshots as Spark RDDs (Closed)
  - PHOENIX-1815 Use Spark Data Source API in phoenix-spark module (Closed)