[SPARK-2365] Add IndexedRDD, an efficient updatable key-value store - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Later
Affects Version/s: None
Fix Version/s: None
Component/s: GraphX, Spark Core
Labels: None

Description

RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage.

However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but a lookup currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins.
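
To make the cost concrete, here is a minimal sketch in plain Python (not Spark code) of what point lookups and point updates look like when a partition exposes only an iterator. The `partition` list and the `scan_lookup` and `scan_update` helpers are hypothetical names used for illustration only.

```python
from typing import Iterable, Iterator, Optional, Tuple

Entry = Tuple[int, str]

def scan_lookup(partition: Iterable[Entry], key: int) -> Optional[str]:
    """Point lookup over an iterator-only partition: walks entries until a match."""
    for k, v in partition:
        if k == key:
            return v
    return None

def scan_update(partition: Iterable[Entry], key: int, value: str) -> Iterator[Entry]:
    """Point update: every entry is re-emitted just to change one of them."""
    for k, v in partition:
        yield (k, value) if k == key else (k, v)

# A stand-in for one partition's contents (hypothetical data).
partition = [(i, f"v{i}") for i in range(5)]
updated = list(scan_update(partition, 3, "new"))
```

Both helpers touch a number of entries proportional to the partition size, which is the overhead the proposal aims to avoid.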

To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions.

It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions.
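
The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the IndexedRDD implementation: it assumes integer keys, uses a small 16-way trie with path copying as a stand-in for the per-partition index, and all names (`IndexedStore`, `trie_put`, `trie_get`, `trie_delete`) are hypothetical.

```python
from typing import Any, Optional

BITS = 4                      # 16-way branching; an arbitrary choice for this sketch
WIDTH = 1 << BITS
MASK = WIDTH - 1
EMPTY = (None,) * WIDTH       # an empty branch node (immutable tuple)

class Leaf:
    """Immutable key/value entry stored at the bottom of the trie."""
    __slots__ = ("key", "value")
    def __init__(self, key: int, value: Any):
        self.key, self.value = key, value

def trie_get(node, key: int) -> Optional[Any]:
    shift = 0
    while True:
        child = node[(key >> shift) & MASK]
        if child is None:
            return None
        if isinstance(child, Leaf):
            return child.value if child.key == key else None
        node, shift = child, shift + BITS

def trie_put(node, key: int, value: Any, shift: int = 0):
    """Return a new trie containing key -> value; only the path to the key is copied."""
    slot = (key >> shift) & MASK
    child = node[slot]
    if child is None or (isinstance(child, Leaf) and child.key == key):
        new_child = Leaf(key, value)          # insert, or overwrite (keys stay unique)
    elif isinstance(child, Leaf):
        # Two different keys share this slot: push both one level down.
        pushed = trie_put(EMPTY, child.key, child.value, shift + BITS)
        new_child = trie_put(pushed, key, value, shift + BITS)
    else:
        new_child = trie_put(child, key, value, shift + BITS)
    return node[:slot] + (new_child,) + node[slot + 1:]

def trie_delete(node, key: int, shift: int = 0):
    """Return a trie without key; returns the same node if key is absent."""
    slot = (key >> shift) & MASK
    child = node[slot]
    if child is None:
        return node
    if isinstance(child, Leaf):
        if child.key != key:
            return node
        new_child = None
    else:
        new_child = trie_delete(child, key, shift + BITS)
        if new_child is child:
            return node
    return node[:slot] + (new_child,) + node[slot + 1:]

class IndexedStore:
    """(1) hash-partitioned, (2) indexed per partition, (3) immutable versions."""
    def __init__(self, partitions):
        self.partitions = partitions          # tuple of trie roots

    @classmethod
    def empty(cls, num_partitions: int) -> "IndexedStore":
        return cls((EMPTY,) * num_partitions)

    def _pid(self, key: int) -> int:
        return hash(key) % len(self.partitions)

    def _with(self, pid: int, root) -> "IndexedStore":
        # Untouched partitions are shared between the old and new versions.
        return IndexedStore(self.partitions[:pid] + (root,) + self.partitions[pid + 1:])

    def get(self, key: int) -> Optional[Any]:
        return trie_get(self.partitions[self._pid(key)], key)

    def put(self, key: int, value: Any) -> "IndexedStore":
        pid = self._pid(key)
        return self._with(pid, trie_put(self.partitions[pid], key, value))

    def delete(self, key: int) -> "IndexedStore":
        pid = self._pid(key)
        return self._with(pid, trie_delete(self.partitions[pid], key))

v1 = IndexedStore.empty(4)
for k in range(100):
    v1 = v1.put(k, f"v{k}")
v2 = v1.put(42, "updated")    # v1 is unchanged
v3 = v2.delete(7)             # v2 is unchanged
```

Because each update copies only the path from the root to the affected entry, older versions such as `v1` stay valid and share most of their structure with newer ones, which matches the immutability that RDDs expect.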

GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and use as an execution strategy for Spark SQL.

Attachments

2014-07-07-IndexedRDD-design-review.pdf, 17/Jul/14 22:31, 439 kB, Ankur Dave

Issue Links

incorporates

SPARK-1955 VertexRDD can incorrectly assume index sharing (Resolved)

links to

[Github] Pull Request #1297 (ankurdave)

IndexedRDD on Spark Packages

Sub-Tasks

1.	Support for arbitrary key types in IndexedRDD	Resolved	Ankur Dave
2.	Extract IndexedRDD interface	Resolved	Ankur Dave
3.	Add log-structured updates with merge	Resolved	Ankur Dave
4.	Add a non-updatable implementation for read performance	Resolved	Ankur Dave
5.	Batch multiput updates within partitions	Resolved	Ankur Dave
6.	Move IndexedRDD from a pull request into a separate repository	Resolved	Ankur Dave

People

Assignee: Ankur Dave

Reporter: Ankur Dave

Votes: 32

Watchers: 70

Dates

Created: 04/Jul/14 03:23

Updated: 14/Sep/16 03:26

Resolved: 17/Dec/15 06:43