[SYSTEMDS-951] Efficient spark right indexing via lookup - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Task
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime
Labels:
None

Description

So far all versions of spark right indexing instructions require a full scan over the data set. In case of existing partitioning (which anyway happens for any external format - binary block conversion) such a full scan is unnecessary if we're only interested in a small subset of the data. This task adds an efficient right indexing operation via 'rdd lookups' which access at most <num_lookup> partitions given existing hash partitioning.

cc mwdusenb@us.ibm.com

In detail, this task covers the following improvements for spark matrix right indexing. Frames are not covered here because they allow variable-length blocks. Also, note that it is important to differentiate between in-core and out-of-core matrices: for in-core matrices (i.e., matrices that fit in deserialized form into aggregated memory), the full scan is actually not problematic as the filter operation only scans keys without touching the actual values.

(1) Scan-based indexing w/o aggregation: So far, we apply aggregations to merge partial blocks very conservatively. However, if the indexing range is block aligned (e.g., dimension start at block boundary or range within single block) this is unnecessary. This alone led to a 2x improvement for indexing row batches out of an in-core matrix.

(2) Single-block lookup: If the indexing range covers a subrange of a single block, we directly perform a lookup. On in-core matrices this gives a minor improvement (but does not hurt) while on out-of-core matrices, the improvement is huge in case of existing partitioner as we only have to scan a single partition instead of the entire data.

(3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for a list of keys. So the next best option is a data-query join (in case of existing partitioner) with data.join(filter).map(), which works very well for in-core data sets, but for out-of-core datasets, unfortunately, does not exploit the potential for partition pruning and thus reads the entire data. I also experimented with a custom multi-block lookup that runs multiple lookups in a multi-threaded fashion - this gave the expected pruning but was very ugly due to an unbounded number of jobs.

In conclusion, I'll create a patch for scenarios (1) and (2), while scenario (3) requires some more thoughts and is postponed after the 0.11 release. One idea would be to create a custom RDD that implements lookup(List<T> keys) by constructing a pruned set of input partitions via partitioner.getPartition(key). cc freiss niketanpansare reinwald

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

mnist_softmax_v1.dml
07/Oct/16 00:56
7 kB
Mike Dusenberry
mnist_softmax_v2.dml
07/Oct/16 00:56
7 kB
Mike Dusenberry

Issue Links

is depended upon by

SYSTEMDS-1160 Enable Prefetching of Mini-Batches

Open

is related to

SYSTEMDS-540 Deep Learning

In Progress

SYSTEMDS-266 RDD-partitioning-aware compiler/runtime

Closed

Activity

People

Assignee:: Matthias Boehm

Reporter:: Matthias Boehm

Votes:: 1 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 23/Sep/16 04:33

Updated:: 23/Jan/17 20:13