Details
-
Task
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
So far all versions of spark right indexing instructions require a full scan over the data set. In case of existing partitioning (which anyway happens for any external format - binary block conversion) such a full scan is unnecessary if we're only interested in a small subset of the data. This task adds an efficient right indexing operation via 'rdd lookups' which access at most <num_lookup> partitions given existing hash partitioning.
In detail, this task covers the following improvements for spark matrix right indexing. Frames are not covered here because they allow variable-length blocks. Also, note that it is important to differentiate between in-core and out-of-core matrices: for in-core matrices (i.e., matrices that fit in deserialized form into aggregated memory), the full scan is actually not problematic as the filter operation only scans keys without touching the actual values.
(1) Scan-based indexing w/o aggregation: So far, we apply aggregations to merge partial blocks very conservatively. However, if the indexing range is block aligned (e.g., dimension start at block boundary or range within single block) this is unnecessary. This alone led to a 2x improvement for indexing row batches out of an in-core matrix.
(2) Single-block lookup: If the indexing range covers a subrange of a single block, we directly perform a lookup. On in-core matrices this gives a minor improvement (but does not hurt) while on out-of-core matrices, the improvement is huge in case of existing partitioner as we only have to scan a single partition instead of the entire data.
(3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for a list of keys. So the next best option is a data-query join (in case of existing partitioner) with data.join(filter).map(), which works very well for in-core data sets, but for out-of-core datasets, unfortunately, does not exploit the potential for partition pruning and thus reads the entire data. I also experimented with a custom multi-block lookup that runs multiple lookups in a multi-threaded fashion - this gave the expected pruning but was very ugly due to an unbounded number of jobs.
In conclusion, I'll create a patch for scenarios (1) and (2), while scenario (3) requires some more thoughts and is postponed after the 0.11 release. One idea would be to create a custom RDD that implements lookup(List<T> keys) by constructing a pruned set of input partitions via partitioner.getPartition(key). cc freiss niketanpansare reinwald
Attachments
Attachments
Issue Links
- is depended upon by
-
SYSTEMDS-1160 Enable Prefetching of Mini-Batches
- Open
- is related to
-
SYSTEMDS-540 Deep Learning
- In Progress
-
SYSTEMDS-266 RDD-partitioning-aware compiler/runtime
- Closed