Uploaded image for project: 'SystemDS'
  1. SystemDS
  2. SYSTEMDS-951

Efficient spark right indexing via lookup

    XMLWordPrintableJSON

Details

    • Task
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Runtime
    • None

    Description

      So far all versions of spark right indexing instructions require a full scan over the data set. In case of existing partitioning (which anyway happens for any external format - binary block conversion) such a full scan is unnecessary if we're only interested in a small subset of the data. This task adds an efficient right indexing operation via 'rdd lookups' which access at most <num_lookup> partitions given existing hash partitioning.

      cc mwdusenb@us.ibm.com

      In detail, this task covers the following improvements for spark matrix right indexing. Frames are not covered here because they allow variable-length blocks. Also, note that it is important to differentiate between in-core and out-of-core matrices: for in-core matrices (i.e., matrices that fit in deserialized form into aggregated memory), the full scan is actually not problematic as the filter operation only scans keys without touching the actual values.

      (1) Scan-based indexing w/o aggregation: So far, we apply aggregations to merge partial blocks very conservatively. However, if the indexing range is block aligned (e.g., dimension start at block boundary or range within single block) this is unnecessary. This alone led to a 2x improvement for indexing row batches out of an in-core matrix.

      (2) Single-block lookup: If the indexing range covers a subrange of a single block, we directly perform a lookup. On in-core matrices this gives a minor improvement (but does not hurt) while on out-of-core matrices, the improvement is huge in case of existing partitioner as we only have to scan a single partition instead of the entire data.

      (3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for a list of keys. So the next best option is a data-query join (in case of existing partitioner) with data.join(filter).map(), which works very well for in-core data sets, but for out-of-core datasets, unfortunately, does not exploit the potential for partition pruning and thus reads the entire data. I also experimented with a custom multi-block lookup that runs multiple lookups in a multi-threaded fashion - this gave the expected pruning but was very ugly due to an unbounded number of jobs.

      In conclusion, I'll create a patch for scenarios (1) and (2), while scenario (3) requires some more thoughts and is postponed after the 0.11 release. One idea would be to create a custom RDD that implements lookup(List<T> keys) by constructing a pruned set of input partitions via partitioner.getPartition(key). cc freiss niketanpansare reinwald

      Attachments

        1. mnist_softmax_v1.dml
          7 kB
          Mike Dusenberry
        2. mnist_softmax_v2.dml
          7 kB
          Mike Dusenberry

        Issue Links

          Activity

            People

              mboehm7 Matthias Boehm
              mboehm7 Matthias Boehm
              Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: