Details
- Type: Epic
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Epic Name: Language and runtime for parameter servers
Description
SystemML already provides a rich set of execution strategies, ranging from local operations to large-scale computation on MapReduce or Spark. In this context, we support both data-parallel (multi-threaded or distributed operations) and task-parallel (multi-threaded or distributed parfor loops) computation. This epic aims to complement the existing execution strategies with language and runtime primitives for parameter servers, i.e., model-parallel execution. We use the terminology of model-parallel execution with distributed data and a distributed model to differentiate it from the existing data-parallel operations. Target applications are distributed deep learning and mini-batch algorithms in general. These new abstractions will help make SystemML a unified framework for small- and large-scale machine learning that supports all three major execution strategies in a single framework.
A major challenge is the integration of stateful parameter servers and their common push/pull primitives into an otherwise functional (and thus stateless) language. We approach this challenge via a new built-in function paramserv, which internally maintains state but at the same time fits into the runtime framework of stateless operations.
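To illustrate the idea of hiding server state behind a single function call, here is a minimal Python sketch (not the actual SystemML implementation): a `ParamServer` class owns the mutable model and exposes push/pull under a lock, while a hypothetical `paramserv` wrapper looks purely functional to the caller, taking a model and a gradient function and returning the final model.

```python
import threading

class ParamServer:
    """Minimal in-memory parameter server: workers pull the current
    model and push gradients; updates are applied under a lock."""
    def __init__(self, model):
        self._model = dict(model)   # server-side copy; callers never mutate it
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return dict(self._model)

    def push(self, grads, lr=0.1):
        with self._lock:
            for k, g in grads.items():
                self._model[k] -= lr * g

def paramserv(model, grad_fn, num_workers=2, steps=5):
    """Functional-looking entry point (hypothetical API): all mutable
    state lives inside the server; the caller receives a fresh model."""
    ps = ParamServer(model)
    def worker():
        for _ in range(steps):
            ps.push(grad_fn(ps.pull()))
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ps.pull()
```

The caller's view stays stateless: the input model is never mutated, which is the property that lets such a builtin coexist with an otherwise side-effect-free language runtime.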
Furthermore, we are interested in providing (1) different runtime backends (local and distributed), (2) different parameter server modes (synchronous, asynchronous, Hogwild!, stale-synchronous), (3) different update frequencies (batch, multi-batch, epoch), as well as (4) different architectures for distributed data (one parameter server, k workers) and a distributed model (k1 parameter servers, k2 workers).
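The difference between the synchronous and asynchronous modes can be sketched in a few lines of Python (a toy scalar model, not the SystemML runtime): bulk-synchronous (BSP) aggregates all workers' gradients at a barrier before each model update, whereas asynchronous (ASP) applies every push immediately.

```python
def run_bsp(grads_per_worker, w=0.0, lr=0.1):
    """Bulk-synchronous: per step, all workers' gradients are
    averaged at a barrier, then the model is updated once."""
    steps = len(grads_per_worker[0])
    for t in range(steps):
        avg = sum(g[t] for g in grads_per_worker) / len(grads_per_worker)
        w -= lr * avg
    return w

def run_asp(grads_per_worker, w=0.0, lr=0.1):
    """Asynchronous: each push is applied as it arrives, no barrier
    (serialized here for determinism; real workers interleave freely)."""
    for worker_grads in grads_per_worker:
        for g in worker_grads:
            w -= lr * g
    return w
```

Hogwild! and stale-synchronous modes sit between these extremes: Hogwild! drops the lock entirely and tolerates lost updates, while stale-synchronous bounds how far the fastest worker may run ahead of the slowest.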
Note for GSoC students: This is a large project that will be broken down into sub-projects, so everybody will have their share of the pie.
Prerequisites: Java; machine learning experience is a plus but not required.
Attachments
Issue Links
- is duplicated by: SYSTEMDS-739 Explore model-parallel constructs in DML (Closed)
- is related to: SYSTEMDS-2379 Performance sample operations (Closed)
Issues in epic

| Issue | Summary | Status | Assignee |
| --- | --- | --- | --- |
| SYSTEMDS-2397 | Paramserv ASP failing w/ OOM (too many threads) | Closed | LI Guobao |
| SYSTEMDS-2398 | Paramserv ASP aggregation overhead on update per epoch | Closed | LI Guobao |
| SYSTEMDS-2399 | Paramserv list indexing slicing w/o dynamic recompile fails | Closed | Matthias Boehm |
| SYSTEMDS-2401 | Paramserv worker number not set as user expected | Closed | LI Guobao |
| SYSTEMDS-2403 | Paramserv low accuracy sometimes occurred | Closed | LI Guobao |
| SYSTEMDS-2406 | Paramserv low accuracy with EPOCH update frequency | Closed | LI Guobao |
| SYSTEMDS-2412 | Paramserv "all the same accuracy" problem | Closed | LI Guobao |
| SYSTEMDS-2413 | Paramserv performance bottleneck on the DAG recompilation | Closed | LI Guobao |
| SYSTEMDS-2414 | Paramserv zero accuracy with Overlap_Reshuffle | Closed | LI Guobao |
| SYSTEMDS-2440 | Got zero when casting an element of list | Closed | Matthias Boehm |
| SYSTEMDS-2469 | Large distributed paramserv overheads | Resolved | LI Guobao |
| SYSTEMDS-2476 | Unexpected mapreduce task | Closed | Matthias Boehm |
| SYSTEMDS-2477 | NPE when copying list object | Resolved | LI Guobao |
| SYSTEMDS-2478 | Overhead when using parfor in update func | Open | Unassigned |
| SYSTEMDS-2482 | Unexpected cleanup of list object | Resolved | Unassigned |