Details
- Type: Epic
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Epic Name: Language and runtime for parameter servers
Description
SystemML already provides a rich set of execution strategies, ranging from local operations to large-scale computation on MapReduce or Spark. In this context, we support both data-parallel (multi-threaded or distributed operations) and task-parallel (multi-threaded or distributed parfor loops) computation. This epic aims to complement the existing execution strategies with language and runtime primitives for parameter servers, i.e., model-parallel execution. We use the terminology of model-parallel execution with distributed data and a distributed model to differentiate it from the existing data-parallel operations. Target applications are distributed deep learning and mini-batch algorithms in general. These new abstractions will help make SystemML a unified framework for small- and large-scale machine learning that supports all three major execution strategies in a single framework.
A major challenge is the integration of stateful parameter servers and their common push/pull primitives into an otherwise functional (and thus stateless) language. We approach this challenge via a new built-in function paramserv, which internally maintains state but at the same time fits into the runtime framework of stateless operations.
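To illustrate the idea of hiding server state behind a single function call, here is a minimal Python sketch (not the actual SystemML implementation): a `ParamServer` class owns the mutable model and exposes push/pull under a lock, while a hypothetical `paramserv` wrapper looks purely functional to the caller, taking a model and a gradient function and returning the final model.

```python
import threading

class ParamServer:
    """Minimal in-memory parameter server: workers pull the current
    model and push gradients; updates are applied under a lock."""
    def __init__(self, model):
        self._model = dict(model)   # server-side copy; callers never mutate it
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return dict(self._model)

    def push(self, grads, lr=0.1):
        with self._lock:
            for k, g in grads.items():
                self._model[k] -= lr * g

def paramserv(model, grad_fn, num_workers=2, steps=5):
    """Functional-looking entry point (hypothetical API): all mutable
    state lives inside the server; the caller receives a fresh model."""
    ps = ParamServer(model)
    def worker():
        for _ in range(steps):
            ps.push(grad_fn(ps.pull()))
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ps.pull()
```

The caller's view stays stateless: the input model is never mutated, which is the property that lets such a builtin coexist with an otherwise side-effect-free language runtime.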
Furthermore, we are interested in providing (1) different runtime backends (local and distributed), (2) different parameter server modes (synchronous, asynchronous, Hogwild!, stale-synchronous), (3) different update frequencies (batch, multi-batch, epoch), as well as (4) different architectures for distributed data (one parameter server, k workers) and a distributed model (k1 parameter servers, k2 workers).
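The difference between the synchronous and asynchronous modes can be sketched in a few lines of Python (a toy scalar model, not the SystemML runtime): bulk-synchronous (BSP) aggregates all workers' gradients at a barrier before each model update, whereas asynchronous (ASP) applies every push immediately.

```python
def run_bsp(grads_per_worker, w=0.0, lr=0.1):
    """Bulk-synchronous: per step, all workers' gradients are
    averaged at a barrier, then the model is updated once."""
    steps = len(grads_per_worker[0])
    for t in range(steps):
        avg = sum(g[t] for g in grads_per_worker) / len(grads_per_worker)
        w -= lr * avg
    return w

def run_asp(grads_per_worker, w=0.0, lr=0.1):
    """Asynchronous: each push is applied as it arrives, no barrier
    (serialized here for determinism; real workers interleave freely)."""
    for worker_grads in grads_per_worker:
        for g in worker_grads:
            w -= lr * g
    return w
```

Hogwild! and stale-synchronous modes sit between these extremes: Hogwild! drops the lock entirely and tolerates lost updates, while stale-synchronous bounds how far the fastest worker may run ahead of the slowest.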
Note for GSoC students: This is a large project that will be broken down into sub-projects, so everybody will have their share of the pie.
Prerequisites: Java; machine learning experience is a plus but not required.
Attachments
Issue Links
- is duplicated by: SYSTEMDS-739 Explore model-parallel constructs in DML (Closed)
- is related to: SYSTEMDS-2379 Performance sample operations (Closed)
Issues in epic

| Issue | Summary | Status | Assignee |
| --- | --- | --- | --- |
| SYSTEMDS-2397 | Paramserv ASP failing w/ OOM (too many threads) | Closed | LI Guobao |
| SYSTEMDS-2398 | Paramserv ASP aggregation overhead on update per epoch | Closed | LI Guobao |
| SYSTEMDS-2399 | Paramserv list indexing slicing w/o dynamic recompile fails | Closed | Matthias Boehm |
| SYSTEMDS-2401 | Paramserv worker number not set as user expected | Closed | LI Guobao |
| SYSTEMDS-2403 | Paramserv low accuracy sometimes occurred | Closed | LI Guobao |
| SYSTEMDS-2406 | Paramserv low accuracy with EPOCH update frequency | Closed | LI Guobao |
| SYSTEMDS-2412 | Paramserv "all the same accuracy" problem | Closed | LI Guobao |
| SYSTEMDS-2413 | Paramserv performance bottleneck on the DAG recompilation | Closed | LI Guobao |
| SYSTEMDS-2414 | Paramserv zero accuracy with Overlap_Reshuffle | Closed | LI Guobao |
| SYSTEMDS-2440 | Got zero when casting an element of list | Closed | Matthias Boehm |
| SYSTEMDS-2469 | Large distributed paramserv overheads | Resolved | LI Guobao |
| SYSTEMDS-2476 | Unexpected mapreduce task | Closed | Matthias Boehm |
| SYSTEMDS-2477 | NPE when copying list object | Resolved | LI Guobao |
| SYSTEMDS-2478 | Overhead when using parfor in update func | Open | Unassigned |
| SYSTEMDS-2482 | Unexpected cleanup of list object | Resolved | Unassigned |