[SYSTEMDS-1760] Improve engine robustness of distributed SGD training - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Algorithms, Compiler, ParFor
Labels: None

Epic Link: Deep Learning DML Library
Sprint: Sprint 2

Description

Currently, we have a mathematical framework in place for training with distributed SGD in a distributed MNIST LeNet example. This task aims to run this at scale to determine (1) the current behavior of the engine (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to improve the robustness and performance for this scenario. The distributed SGD framework from this example has already been ported into Caffe2DML, so improvements made for this task will directly benefit our efforts towards distributed training of Caffe models (and Keras models in the future).
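
For readers unfamiliar with the scheme, the synchronous variant of distributed SGD referenced above can be sketched as follows. This is a minimal plain-Python illustration on a toy linear model, not the DML code from the MNIST LeNet example; all function names are hypothetical, and the loop over shards only stands in for whatever parallel construct (e.g. a parfor loop) the engine would use.

```python
# Sketch of synchronous data-parallel SGD on a toy model y = w*x + b.
# All names are hypothetical; this is not the DML code from the example.

def shard_gradient(w, b, shard):
    """Mean-squared-error gradient over one worker's shard of (x, y) pairs."""
    n = len(shard)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return gw, gb

def sync_sgd_step(w, b, shards, lr):
    """One synchronous step: every worker computes a gradient on its shard,
    the gradients are averaged, and a single shared update is applied."""
    # This loop is the part that would run in parallel across workers.
    grads = [shard_gradient(w, b, shard) for shard in shards]
    gw = sum(g[0] for g in grads) / len(grads)
    gb = sum(g[1] for g in grads) / len(grads)
    return w - lr * gw, b - lr * gb

def train(data, num_workers=4, lr=0.1, epochs=2000):
    """Split the data round-robin into one shard per worker, then iterate."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        w, b = sync_sgd_step(w, b, shards, lr)
    return w, b

# Toy data generated from y = 3x + 1, so training should recover w=3, b=1.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = train(data)
```

Because every worker waits for the averaged gradient before the next step, the result matches single-node training on the combined batch when the shards are of equal size; the open question in this task is whether the engine actually executes the per-shard work in a distributed fashion.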

Attachments

Runtime_Table.png (42 kB), attached 28/Jul/17 17:16 by Fei Hu

Issue Links

is a parent of

SYSTEMDS-1774 Improve Parfor parallelism for deep learning (Closed)

relates to

SYSTEMDS-1563 Add a distributed synchronous SGD MNIST LeNet example (Closed)

People

Assignee: Fei Hu

Reporter: Mike Dusenberry

Votes: 0

Watchers: 2

Dates

Created: 12/Jul/17 00:26

Updated: 03/Aug/17 16:59