[MXNET-1374] Extensible for distributed training - ASF JIRA

Details

Type: New Feature
Status: To Do
Priority: Minor
Resolution: Unresolved
Component/s: Gluon
Labels: None
Epic Link: Gluon Fit API

Description

Horovod: use a custom trainer.
Parameter Server: batch_fn and trainer.step should be the same as in single-node multi-GPU training.
Decide on a convention: either mean(loss) with step(1), or step(batch_size). Note that batch_size is per device in Horovod and per worker with the Parameter Server.
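
To illustrate the convention question, here is a minimal plain-Python sketch (not MXNet code). It assumes that trainer.step(batch_size) rescales the gradient by 1/batch_size, as Gluon's Trainer documents. The helper names (grad_per_sample, sgd_step), the toy data, and the learning rate are hypothetical and chosen only for illustration.

```python
# Toy comparison of the two conventions on a 1-D least-squares problem.
# sgd_step is a hypothetical stand-in for trainer.step, not an MXNet API.

def grad_per_sample(w, x, y):
    # Gradient with respect to w of 0.5 * (w * x - y) ** 2
    return (w * x - y) * x

def sgd_step(w, grad, lr, batch_size):
    # Assumed behaviour: the gradient is rescaled by 1 / batch_size
    return w - lr * grad / batch_size

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0, lr = 0.5, 0.01
n = len(xs)

# Convention A: sum-reduced loss, then step(batch_size)
grad_sum = sum(grad_per_sample(w0, x, y) for x, y in zip(xs, ys))
w_a = sgd_step(w0, grad_sum, lr, batch_size=n)

# Convention B: mean-reduced loss, then step(1)
grad_mean = grad_sum / n
w_b = sgd_step(w0, grad_mean, lr, batch_size=1)

# Mixing the two (mean loss AND step(batch_size)) shrinks the update by n
w_mixed = sgd_step(w0, grad_mean, lr, batch_size=n)

print(w_a, w_b, w_mixed)
```

Under these assumptions both conventions produce the same update, while mixing them divides by the batch size twice. In the distributed case the same care applies to which batch_size is passed, since it is per device in Horovod and per worker with the Parameter Server.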

People

Assignee: Unassigned

Reporter: Lai Wei

Votes: 0

Watchers: 1

Dates

Created: 27/Mar/19 12:47

Updated: 27/Mar/19 12:47