  1. Apache MXNet (Retired)
  2. MXNET-1374

Extensible for distributed training

Details

    • Type: New Feature
    • Status: To Do
    • Priority: Minor
    • Resolution: Unresolved
    • Component: Gluon
    • None

    Description

      1. Horovod: use a custom trainer
      2. Parameter Server: batch_fn and trainer.step should behave the same as in single-node multi-GPU training
      3. Decide on a convention: either mean(loss) with step(1), or step(batch_size). Note that batch_size is per device in Horovod but per worker with the Parameter Server
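
      A minimal sketch of the two conventions in item 3, in plain Python rather than MXNet. It assumes the trainer step rescales the gradient by 1/batch_size, as Gluon's trainer.step(batch_size) does. The names sgd_step and per_sample_grads and the toy data are hypothetical, introduced only for illustration.

```python
# Hypothetical stand-in for a trainer step; this is not MXNet code.
def sgd_step(weight, grad, lr, batch_size):
    """Apply one SGD update, rescaling the gradient by 1/batch_size."""
    return weight - lr * grad / batch_size


def per_sample_grads(weight, xs, ys):
    """Per-sample gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return [(weight * x - y) * x for x, y in zip(xs, ys)]


# Toy data for a one-parameter linear model (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0, lr = 0.5, 0.01
grads = per_sample_grads(w0, xs, ys)
n = len(grads)

# Convention A: gradient of mean(loss), then step(1).
w_mean = sgd_step(w0, sum(grads) / n, lr, batch_size=1)

# Convention B: gradient of the summed loss, then step(batch_size).
w_sum = sgd_step(w0, sum(grads), lr, batch_size=n)

# Mixing the two (mean(loss) with step(batch_size)) normalizes twice,
# so the update shrinks by a further factor of batch_size.
w_mixed = sgd_step(w0, sum(grads) / n, lr, batch_size=n)
```

      Under this assumption, conventions A and B give the same update, while mixing them shrinks it. The same reasoning is why the meaning of batch_size matters: a per-device value (Horovod) and a per-worker value (Parameter Server) lead to different normalization unless one convention is fixed.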

          People

            Assignee: Unassigned
            Reporter: Lai Wei (roywei)
