Details
Type: New Feature
Status: To Do
Priority: Minor
Resolution: Unresolved
None
Description
- Horovod: use a custom trainer
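- Parameter Server: batch_fn and trainer.step should behave the same as in single-node multi-GPU training
- Decide on a convention for loss reduction: either mean(loss) with step(1), or summed loss with step(batch_size). Note that batch_size means different things in the two setups: in Horovod it is per device, in PS it is per worker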
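The two conventions in the last item can be compared with a small, framework-free sketch. It assumes, as with Gluon's `Trainer.step(batch_size)`, that the argument passed to step divides the accumulated gradient. `trainer_step` and `global_batch_size` are hypothetical helpers for illustration, not MXNet or Horovod APIs. The sketch checks that mean(loss) with step(1) and summed loss with step(batch_size) give the same update, and shows how a configured batch_size maps to a global batch under each backend.

```python
# Framework-free sketch (plain Python, no MXNet/Horovod calls).
# Assumption: the trainer divides the accumulated gradient by the step argument,
# as Gluon's Trainer.step(batch_size) does by default.


def grad_sum_loss(w, xs, ys):
    """Gradient of sum_i (w*x_i - y_i)^2 with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def grad_mean_loss(w, xs, ys):
    """Gradient of mean_i (w*x_i - y_i)^2 with respect to w."""
    return grad_sum_loss(w, xs, ys) / len(xs)


def trainer_step(grad, step_arg):
    """Hypothetical stand-in for trainer.step(n): scale the gradient by 1/n."""
    return grad / step_arg


def global_batch_size(batch_size, devices_per_worker, num_workers, backend):
    """Hypothetical helper: global batch implied by a configured batch_size."""
    if backend == "horovod":
        # One process per device, so batch_size counts samples per device.
        return batch_size * devices_per_worker * num_workers
    if backend == "ps":
        # One process per worker; batch_fn splits its batch across devices.
        return batch_size * num_workers
    raise ValueError(f"unknown backend: {backend}")


xs, ys, w = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 0.5
update_mean = trainer_step(grad_mean_loss(w, xs, ys), 1)        # mean(loss), step(1)
update_sum = trainer_step(grad_sum_loss(w, xs, ys), len(xs))    # sum(loss), step(batch_size)
```

Whichever convention is chosen, the value passed to step has to match the number of samples the loss was summed over on that process, which differs between Horovod (per device) and PS (per worker).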