SINGA-32: Implement AllReduce training framework


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed

    Description

      The AllReduce training framework runs in synchronous mode: a worker starts the next iteration only after all workers have finished the previous one. Baidu's Deep Image system uses this training framework.

      To implement it in SINGA, we launch one worker group and one server group. The model is partitioned among all workers (e.g., on dimension 0), and the Params are sliced and distributed among all servers.
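
      As an illustration of the slicing, here is a minimal C++ sketch (all names are hypothetical, not SINGA's actual API) that cuts one Param of param_len values into near-equal slices and assigns slice i to server i:

      // Sketch: cut a Param into near-equal slices, one per server.
      // All names here are hypothetical, for illustration only.
      #include <cstdio>
      #include <vector>

      struct Slice {
        int param_id;   // the Param this slice belongs to
        int offset;     // start index within the Param's values
        int length;     // number of values in this slice
        int server_id;  // server responsible for updating this slice
      };

      std::vector<Slice> SliceParam(int param_id, int param_len, int num_servers) {
        std::vector<Slice> slices;
        int base = param_len / num_servers, rem = param_len % num_servers;
        int offset = 0;
        for (int s = 0; s < num_servers; ++s) {
          int len = base + (s < rem ? 1 : 0);  // spread the remainder evenly
          slices.push_back({param_id, offset, len, s});
          offset += len;
        }
        return slices;
      }

      int main() {
        for (const Slice& s : SliceParam(0, 10, 3))
          std::printf("param %d [%d, %d) -> server %d\n",
                      s.param_id, s.offset, s.offset + s.length, s.server_id);
        return 0;
      }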

      At the beginning, each Param slice is put into the server shard, together with the number of workers that compute gradients for it.

      In each iteration, the local stub aggregates all gradients for the same Param and sends them to the corresponding server, together with the number of local workers that computed them. The server buffers these update requests and does not update a Param slice until it has received gradients from all workers; it then sends the updated Param slices back to the corresponding processes (stubs).
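
      The buffered update on the server side could look like the following sketch (again with hypothetical names, not SINGA's actual classes; the plain-SGD step on the averaged gradient is an assumption, since the issue does not specify the update rule). Gradients for a slice are accumulated, and the update is applied only once the count of contributing workers reaches the expected total:

      #include <cassert>
      #include <cstdio>
      #include <vector>

      struct SliceState {
        std::vector<float> values;    // current values of the Param slice
        std::vector<float> grad_sum;  // gradients buffered for this iteration
        int expected_workers = 0;     // total workers computing this gradient
        int received = 0;             // worker gradients received so far
      };

      // Called for each update request; num_workers is the number of workers
      // whose gradients the sending stub already folded into grad. Returns
      // true once the slice is updated and should be sent back to the stubs.
      bool HandleUpdate(SliceState* s, const std::vector<float>& grad,
                        int num_workers, float lr) {
        assert(grad.size() == s->values.size());
        for (size_t i = 0; i < grad.size(); ++i)
          s->grad_sum[i] += grad[i];
        s->received += num_workers;
        if (s->received < s->expected_workers)
          return false;  // keep buffering until every worker has contributed
        // All gradients are in: apply one update step (plain SGD on the
        // averaged gradient, as an example) and reset the buffer.
        for (size_t i = 0; i < s->values.size(); ++i) {
          s->values[i] -= lr * s->grad_sum[i] / s->expected_workers;
          s->grad_sum[i] = 0.f;
        }
        s->received = 0;
        return true;  // caller now sends the updated slice back to each stub
      }

      int main() {
        SliceState s;
        s.values = {1.f, 2.f};
        s.grad_sum = {0.f, 0.f};
        s.expected_workers = 2;
        HandleUpdate(&s, {0.5f, 0.5f}, 1, 0.1f);       // buffered, no update yet
        if (HandleUpdate(&s, {0.5f, 0.5f}, 1, 0.1f))   // second worker arrives
          std::printf("updated: %.2f %.2f\n", s.values[0], s.values[1]);
        return 0;
      }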



          People

            Assignee: wangwei.cs wangwei
            Reporter: wangwei.cs wangwei