Ignite / IGNITE-6783

Create common mechanism for group training.


Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: ml
    • Labels: None

    Description

  In distributed ML it is a common task to train several models in parallel, with the ability for them to communicate with each other during training. A simple example is training a neural network with SGD on different chunks of data located on several nodes. In such training we do the following in a loop: on each node we perform one or several SGD steps, then send the gradient to a central node, which averages the gradients from all worker nodes and sends the averaged gradient back. There is a pattern in this procedure that can be applied to other ML algorithms, and it would be useful to extract this pattern.
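
  The loop described above can be sketched roughly as follows. This is an illustrative sketch only, not the Ignite ML API: the class and method names (GroupTrainingSketch, averageGradients, trainingRound) are hypothetical, and the worker gradients are passed in directly instead of being computed on remote nodes.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of one synchronous round of group training:
 * each worker produces a local gradient, a coordinator averages them,
 * and every worker applies the same averaged gradient.
 */
public class GroupTrainingSketch {

    /** Coordinator step: component-wise average of the workers' gradients. */
    static double[] averageGradients(List<double[]> workerGradients) {
        int dim = workerGradients.get(0).length;
        double[] avg = new double[dim];
        for (double[] g : workerGradients)
            for (int i = 0; i < dim; i++)
                avg[i] += g[i] / workerGradients.size();
        return avg;
    }

    /** One round: average the local gradients, then apply the SGD update. */
    static double[] trainingRound(double[] weights, List<double[]> localGradients, double learningRate) {
        double[] avgGrad = averageGradients(localGradients);
        double[] updated = weights.clone();
        for (int i = 0; i < updated.length; i++)
            updated[i] -= learningRate * avgGrad[i]; // same update on all workers
        return updated;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.0};
        List<double[]> grads = Arrays.asList(
            new double[]{2.0, 0.0},   // gradient from worker 1
            new double[]{0.0, 2.0});  // gradient from worker 2
        // Averaged gradient is {1.0, 1.0}; with lr = 0.5 the weights become {0.5, 0.5}.
        System.out.println(Arrays.toString(trainingRound(weights, grads, 0.5)));
    }
}
```

  The pattern to extract is the pair of roles: a per-worker local step and a coordinator-side aggregation step, repeated until convergence; gradient averaging for SGD is just one instance of it.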


          People

            Assignee: amalykh Artem Malykh
            Reporter: amalykh Artem Malykh
            Votes: 0
            Watchers: 4
