Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Story
`As a MADlib developer`
I want to investigate convergence behavior when running a single distributed CNN model across the Greenplum cluster using Keras with a TensorFlow backend
`so that`
I can see whether it converges in a predictable and expected way.
Details
- By "single distributed CNN model" I mean data parallel with merge (not model parallel) [6,7,8].
- In defining the merge function, review [1] for the single-server, multi-GPU merge function, or use the standard MADlib weighted-average approach (a minimal sketch follows this list).
- For the dataset, consider MNIST and/or CIFAR-10. A larger dataset like Places (http://places2.csail.mit.edu/) may also be useful.
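For concreteness, a minimal sketch of the weighted-average merge, assuming one Keras model replica per Greenplum segment and that each segment reports the number of rows it trained on; the function and variable names (`merge_weights`, `replicas`, `rows_per_segment`) are illustrative, not MADlib's actual implementation:

```python
def merge_weights(segment_weights, segment_rows):
    """Weighted average of per-segment model weights (data-parallel merge).

    segment_weights: one entry per segment, each the list of numpy arrays
                     returned by keras.Model.get_weights().
    segment_rows:    number of training rows seen by each segment; used as
                     the averaging weight.
    """
    total_rows = float(sum(segment_rows))
    merged = []
    # zip(*segment_weights) groups the corresponding layer tensor from every
    # segment; each group is averaged, weighted by that segment's data share.
    for layer_tensors in zip(*segment_weights):
        merged.append(sum(w * (n / total_rows)
                          for w, n in zip(layer_tensors, segment_rows)))
    return merged

# Hypothetical usage: after each pass over the data, pull weights from every
# segment replica, merge, and broadcast the merged weights back out:
#   merged = merge_weights([m.get_weights() for m in replicas], rows_per_segment)
#   for m in replicas:
#       m.set_weights(merged)
```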
Acceptance
1) Plot characteristic curves of loss vs. iteration number. Compare training with the MADlib merge (this story) vs. without it (see the plotting sketch after this list).
2) Define what the merge function is for a CNN. Is it the same as [1] or something else? Does it operate on weights only, or does it also need gradients?
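A sketch of the plot in item 1, assuming the per-iteration training losses for both runs have already been collected into Python lists (the variable and file names are placeholders):

```python
import matplotlib.pyplot as plt

def plot_convergence(loss_with_merge, loss_without_merge,
                     out_file="convergence.png"):
    """Plot loss vs. iteration for the merged and unmerged runs."""
    plt.plot(range(1, len(loss_with_merge) + 1), loss_with_merge,
             label="with MADlib merge")
    plt.plot(range(1, len(loss_without_merge) + 1), loss_without_merge,
             label="without merge")
    plt.xlabel("iteration")
    plt.ylabel("training loss")
    plt.legend()
    plt.savefig(out_file)
```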
References
[1] See the “# Merge outputs under expected scope” section in multi_gpu_utils.py:
https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
[2] Single-Machine Data-Parallel Multi-GPU Training
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
[3] Why are GPUs necessary for training Deep Learning models?
https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
[4] Deep Learning vs Classical Machine Learning
https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
[5] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
[6] Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf
- see Section 7.4.2 for a discussion of model averaging
[7] Deep learning with Elastic Averaging SGD
https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf
- uses momentum and Nesterov methods in the model-averaging computation
[8] Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-Block Parallel Optimization and Blockwise Model-Update Filtering
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/0005880.pdf
- similar to [7], uses momentum methods