Details
Story
Status: Resolved
Major
Resolution: Resolved
3.0.0
None
None
Description
In barrier mode, users who want to run hybrid distributed DL training jobs need sufficient information and access to set up the job, e.g., using MPI.
This ticket limits the scope of discussion to Spark + Standalone. For MPI, what we need is password-less SSH access among workers. We might also consider other distributed frameworks, such as distributed TensorFlow and H2O.
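As a rough illustration of the kind of information an MPI setup needs, the sketch below turns a list of worker task addresses into Open MPI hostfile lines. It is an assumption-laden example, not the design agreed in this ticket: `build_hostfile` is a hypothetical helper, the `host:port` address format is assumed, and the addresses are made up. In a real barrier stage the addresses might come from `BarrierTaskContext.get().getTaskInfos()`, and launching across hosts would still require the password-less SSH access described above.

```python
from collections import Counter


def build_hostfile(addresses):
    """Convert task addresses ("host:port") into Open MPI hostfile lines.

    Hypothetical helper: assumes each address ends with ":port".
    """
    # Keep only the host part; rsplit splits on the last colon only.
    hosts = [addr.rsplit(":", 1)[0] for addr in addresses]
    # Count tasks per host so each host gets one slot per task.
    slots = Counter(hosts)
    # Emit one line per distinct host, in first-seen order.
    return [f"{host} slots={slots[host]}" for host in dict.fromkeys(hosts)]


# Made-up addresses standing in for what a barrier stage might report.
example_addresses = ["worker1:40001", "worker2:40002", "worker1:40003"]
hostfile_lines = build_hostfile(example_addresses)
print("\n".join(hostfile_lines))
```

The resulting lines could be written to a file and passed to `mpirun --hostfile <path>` by one designated task, while the remaining tasks wait at the barrier.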