Details
Story
Status: Open
Major
Resolution: Unresolved
3.0.0
None
None
Description
To run hybrid distributed deep learning (DL) training jobs in barrier mode, we need to give users sufficient information and access to set up such a job themselves, e.g., using MPI.
This ticket limits the scope of discussion to Spark + Mesos. I'm not aware of MPI support in Mesos, so we should find someone with good knowledge of Mesos to lead the discussion here.
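As a starting point for the discussion, here is a minimal sketch of the kind of "info" a barrier task could use to set up an MPI job: it turns the task addresses that `BarrierTaskContext.getTaskInfos()` exposes into Open MPI hostfile lines. The helper name `build_mpi_hostfile` and the `train` function in the comments are hypothetical, and the sketch assumes addresses have the form `host:port`. Whether this information is sufficient on Mesos is exactly the open question of this ticket.

```python
from collections import Counter


def build_mpi_hostfile(task_addresses):
    """Turn barrier task addresses ("host:port") into Open MPI hostfile text.

    Hypothetical helper: one "<host> slots=<n>" line per host, where n is the
    number of barrier tasks scheduled on that host.
    """
    # Drop the port; rsplit keeps any colons inside the host part intact.
    hosts = [addr.rsplit(":", 1)[0] for addr in task_addresses]
    # Counter keeps first-seen order (Python 3.7+), so output is deterministic.
    slots = Counter(hosts)
    return "\n".join(f"{host} slots={n}" for host, n in slots.items())


# Possible use inside a barrier stage (requires pyspark; not executed here):
#
# from pyspark import BarrierTaskContext
#
# def train(iterator):
#     ctx = BarrierTaskContext.get()
#     addresses = [info.address for info in ctx.getTaskInfos()]
#     ctx.barrier()  # wait until every task in the stage has reached this point
#     if ctx.partitionId() == 0:
#         hostfile = build_mpi_hostfile(addresses)
#         # Launching mpirun with this hostfile is an assumption; on Mesos the
#         # tasks may lack the access (e.g. SSH between agents) that MPI needs.
#     yield ctx.partitionId()
#
# rdd.barrier().mapPartitions(train).collect()
```

The hostfile only covers the "info" half of the requirement. The "access" half (how processes on different Mesos agents reach each other) is not addressed by this sketch.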