Details
Type: Story
Status: Open
Priority: Major
Resolution: Unresolved
3.0.0
None
None
Description
To run hybrid distributed DL training jobs in barrier mode, we need to give users sufficient information and access to set up the job themselves, e.g., using MPI.
This ticket limits the scope of discussion to Spark + Kubernetes. There have been past and ongoing attempts from the Kubernetes community, so we should find someone with good knowledge of them to lead the discussion here.
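As a rough illustration of the kind of info a barrier task might need for an MPI launch, here is a minimal PySpark sketch. It assumes that the task addresses returned by `BarrierTaskContext.getTaskInfos()` are `host:port` strings and that an Open MPI style hostfile (`host slots=N`) is wanted. The helper names `build_mpi_hostfile` and `train_partition` are hypothetical, and the actual `mpirun` launch is left out.

```python
from collections import Counter


def build_mpi_hostfile(addresses):
    """Turn 'host:port' task addresses into Open MPI style hostfile lines.

    Hypothetical helper: assumes one MPI slot per barrier task.
    """
    # Drop the port; tasks on the same host share one hostfile line.
    hosts = [addr.rsplit(":", 1)[0] for addr in addresses]
    slots = Counter(hosts)
    # Keep first-seen host order so the output is deterministic.
    ordered = list(dict.fromkeys(hosts))
    return [f"{host} slots={slots[host]}" for host in ordered]


def train_partition(iterator):
    """Runs inside each barrier task (hypothetical function name)."""
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark import BarrierTaskContext

    ctx = BarrierTaskContext.get()
    # Addresses of all tasks in this barrier stage.
    addresses = [info.address for info in ctx.getTaskInfos()]
    hostfile = build_mpi_hostfile(addresses)
    # Wait until every task in the stage has reached this point.
    ctx.barrier()
    # A single task would normally launch mpirun with the hostfile; omitted.
    yield (ctx.partitionId(), hostfile)


# Usage (requires a running SparkContext `sc`):
#   sc.parallelize(range(4), 4).barrier().mapPartitions(train_partition).collect()
```

On Kubernetes these addresses would presumably refer to executor pods. Whether an MPI launcher can actually reach and start processes on them is part of what this ticket needs to discuss.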