Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Done
- Fix Version: 3.0.0
Description
Currently we offer context.barrier() to coordinate tasks in barrier execution mode, and tasks can see all task IP addresses from BarrierTaskContext. Integrating with distributed frameworks such as TensorFlow's DistributionStrategy would be simpler if we also provided an all-gather operation that lets each task share additional information with the others, e.g., an available port.
Note that with all gather, tasks share their IP addresses as well.
    port = ...                        # get an available port
    ports = context.all_gather(port)  # get all available ports, ordered by task ID
    ...                               # set up distributed training service
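To illustrate the semantics the sketch above relies on, here is a minimal, self-contained stand-in for the proposed all-gather primitive, written with plain Python threads rather than Spark: each of N tasks contributes one message, all tasks block until every message has arrived, and every task then receives the same list ordered by task ID. The class and function names (AllGather, run_demo) are hypothetical and exist only for this demonstration; they are not part of the Spark API.

```python
import threading

class AllGather:
    """Hypothetical stand-in for an all-gather over N barrier tasks."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.messages = [None] * num_tasks
        self.barrier = threading.Barrier(num_tasks)

    def all_gather(self, task_id, message):
        # Record this task's message, then wait until all tasks have arrived.
        self.messages[task_id] = message
        self.barrier.wait()
        # Every task gets the same list, ordered by task ID.
        return list(self.messages)

def run_demo(num_tasks=4):
    coordinator = AllGather(num_tasks)
    results = [None] * num_tasks

    def task(task_id):
        port = 5000 + task_id  # pretend each task picked a free local port
        results[task_id] = coordinator.all_gather(task_id, str(port))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

After run_demo(4), every task holds the identical ordered list ["5000", "5001", "5002", "5003"], which is exactly the property a distributed training service needs to wire up all workers.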
Attachments
Issue Links
- causes: SPARK-31784 Fix test BarrierTaskContextSuite."share messages with allGather() call" (Resolved)
- links to