Description
spark-ec2 currently has retry logic for installing software on a cluster and for deleting security groups.
It would be better to have logic that lets spark-ec2 explicitly wait until all the nodes in the cluster it is working on have reached a specific state.
Examples:
- Wait for all nodes to be up
- Wait for all nodes to be up and accepting SSH connections (then start installing stuff)
- Wait for all nodes to be down
- Wait for all nodes to be terminated (then delete the security groups)
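The "up and accepting SSH connections" case needs a liveness probe in addition to the EC2 instance state. A minimal sketch of such a probe is below; it only checks that the SSH port accepts TCP connections (the real script would shell out to `ssh` to confirm authentication works, so the function name and scope here are assumptions for illustration):

```python
import socket

def is_tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A cheap first-level check that sshd is reachable; it does not verify
    that SSH authentication would actually succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (socket.timeout, OSError):
        return False
```

A caller would loop over the cluster's public DNS names and only begin installation once this returns True for every node.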
Having a function in the spark_ec2.py script that blocks until the desired cluster state is reached would reduce the need for various retry logic. It would probably also eliminate the need for the --wait parameter.
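Such a blocking function could be sketched as follows. The name `wait_for_cluster_state` and the callback-based design are assumptions, not the actual spark_ec2.py API; `get_states` stands in for whatever fetches per-node state (e.g. boto instance status or the SSH probe above):

```python
import time

def wait_for_cluster_state(get_states, desired_state, timeout=300, poll_interval=5):
    """Block until every node reports `desired_state`, or raise on timeout.

    `get_states` is a zero-argument callable returning the current state
    string for each node in the cluster. Centralizing the polling here
    replaces the per-operation retry loops scattered through the script.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        states = get_states()
        # Require a non-empty cluster where every node matches.
        if states and all(s == desired_state for s in states):
            return states
        time.sleep(poll_interval)
    raise RuntimeError(
        "cluster did not reach state %r within %d seconds"
        % (desired_state, timeout))
```

With this in place, "wait for all nodes to be terminated" before deleting security groups becomes a single call with `desired_state="terminated"` rather than a retry loop around the delete.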
Issue Links
- incorporates
  - SPARK-1751: spark-ec2 scripts should check for SSH to be up (Resolved)
- is related to
  - SPARK-1574: ec2/spark_ec2.py should provide option to control number of attempts for SSH operations (Resolved)
  - SPARK-5473: Expose SSH failures after status checks pass (Resolved)