[SPARK-5189] Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: EC2
Labels:
None

Description

As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:

Broadcasting files from the master to all slaves using copy-dir (e.g. during ephemeral-hdfs init, or during Spark setup) takes a long time. This time increases as the number of slaves increases.
I did some testing in us-east-1. This is, concretely, what the problem looks like:

number of slaves (m3.large) launch time (best of 6 tries)

1 8m 44s

10 13m 45s

25 22m 50s

50 37m 30s

75 51m 30s

99 1h 5m 30s

Unfortunately, I couldn't report on 100 slaves or more due to ~~SPARK-6246~~, but I think the point is clear enough.
We can extrapolate from this data that every additional slave adds roughly 35 seconds to the launch time (so a cluster with 100 slaves would take 1h 6m 5s to launch).
It's more complicated to add slaves to an existing cluster (a la ~~SPARK-2008~~), since slaves are only configured through the master during the setup of the master itself.

Logically, the operations we want to implement are:

Provision a Spark node
Join a node to a cluster (including an empty cluster) as either a master or a slave
Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The goals would be:

When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck.
Facilitate cluster modifications like adding or removing nodes.
Enable exploration of infrastructure tools like Terraform that might simplify spark-ec2 internals and perhaps even allow us to build one tool that launches Spark clusters on several different cloud platforms.

More concretely, the modifications we need to make are:

Replace all occurrences of copy-dir or rsync-to-slaves with equivalent, slave-side operations.
Repurpose setup-slave.sh as provision-spark-node.sh and make sure it fully creates a node that can be used as either a master or slave.
Create a new script, join-to-cluster.sh, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster.
Move any remaining logic in setup.sh up to spark_ec2.py and delete that script.

Attachments

Issue Links

blocks

SPARK-2008 Enhance spark-ec2 to be able to add and remove slaves to an existing cluster

Resolved

Is contained by

SPARK-4325 Improve spark-ec2 cluster launch times

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Nicholas Chammas

Votes:: 2 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 10/Jan/15 19:42

Updated:: 27/Jan/16 13:25

Resolved:: 27/Jan/16 10:13

number of slaves (`m3.large`)	launch time (best of 6 tries)
1	8m 44s
10	13m 45s
25	22m 50s
50	37m 30s
75	51m 30s
99	1h 5m 30s