SPARK-5497: start-all script not working properly on Standalone HA cluster (with Zookeeper)


      Description

      I have configured a Standalone HA cluster with Zookeeper, consisting of:

      • 3 Zookeeper nodes
      • 2 Spark master nodes (1 alive and 1 in standby mode)
      • 2 Spark slave nodes
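
      For reference, the HA part of the setup is the standard ZooKeeper recovery configuration in spark-env.sh on both master nodes (a minimal sketch; zk1/zk2/zk3 are placeholder hostnames for the three Zookeeper nodes):

        # spark-env.sh (both master nodes): enable ZooKeeper-based recovery
        export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
          -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
          -Dspark.deploy.zookeeper.dir=/spark"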

      Executing start-all.sh on a master starts that master and a worker on each configured slave.
      If the active master goes down, those workers are supposed to reconnect to the new active master automatically.

      I have noticed that the spark-env.sh property SPARK_MASTER_IP is used by both scripts that start-all.sh calls, start-master.sh and start-slaves.sh.

      The problem is that if you set SPARK_MASTER_IP to the active master's IP, workers do not reassign themselves to the new active master when it goes down.
      And if you set SPARK_MASTER_IP to the list of master addresses (an approximation, really, because you have to write the master port on all but the last entry, that is "master1:7077,master2", to make it work), the slaves start properly but the master does not.

      So the start-master.sh script needs SPARK_MASTER_IP to contain its own IP in order to start the master properly, while the start-slaves.sh script needs SPARK_MASTER_IP to contain the master-cluster addresses (that is, "master1:7077,master2").
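
      The conflict comes from how the two scripts expand the variable. The following is a simplified sketch of the relevant lines (the exact code differs between Spark versions); it also shows why the port has to be written on all but the last address: start-slaves.sh appends ":$SPARK_MASTER_PORT" only once, at the end.

        # sbin/start-master.sh (simplified): expects a single, bindable address
        "$sbin"/spark-daemon.sh start org.apache.spark.deploy.master.Master 1 \
          --ip $SPARK_MASTER_IP --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT

        # sbin/start-slaves.sh (simplified): builds one master URL for the workers, so
        # SPARK_MASTER_IP="master1:7077,master2" expands to spark://master1:7077,master2:7077
        "$sbin"/slaves.sh cd "$SPARK_HOME" \; \
          "$sbin"/start-slave.sh 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"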

      To test that idea, I have modified the start-slaves.sh and spark-env.sh scripts on the master nodes.
      In spark-env.sh, I have set SPARK_MASTER_IP to the master's own IP on each master node (that is, on master node 1, SPARK_MASTER_IP=master1; and on master node 2, SPARK_MASTER_IP=master2).
      In spark-env.sh, I have also added a new property, SPARK_MASTER_CLUSTER_IP, containing the master-cluster addresses (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) on both masters.
      In start-slaves.sh, I have changed all references to SPARK_MASTER_IP to SPARK_MASTER_CLUSTER_IP.
      I have tried that and it works great: when the active master node goes down, all workers reassign themselves to the new active master.
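
      Concretely, the quick-fix looks roughly like this (SPARK_MASTER_CLUSTER_IP is the new, non-standard variable proposed here; master1/master2 are the two master hostnames):

        # spark-env.sh on master node 1 (master node 2 uses SPARK_MASTER_IP=master2)
        export SPARK_MASTER_IP=master1                       # consumed only by start-master.sh
        export SPARK_MASTER_CLUSTER_IP=master1:7077,master2  # consumed only by start-slaves.sh

        # start-slaves.sh: point workers at the whole master cluster instead of one master
        "$sbin"/slaves.sh cd "$SPARK_HOME" \; \
          "$sbin"/start-slave.sh 1 "spark://$SPARK_MASTER_CLUSTER_IP:$SPARK_MASTER_PORT"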

      Maybe there is a better fix for this issue.
      Hope this quick-fix idea can help.

      People

      • Assignee: Unassigned
      • Reporter: Roque Vassal'lo (zujorv)