Description
This was discovered when investigating https://issues.apache.org/jira/browse/SPARK-5838.
In short, when restarting a cluster that was launched with a non-default instance type, you have to provide the instance type(s) again in the "./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>" command. Otherwise the type silently falls back to the default, m1.large.
This then affects the setup of the machines.
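As a workaround until the fix lands, the instance types can be repeated explicitly on restart. A sketch of the invocation (the key file, region, cluster name, and m3.large type are placeholders; --instance-type and --master-instance-type are the standard spark-ec2 options for slave and master types):

```shell
# Restart the cluster, repeating the instance types used at launch.
# Without --instance-type, spark-ec2 assumes the default m1.large
# and sets the machines up for the wrong disk layout.
./spark-ec2 -i <key-file> --region=<ec2-region> \
  --instance-type=m3.large --master-instance-type=m3.large \
  start <cluster-name>
```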
I'll submit a pull request that takes care of this, so that the user does not need to provide the instance type(s) again.
EDIT:
Example case where this becomes a problem:
1. The user launches a cluster whose instances have one disk, e.g. m3.large.
2. The user stops the cluster.
3. The user restarts the cluster with the start command without providing the instance type, so the setup is performed for the default instance type, m1.large, which is assumed to have 2 disks.
4. SPARK_LOCAL_DIRS is then set to "/mnt/spark,/mnt2/spark". On an m3.large instance, /mnt2 corresponds to the snapshot partition, which is only 8 GB in size. When the user runs jobs that shuffle data, this partition fills up quickly, and the jobs fail with "No space left on device" errors.
Beyond this example, there are other cases where the machines end up set up incorrectly because they are assumed to be of type m1.large.
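The intended fix can be sketched as follows: before falling back to the default, read the instance type back from the cluster's existing (stopped) instances. The helper below is hypothetical (the name choose_instance_type and its signature are illustrative, not the actual spark_ec2.py code); it shows only the selection logic, assuming the caller has already fetched the instance-type strings from EC2:

```python
DEFAULT_INSTANCE_TYPE = "m1.large"  # spark-ec2's hard-coded default

def choose_instance_type(user_supplied, existing_types):
    """Pick the instance type to use when (re)starting a cluster.

    user_supplied  -- value of --instance-type, or None if omitted
    existing_types -- instance-type strings read back from the stopped
                      EC2 instances (empty list on a fresh launch)
    """
    if user_supplied is not None:
        # An explicit flag always wins.
        return user_supplied
    if existing_types:
        # Reuse the type the cluster was originally launched with,
        # instead of silently falling back to m1.large.
        return existing_types[0]
    return DEFAULT_INSTANCE_TYPE
```

With logic like this, step 3 of the example above would resolve to m3.large, and the setup would configure SPARK_LOCAL_DIRS for a single ephemeral disk.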
Issue Links
- relates to SPARK-5838 "Changing SPARK_LOCAL_DIRS option in spark-env.sh does not take effect without daemon restart" (Resolved)