Description
This was discovered when investigating https://issues.apache.org/jira/browse/SPARK-5838.
In short, when restarting a cluster that was launched with a non-default instance type, you have to provide the instance type(s) again in the "./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>" command. Otherwise the type silently falls back to the default, m1.large.
This then affects the setup of the machines.
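As a workaround until the fix lands, the instance types can be repeated explicitly on restart. A sketch of the invocation (the key file, region, cluster name, and m3.large type are placeholders; --instance-type and --master-instance-type are the standard spark-ec2 options for slave and master types):

```shell
# Restart the cluster, repeating the instance types used at launch.
# Without --instance-type, spark-ec2 assumes the default m1.large
# and sets the machines up for the wrong disk layout.
./spark-ec2 -i <key-file> --region=<ec2-region> \
  --instance-type=m3.large --master-instance-type=m3.large \
  start <cluster-name>
```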
I'll submit a pull request that takes care of this, so that the user does not need to provide the instance type(s) again.
EDIT:
Example case where this becomes a problem:
1. The user launches a cluster whose instances have one disk, e.g. m3.large.
2. The user stops the cluster.
3. The user restarts the cluster with the start command without providing the instance type, so the setup is performed for the default instance type, m1.large, which is assumed to have 2 disks.
4. SPARK_LOCAL_DIRS is then set to "/mnt/spark,/mnt2/spark". On an m3.large instance, /mnt2 corresponds to the snapshot partition, which is only 8 GB in size. When the user runs jobs that shuffle data, this partition fills up quickly, and the jobs fail with "No space left on device" errors.
Beyond this example, there are other cases where the machines end up set up incorrectly because they are assumed to be of type m1.large.
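The intended fix can be sketched as follows: before falling back to the default, read the instance type back from the cluster's existing (stopped) instances. The helper below is hypothetical (the name choose_instance_type and its signature are illustrative, not the actual spark_ec2.py code); it shows only the selection logic, assuming the caller has already fetched the instance-type strings from EC2:

```python
DEFAULT_INSTANCE_TYPE = "m1.large"  # spark-ec2's hard-coded default

def choose_instance_type(user_supplied, existing_types):
    """Pick the instance type to use when (re)starting a cluster.

    user_supplied  -- value of --instance-type, or None if omitted
    existing_types -- instance-type strings read back from the stopped
                      EC2 instances (empty list on a fresh launch)
    """
    if user_supplied is not None:
        # An explicit flag always wins.
        return user_supplied
    if existing_types:
        # Reuse the type the cluster was originally launched with,
        # instead of silently falling back to m1.large.
        return existing_types[0]
    return DEFAULT_INSTANCE_TYPE
```

With logic like this, step 3 of the example above would resolve to m3.large, and the setup would configure SPARK_LOCAL_DIRS for a single ephemeral disk.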
Issue Links
- relates to SPARK-5838 "Changing SPARK_LOCAL_DIRS option in spark-env.sh does not take effect without daemon restart" (Resolved)