I sent an email to the dev mailing list but got no response, so I am filing a JIRA ticket.
PBS (Portable Batch System) Professional is an open source workload management system for HPC clusters. Many organizations that manage their clusters with PBS also use Spark for big data workloads, but they are forced to split the cluster into a Spark cluster and a PBS cluster. They do this either by physically dividing the cluster nodes into two groups or by starting the Spark Standalone cluster manager's Master and Slaves as PBS jobs, and both approaches lead to underutilization of resources.
I am trying to add support in Spark for using PBS as a pluggable cluster manager. After going through the Spark codebase and looking at the Mesos and Kubernetes integrations, I found that we can get this working as follows:
- Extend `ExternalClusterManager`.
- Extend `CoarseGrainedSchedulerBackend`.
  - This class can start `Executors` as PBS jobs.
  - The initial number of `Executors` is started in `onStart`.
  - More `Executors` can be started as and when required using `doRequestTotalExecutors`.
  - `Executors` can be killed using `doKillExecutors`.
- Extend `SparkApplication` to start the `Driver` as a PBS job in cluster deploy mode.
  - This extended class can resubmit the Spark application as a PBS job with deploy mode = client, so that the application driver is started on a node in the cluster.
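
To make the proposed lifecycle concrete, here is a minimal, language-neutral sketch in Python. It is only a model of the idea, not Spark code: `PbsBackendModel`, `fake_runner`, `cluster_mode_resubmit`, the `pbs` master URL, the resource request, and the example driver URL are all assumptions made for illustration. The method names mirror the ones in the list above, and the executor argument list is illustrative and incomplete. The model only builds `qsub`/`qdel` command lines and hands them to an injected runner, so it can be tried without a PBS installation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical runner type: takes an argv list and returns the command's stdout.
Runner = Callable[[List[str]], str]

EXECUTOR_MAIN = "org.apache.spark.executor.CoarseGrainedExecutorBackend"


@dataclass
class PbsBackendModel:
    """Toy model of the proposed scheduler backend (not Spark's actual API)."""
    driver_url: str
    app_id: str
    cores: int
    run: Runner
    jobs: Dict[str, str] = field(default_factory=dict)  # executor id -> PBS job id
    next_id: int = 0

    def _executor_command(self, executor_id: str) -> List[str]:
        # `qsub -- <executable> <args>` submits a command directly as a job.
        # The resource request and argument list are assumptions; a real
        # launch would need more arguments than shown here.
        return [
            "qsub", "-N", f"spark-exec-{executor_id}",
            "-l", f"select=1:ncpus={self.cores}",
            "--", "spark-class", EXECUTOR_MAIN,
            "--driver-url", self.driver_url,
            "--executor-id", executor_id,
            "--cores", str(self.cores),
            "--app-id", self.app_id,
        ]

    def on_start(self, initial_executors: int) -> None:
        # The initial executors are requested when the backend starts.
        self.do_request_total_executors(initial_executors)

    def do_request_total_executors(self, total: int) -> bool:
        # Submit one PBS job per missing executor; this model never scales down here.
        while len(self.jobs) < total:
            executor_id = str(self.next_id)
            self.next_id += 1
            job_id = self.run(self._executor_command(executor_id)).strip()
            self.jobs[executor_id] = job_id
        return True

    def do_kill_executors(self, executor_ids: List[str]) -> bool:
        # Killing an executor maps to deleting its PBS job; unknown ids are ignored.
        for executor_id in executor_ids:
            job_id = self.jobs.pop(executor_id, None)
            if job_id is not None:
                self.run(["qdel", job_id])
        return True


def cluster_mode_resubmit(app_args: List[str]) -> List[str]:
    # Cluster deploy mode: wrap spark-submit in a PBS job and switch to client
    # mode so the driver runs on a cluster node. "pbs" is a hypothetical master URL.
    return ["qsub", "-N", "spark-driver", "--",
            "spark-submit", "--master", "pbs", "--deploy-mode", "client", *app_args]


def fake_runner():
    # Stand-in for a real PBS client: records commands and invents job ids.
    log: List[List[str]] = []

    def run(argv: List[str]) -> str:
        log.append(argv)
        return f"{len(log)}.pbs-server\n" if argv[0] == "qsub" else ""

    return run, log


if __name__ == "__main__":
    run, log = fake_runner()
    backend = PbsBackendModel(
        "spark://CoarseGrainedScheduler@driver-host:7077", "app-0001", 2, run)
    backend.on_start(2)
    backend.do_request_total_executors(3)
    backend.do_kill_executors(["1"])
    for argv in log:
        print(" ".join(argv))
```

Injecting the runner keeps the PBS interaction behind a single seam, so a prototype could swap the fake for a real `subprocess` call, or later for a PBS API client, without touching the scheduling logic.
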
I have a couple of questions:
- Does this seem like a good approach, or should we look at other options?
- What are the expectations for the initial prototype?
- If this works, would the Spark maintainers be open to merging it, or would they prefer that it be maintained as a fork?