Large, highly multi-tenant Spark-on-YARN clusters running a diverse job mix often show terrible utilization rates (we have observed as low as 3-7% CPU at maximum container allocation, and 50% CPU utilization is not uncommon even on a well-policed cluster).
As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 users, and 50,000 runs of 1,000 distinct applications per week, predominantly Spark with a mixture of ETL, ad hoc tasks, and PySpark notebook jobs (no streaming).
Utilization problems appear to be due in large part to persist() blocks (DISK or DISK+MEMORY storage levels) preventing dynamic deallocation of the executors that hold them.
Where an external shuffle service is present (typical on clusters of this type), we already solve this problem for shuffle blocks by offloading their I/O handling to the external service, allowing dynamic deallocation to proceed.
Allowing executors to transfer persist() blocks to some external "shuffle" service in a similar manner would be an enormous win for Spark multi-tenancy, as it would limit deallocation blocking to MEMORY-only cache() scenarios.
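For context, the shuffle-block half of this already works today with standard configuration; the asymmetry is visible in the settings themselves. A minimal sketch (these are real Spark config names; the values are illustrative):

```
# spark-defaults.conf -- shuffle-based dynamic deallocation as it works today
spark.shuffle.service.enabled                      true
spark.dynamicAllocation.enabled                    true
# Executors holding only shuffle data can be released after this timeout,
# because the external service keeps serving their shuffle blocks.
spark.dynamicAllocation.executorIdleTimeout        60s
# Executors holding persist() blocks are governed by this separate timeout,
# which defaults to infinity -- so by default they are never released.
spark.dynamicAllocation.cachedExecutorIdleTimeout  600s
```

Setting cachedExecutorIdleTimeout to a finite value is the current workaround, but it discards the persisted blocks rather than preserving them, which is exactly what an external persist() service would avoid.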
I may be misremembering, but I seem to recall from the original external shuffle service commits that this was considered at the time, with moving shuffle blocks to the external service being the first priority.
With support for external persist() DISK blocks in place, we could then also handle deallocation for DISK+MEMORY: the in-memory copy could first be dropped, demoting the block to DISK only, and the block could then be transferred to the shuffle service.
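The proposed demote-then-offload flow can be sketched as a toy model (this is not Spark code; all names here are invented for illustration):

```python
# Toy model of the proposed flow for releasing an executor that holds
# DISK+MEMORY persist() blocks: drop the in-memory copy first, then hand
# the remaining DISK copy to the external service so the executor can exit.
from dataclasses import dataclass, field

@dataclass
class Block:
    block_id: str
    in_memory: bool
    on_disk: bool

@dataclass
class ExternalService:
    # Blocks the external service now owns and can serve to other executors.
    offloaded: dict = field(default_factory=dict)

    def take_ownership(self, block: Block) -> None:
        # Only a DISK-only block can be offloaded in this model.
        assert block.on_disk and not block.in_memory
        self.offloaded[block.block_id] = block

def release_executor(blocks: list, service: ExternalService) -> None:
    for b in blocks:
        if b.in_memory and b.on_disk:
            b.in_memory = False          # step 1: demote to DISK only
        if b.on_disk and not b.in_memory:
            service.take_ownership(b)    # step 2: offload the DISK copy

svc = ExternalService()
blocks = [Block("rdd_7_0", in_memory=True, on_disk=True),
          Block("rdd_7_1", in_memory=False, on_disk=True)]
release_executor(blocks, svc)
print(sorted(svc.offloaded))  # both partitions survive executor release
```

The point of the two-step ordering is that MEMORY_AND_DISK blocks need no new mechanism beyond the DISK-only case: demotion reduces them to a problem the external service already solves.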
We have tried to resolve the persist() issue via extensive user training, but that has typically only improved the worst offenders from around 10% utilization up to around 40-60%, as the need for persist() is often legitimate and occurs during the middle stages of a job.
In a healthy multi-tenant scenario, a large job might spool up to, say, 10,000 cores, persist() data, release executors across a long tail down to 100 cores, and then spool back up to 10,000 cores for the following stage without any impact on the persisted data.
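That pattern maps directly onto dynamic allocation bounds. A sketch of what the configuration for such a job might look like (real Spark config names; the values and the 4-cores-per-executor assumption are examples only):

```
# Illustrative dynamic allocation bounds for the spool-up / long-tail pattern
spark.executor.cores                         4
spark.dynamicAllocation.minExecutors         25     # long tail: ~100 cores
spark.dynamicAllocation.maxExecutors         2500   # peak: ~10,000 cores
spark.dynamicAllocation.executorIdleTimeout  60s    # shed idle executors quickly
```

Today this only works as intended when no executor in the long tail is pinned by persist() blocks; with external persist() support, the contraction to minExecutors could proceed regardless.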
In an ideal world, if a new executor started up on a node on which blocks had been transferred to the shuffle service, the new executor might even be able to "recapture" control of those blocks (if that would help with performance in some way).
And the behavior of gradually expanding and contracting several times over the course of a job would not just improve utilization; it would also allow resources to be redistributed more easily to other jobs that start on the cluster during the long-tail periods, improving multi-tenancy and bringing us closer to optimal "envy-free" YARN scheduling.