Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27941

Serverless Spark in the Cloud




      Public cloud providers have started offering serverless container services. For example, AWS offers Fargate https://aws.amazon.com/fargate/

      This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision, maintain and manage a cluster. POC: https://github.com/mu5358271/spark-on-fargate

      While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively, it would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, 

      • authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and decrypting the secret at each executor, which delegate authentication and authorization to the cloud.
      • deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package and rebuilding the entire project. Having a pluggable scheduler per https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add different scheduler backends backed by different cloud providers.
      • client-cluster communication: I am not very familiar with the network part of the code base so I might be wrong on this. My understanding is that the code base assumes that the client and the cluster are on the same network and the nodes communicate with each other via hostname/ip. For security best practice, it is advised to run the executors in a private protected network, which may be separate from the client machine's network. Since we are serverless, that means the client need to first launch the driver into the private network, and the driver in turn start the executors, potentially doubling job initialization time. This can be solved by dropping complete serverlessness and having a persistent host in the private network, or (I do not have a POC, so I am not sure if this actually works) by implementing client-cluster communication via message queues in the cloud to get around the network separation.
      • shuffle storage and retrieval: external shuffle in yarn relies on the existence of a persistent cluster that continues to serve shuffle files beyond the lifecycle of the executors. This assumption no longer holds in a serverless cluster with only transient containers. Pluggable remote shuffle storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it easier to introduce new cloud-backed shuffle.


        Issue Links



              Unassigned Unassigned
              mu5358271 Shuheng Dai
              5 Vote for this issue
              25 Start watching this issue