Please find the full document for discussion here: Spark Connect SPIP. Below, we reproduce only the introduction.
While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and the widespread use of many programming languages, poses a number of limitations. Most of the limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:
(1) Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to connect remotely to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project Apache Livy.
(2) Lack of a rich developer experience: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks), nor do they allow for building the rich developer experience common in modern code editors.
(3) Stability: with the current shared driver architecture, a user who causes a critical exception (e.g. OOM) brings the whole cluster down for all users.
(4) Upgradability: the current entangling of platform and client APIs (e.g. first- and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions, and with that hinders the adoption of new features.
We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: The decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations.
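The split described above can be illustrated with a toy sketch: the client only builds a description of the query (an unresolved plan) and ships it to a server, which owns the data and runs it. Everything below is hypothetical and made up for illustration, including the names `LazyFrame` and `execute` and the JSON plan format. It is not the Spark Connect protocol or client API, only a minimal picture of the idea under those assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical client-side DataFrame: it holds a plan and never touches data."""
    plan: dict

    @staticmethod
    def read(table):
        # Leaf node: refers to a table by name only; the client cannot resolve it.
        return LazyFrame({"op": "read", "table": table})

    def filter(self, column, equals):
        # Each transformation wraps the previous plan as its child.
        return LazyFrame({"op": "filter", "column": column,
                          "equals": equals, "child": self.plan})

    def select(self, *columns):
        return LazyFrame({"op": "project", "columns": list(columns),
                          "child": self.plan})

    def to_wire(self):
        # Assumed wire format (JSON) chosen only to keep the sketch self-contained.
        return json.dumps(self.plan)


def execute(wire_plan, tables):
    """Hypothetical server side: resolves table names and evaluates the plan."""
    def run(node):
        op = node["op"]
        if op == "read":
            return list(tables[node["table"]])
        rows = run(node["child"])
        if op == "filter":
            return [r for r in rows if r[node["column"]] == node["equals"]]
        if op == "project":
            return [{c: r[c] for c in node["columns"]} for r in rows]
        raise ValueError(f"unknown operator: {op}")

    return run(json.loads(wire_plan))


# The "server" owns the data; the "client" only ever produces a string.
tables = {"people": [{"name": "Ada", "team": "core"},
                     {"name": "Bo", "team": "sql"}]}
request = LazyFrame.read("people").filter("team", "core").select("name").to_wire()
result = execute(request, tables)
```

Because the client side produces nothing but a serialized plan, it needs no access to the cluster's runtime or classpath, which is the property the proposal relies on for remote connectivity and independent upgrades.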
Spark Connect will benefit Spark developers in several ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture, because client code no longer runs inside the Spark runtime environment.
Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.
Issues in epic
|Key|Summary|Status|Assignee|
|SPARK-41282|Feature parity: Column API in Spark Connect|Reopened|Rui Wang|
|SPARK-41283|Feature parity: functions API in Spark Connect|Open|Xinrong Meng|
|SPARK-41279|Feature parity: DataFrame API in Spark Connect|Open|Ruifeng Zheng|
|SPARK-41281|Feature parity: SparkSession API in Spark Connect|Open|Ruifeng Zheng|
|SPARK-41284|Feature parity: I/O in Spark Connect|Open|Rui Wang|
|SPARK-41289|Feature parity: Catalog API|Open|Rui Wang|
|SPARK-41286|Build, package and infrastructure for Spark Connect|Open|Hyukjin Kwon|
|SPARK-40451|Type annotations for Spark Connect Python client|Open|Unassigned|
|SPARK-41285|Test basework and improvement of test coverage in Spark Connect|Open|Hyukjin Kwon|
|SPARK-41288|Server-specific improvement, error handling and API|Open|Martin Grund|
|SPARK-41305|Connect Proto Completeness|Reopened|Rui Wang|
|SPARK-41386|There are some small files when using rebalance(column)|Open|Unassigned|