SPARK-39375

SPIP: Spark Connect - A client and server interface for Apache Spark


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Connect

    Description

      Please find the full document for discussion here: Spark Connect SPIP. Below, we have referenced just the introduction.

      What are you trying to do?

      While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of the limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:

      (1) Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project Apache Livy.

      (2) Lack of a rich developer experience: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks) or allow for building out the rich developer experience common in modern code editors.

      (3) Stability: with the current shared driver architecture, users causing critical exceptions (e.g. OOM) bring the whole cluster down for all users.

      (4) Upgradability: the current entangling of platform and client APIs (e.g. first- and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions (and with that, hinders new feature adoption).


      We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: The decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations. 
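The client/server split the proposal describes can be illustrated with a toy sketch. This is plain Python and not the actual Spark Connect API or wire format; all class and method names below (`Plan`, `RemoteDataFrame`, `ToyServer`) are hypothetical. The point is the shape of the idea: the client only composes an unresolved plan description, and nothing executes until the serialized plan is handed to a server that resolves and runs it.

```python
from dataclasses import dataclass, field
import json

# Toy illustration of the Spark Connect idea (NOT the real API):
# the client builds an unresolved logical plan and defers all
# execution to a server process that resolves and runs it.

@dataclass
class Plan:
    op: str                      # e.g. "read", "filter"
    args: dict = field(default_factory=dict)
    child: "Plan | None" = None

    def to_wire(self) -> str:
        """Serialize the plan tree (stand-in for a protobuf message)."""
        node = {"op": self.op, "args": self.args}
        if self.child:
            node["child"] = json.loads(self.child.to_wire())
        return json.dumps(node)

class RemoteDataFrame:
    """Thin client: composes plans, never touches data itself."""
    def __init__(self, plan: Plan, server):
        self._plan, self._server = plan, server

    def filter(self, column: str, value) -> "RemoteDataFrame":
        # Lazily extend the unresolved plan; no execution happens here.
        return RemoteDataFrame(
            Plan("filter", {"column": column, "value": value}, self._plan),
            self._server)

    def collect(self):
        # Only now does anything cross the client/server boundary.
        return self._server.execute(self._plan.to_wire())

class ToyServer:
    """Server side: resolves the wire-format plan against real data."""
    def __init__(self, tables):
        self._tables = tables

    def execute(self, wire: str):
        return self._run(json.loads(wire))

    def _run(self, node):
        if node["op"] == "read":
            return list(self._tables[node["args"]["table"]])
        if node["op"] == "filter":
            rows = self._run(node["child"])
            a = node["args"]
            return [r for r in rows if r[a["column"]] == a["value"]]
        raise ValueError(f"unknown op: {node['op']}")

server = ToyServer({"users": [{"name": "ada", "active": True},
                              {"name": "bob", "active": False}]})
df = RemoteDataFrame(Plan("read", {"table": "users"}), server)
active = df.filter("active", True).collect()
print(active)  # [{'name': 'ada', 'active': True}]
```

In the actual proposal the serialized plan is a protocol buffer and the transport is gRPC (see the sub-tasks below); the JSON encoding here only stands in for that. The design consequence is the one the SPIP calls out: a client crash or OOM stays in the client process, and the wire protocol, not a shared classpath, is the compatibility boundary between client and server versions.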


      Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.


      Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.

      Sub-Tasks

        1. High-Level design doc for Spark Connect (Sub-task, Open, Unassigned)
        2. Initial protobuf definition for Spark Connect API (Sub-task, Open, Unassigned)
        3. Prototype implementation (Sub-task, Resolved, Martin Grund)
        4. Extend test coverage of Analyzer (Sub-task, Open, Unassigned)
        5. Extend test coverage for Spark Connect Python client (Sub-task, Open, Unassigned)
        6. Type annotations for Spark Connect Python client (Sub-task, Open, Unassigned)
        7. Developer documentation (Sub-task, Open, Unassigned)
        8. Improve error handling for gRPC server (Sub-task, Open, Unassigned)
        9. Initial DSL framework for protobuf testing (Sub-task, Resolved, Rui Wang)
        10. Add `CONNECT` label (Sub-task, Resolved, Hyukjin Kwon)
        11. Python version for UDF should follow the server's version (Sub-task, Open, Unassigned)
        12. Extend type support for Spark Connect literals (Sub-task, Resolved, Martin Grund)
        13. Extend support for Join Relation (Sub-task, Resolved, Rui Wang)
        14. Make Spark Connect port configurable (Sub-task, Resolved, Martin Grund)
        15. Re-enable mypy support (Sub-task, Resolved, Martin Grund)
        16. Add missing PySpark functions to Spark Connect (Sub-task, Resolved, Martin Grund)
        17. PySpark readwriter API parity for Spark Connect (Sub-task, Resolved, Rui Wang)
        18. Homogenize built-in function names for binary operators like + - / (Sub-task, Open, Unassigned)
        19. Re-generate Spark Connect Python protos (Sub-task, Resolved, Martin Grund)
        20. Decouple plan transformation and validation on server side (Sub-task, Open, Unassigned)
        21. SELECT * shouldn't be empty project list in proto (Sub-task, Resolved, Rui Wang)
        22. protoc-3.21.1-linux-x86_64.exe requires GLIBC_2.14 (Sub-task, Resolved, Yang Jie)
        23. Connect module should use log4j2.properties to configure test log output as other modules (Sub-task, Resolved, Yang Jie)
        24. Update sbt-protoc to 1.0.6 (Sub-task, Resolved, Yuming Wang)
        25. Throw exception for Collect() and recommend to use toPandas() (Sub-task, Resolved, Rui Wang)
        26. Avoid embedding Spark Connect in the Apache Spark binary release (Sub-task, Resolved, Hyukjin Kwon)
        27. Run Scala side tests in GitHub Actions (Sub-task, Resolved, Hyukjin Kwon)
        28. Use unittest's asserts instead of built-in assert (Sub-task, Resolved, Hyukjin Kwon)
        29. Shade more dependencies to be able to run separately (Sub-task, Resolved, Hyukjin Kwon)
        30. Avoid hardcoded versions in SBT build (Sub-task, Resolved, Hyukjin Kwon)
        31. mypy complains accessing the variable defined in the class method (Sub-task, Resolved, Rui Wang)
        32. Add groupby to connect DSL and test more than one grouping expression (Sub-task, Resolved, Rui Wang)
        33. Improve SET operation support in the proto and the server (Sub-task, Resolved, Rui Wang)
        34. Support Column Alias in connect DSL (Sub-task, Resolved, Rui Wang)
        35. Replace shaded netty with grpc netty to avoid double shaded dependency (Sub-task, Resolved, Martin Grund)
        36. Add basic support for DataFrameWriter (Sub-task, Resolved, Martin Grund)
        37. StructType should contain a list of StructField and each field should have a name (Sub-task, Resolved, Rui Wang)
        38. Add Sample to proto and DSL (Sub-task, Resolved, Rui Wang)
        39. Add WHERE to Connect proto and DSL (Sub-task, Resolved, Rui Wang)
        40. Check the generated python protos in GitHub Actions (Sub-task, Resolved, Ruifeng Zheng)
        41. Enforce Scalafmt for Spark Connect Module (Sub-task, Resolved, Martin Grund)
        42. Add as(alias: String) to connect DSL (Sub-task, Resolved, Rui Wang)
        43. Add Deduplicate to Connect proto (Sub-task, Resolved, Rui Wang)
        44. Add limit and offset to Connect DSL (Sub-task, Resolved, Rui Wang)
        45. Python: rename LogicalPlan.collect to LogicalPlan.to_proto (Sub-task, Resolved, Rui Wang)
        46. Add Intersect to Connect proto and DSL (Sub-task, Resolved, Unassigned)
        47. Connect Proto should carry unparsed identifiers (Sub-task, Resolved, Rui Wang)
        48. Drop Python test tables before and after unit tests (Sub-task, Resolved, Rui Wang)
        49. AnalyzeResult should use struct for schema (Sub-task, Resolved, Rui Wang)
        50. [Python] Implement `DataFrame.sample` (Sub-task, Resolved, Ruifeng Zheng)
        51. Implement `DataFrame.summary` (Sub-task, Resolved, Ruifeng Zheng)
        52. Change default serialization from 'broken' CSV to Spark DF JSON (Sub-task, Resolved, Martin Grund)
        53. Allow configurable gRPC interceptors for Spark Connect (Sub-task, Resolved, Martin Grund)
        54. Add .agg() to Connect DSL (Sub-task, Resolved, Rui Wang)
        55. pin 'grpcio==1.48.1' 'protobuf==4.21.6' (Sub-task, Resolved, Ruifeng Zheng)
        56. Support Join UsingColumns in proto (Sub-task, Resolved, Rui Wang)
        57. Support Range in Connect proto (Sub-task, Resolved, Rui Wang)
        58. UserContext should be extensible (Sub-task, Resolved, Martin Grund)
        59. Mark internal API to be private[connect] (Sub-task, Resolved, Rui Wang)
        60. Improve `on` in Join in Python client (Sub-task, Resolved, Rui Wang)
        61. Reimplement `frequentItems` with dataframe operations (Sub-task, Resolved, Ruifeng Zheng)
        62. Reimplement `crosstab` with dataframe operations (Sub-task, Resolved, Ruifeng Zheng)
        63. Reimplement `summary` with dataframe operations (Sub-task, Resolved, Ruifeng Zheng)
        64. Add a dedicated logical plan for `Summary` (Sub-task, Resolved, Ruifeng Zheng)
        65. Refactor server side tests to only use DataFrame API (Sub-task, Resolved, Rui Wang)
        66. Support Collect() in Python client (Sub-task, Resolved, Rui Wang)
        67. Reimplement df.stat.{cov, corr} with built-in sql functions (Sub-task, Resolved, Ruifeng Zheng)
        68. Support Alias for every Relation (Sub-task, Resolved, Rui Wang)
        69. Implement `DataFrame.sortWithinPartitions` (Sub-task, Resolved, Ruifeng Zheng)
        70. pyspark-connect tests should be skipped if pandas doesn't exist (Sub-task, Resolved, Dongjoon Hyun)
        71. Add missing `limit(n)` in DataFrame.head (Sub-task, Resolved, Ruifeng Zheng)
        72. Support List[ColumnRef] for Join's on argument (Sub-task, Open, Unassigned)
        73. Import more from connect proto package to avoid calling `proto.` for Connect DSL (Sub-task, Resolved, Rui Wang)
        74. Complete Support for Union in Python client (Sub-task, Resolved, Rui Wang)
        75. Support session.sql in Connect DSL (Sub-task, Resolved, Rui Wang)
        76. Support session.range in Python client (Sub-task, Resolved, Rui Wang)
        77. Improve `session.sql` testing coverage in Python client (Sub-task, Resolved, Rui Wang)
        78. Support toDF(columnNames) in Connect DSL (Sub-task, Resolved, Rui Wang)
        79. Migrate markdown style README to python/docs/development/testing.rst (Sub-task, In Progress, Unassigned)
        80. Developer Documentation for Spark Connect (Sub-task, Resolved, Martin Grund)
        81. Connection string support for Python client (Sub-task, Resolved, Martin Grund)
        82. Compatible `take`, `head` and `first` API in Python client (Sub-task, Resolved, Rui Wang)
        83. Arrow based collect (Sub-task, Resolved, Ruifeng Zheng)
        84. Complete Support for Except and Intersect in Python client (Sub-task, Resolved, Rui Wang)
        85. Support Repartition in Connect DSL (Sub-task, Resolved, Rui Wang)
        86. RemoteSparkSession should only accept one `user_id` (Sub-task, In Progress, Unassigned)
        87. Connect DataFrame should require RemoteSparkSession (Sub-task, Resolved, Rui Wang)
        88. `columns` API should use `schema` API to avoid data fetching (Sub-task, Resolved, Rui Wang)
        89. Support CreateView in Connect DSL (Sub-task, Resolved, Rui Wang)
        90. Support other data type conversion in the DataTypeProtoConverter (Sub-task, Resolved, Unassigned)
        91. Remove unused code in connect (Sub-task, Resolved, Deng Ziming)
        92. Support SelectExpr which applies Projection by expressions in Strings in Connect DSL (Sub-task, Resolved, Rui Wang)
        93. Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab` (Sub-task, Resolved, Ruifeng Zheng)
        94. Implement `DataFrame.freqItems` and `DataFrame.stat.freqItems` (Sub-task, Open, Ruifeng Zheng)
        95. Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy` (Sub-task, Open, Ruifeng Zheng)
        96. Implement `DataFrame.stat.cov` (Sub-task, Open, Ruifeng Zheng)
        97. Implement `DataFrame.stat.corr` (Sub-task, Open, Ruifeng Zheng)
        98. Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile` (Sub-task, Open, Ruifeng Zheng)
        99. Rename `ColumnRef` to `Column` in Python client implementation (Sub-task, Resolved, Rui Wang)
        100. DataFrame `withColumnsRenamed` can be implemented through `RenameColumns` proto (Sub-task, Resolved, Rui Wang)
        101. Merge SparkConnectPlanner and SparkConnectCommandPlanner (Sub-task, Resolved, Rui Wang)
        102. Document how to add a new proto field of messages (Sub-task, Resolved, Rui Wang)
        103. Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset (Sub-task, Resolved, Rui Wang)
        104. Control the max size of arrow batch (Sub-task, Resolved, Ruifeng Zheng)
        105. Implement `DataFrame.sparkSession` in Python client (Sub-task, Resolved, Rui Wang)
        106. Implement `DataFrame.show` (Sub-task, Resolved, Ruifeng Zheng)
        107. Support local data for LocalRelation (Sub-task, Resolved, Deng Ziming)
        108. Add ClientType to proto to indicate which client sends a request (Sub-task, Resolved, Rui Wang)
        109. Input relation can be optional for Project in Connect proto (Sub-task, Resolved, Rui Wang)
        110. Explain API can support different modes (Sub-task, Resolved, Rui Wang)
        111. Implement DataFrame.CreateGlobalView in Python client (Sub-task, Resolved, Rui Wang)
        112. Implement `DataFrame.fillna` and `DataFrame.na.fill` (Sub-task, Resolved, Ruifeng Zheng)
        113. Implement `DataFrame.dropna` and `DataFrame.na.drop` (Sub-task, Open, Ruifeng Zheng)
        114. Show detailed differences in dataframe comparison (Sub-task, Resolved, Ruifeng Zheng)
        115. Update relations.proto to follow Connect Proto development guidance (Sub-task, Resolved, Rui Wang)
        116. Implement `DataFrame.drop` (Sub-task, Resolved, Ruifeng Zheng)
        117. Homogenize the protobuf version across server and client (Sub-task, Resolved, Martin Grund)
        118. Implement `DataFrame.SelectExpr` in Python client (Sub-task, Resolved, Rui Wang)
        119. Dataframe.transform in Python client support (Sub-task, Resolved, Martin Grund)
        120. Migrate custom exceptions to SparkException (Sub-task, In Progress, Apache Spark)
        121. Implement `DataFrame.isEmpty` (Sub-task, Resolved, Ruifeng Zheng)
        122. Implement `DataFrame.__repr__` and `DataFrame.dtypes` (Sub-task, Resolved, Ruifeng Zheng)
        123. protoc-3.21.9-linux-x86_64.exe requires GLIBC_2.14 (Sub-task, Resolved, Haonan Jiang)
        124. Make AnalyzePlan support multiple analysis tasks (Sub-task, Resolved, Ruifeng Zheng)
        125. Unify the typing definitions (Sub-task, Resolved, Ruifeng Zheng)
        126. Disable unsupported functions (Sub-task, Resolved, Martin Grund)
        127. Implement `DataFrame.crossJoin` (Sub-task, In Progress, Unassigned)
        128. Remove `str` from Aggregate expression type (Sub-task, Resolved, Rui Wang)
        129. Support more datatypes (Sub-task, Resolved, Ruifeng Zheng)
        130. Update the protobuf version in README (Sub-task, Resolved, Xinrong Meng)
        131. DataFrame.to_pandas should not return optional pandas dataframe (Sub-task, Resolved, Rui Wang)
        132. RemoteSparkSession should be called SparkSession (Sub-task, Resolved, Martin Grund)
        133. Implement DataFrame.withColumn(s) (Sub-task, In Progress, Unassigned)
        134. Upgrade buf to v1.9.0 (Sub-task, In Progress, Unassigned)
        135. Make Literal support more datatypes (Sub-task, In Progress, Unassigned)
        136. Check and upgrade buf.build/protocolbuffers/plugins/python to 3.19.5 (Sub-task, Open, Unassigned)
        137. Refactor "Column" for API Compatibility (Sub-task, In Progress, Unassigned)
        138. Add Catalog tableExists and namespaceExists in Connect proto (Sub-task, In Progress, Unassigned)


          People

            Assignee: Unassigned
            Reporter: Martin Grund (grundprinzip-db)
            Shepherd: Herman van Hövell

            Dates

              Created:
              Updated:
