[SPARK-39375] SPIP: Spark Connect - A client and server interface for Apache Spark - ASF JIRA

Details

Type: Epic
Status: Reopened
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: Connect
Labels:
- SPIP

Epic Name:
Spark Connect

Description

Please find the full document for discussion here: Spark Connect SPIP. Below, we have included just the introduction.

What are you trying to do?

While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of the limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:

(1) Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, and users therefore rely on external solutions such as the inactive project Apache Livy.

(2) Lack of rich developer experience: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks), or allow for building out the rich developer experience common in modern code editors.

(3) Stability: with the current shared driver architecture, users causing critical exceptions (e.g. OOM) bring the whole cluster down for all users.

(4) Upgradability: the current entangling of platform and client APIs (e.g. first and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions (and with that, hinders new feature adoption).

We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: The decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations.
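The split described above, where the client only records DataFrame operations as an unresolved plan and the server resolves and executes it, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the Spark Connect protocol or API: `ToyDataFrame`, `ToyServer`, the JSON encoding, and the three operators are all hypothetical names invented for this example.

```python
import json


class ToyDataFrame:
    """Hypothetical client-side handle: it holds only a plan, never any data."""

    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def table(name):
        # The table name is unresolved: the client does not know if it exists.
        return ToyDataFrame({"op": "read", "table": name})

    def filter(self, column, minimum):
        # Each transformation wraps the previous plan as its child.
        return ToyDataFrame(
            {"op": "filter", "column": column, "min": minimum, "child": self.plan}
        )

    def select(self, *columns):
        return ToyDataFrame(
            {"op": "project", "columns": list(columns), "child": self.plan}
        )

    def collect(self, server):
        # Only the serialized plan crosses the client/server boundary.
        response = server.execute(json.dumps(self.plan))
        return json.loads(response)


class ToyServer:
    """Hypothetical server: owns the data, resolves and runs incoming plans."""

    def __init__(self, tables):
        self.tables = tables

    def execute(self, request):
        return json.dumps(self._run(json.loads(request)))

    def _run(self, plan):
        op = plan["op"]
        if op == "read":
            # Name resolution happens here, on the server side.
            return self.tables[plan["table"]]
        rows = self._run(plan["child"])
        if op == "filter":
            return [r for r in rows if r[plan["column"]] >= plan["min"]]
        if op == "project":
            return [{c: r[c] for c in plan["columns"]} for r in rows]
        raise ValueError(f"unknown operator: {op}")


server = ToyServer({"people": [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 17}]})
df = ToyDataFrame.table("people").filter("age", 18).select("name")
print(df.collect(server))  # prints [{'name': 'Ann'}]
```

In this sketch the client process never loads data or runs the engine, so a failure while executing a plan stays on the server side of the boundary, and the client depends only on the plan format rather than on server internals.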

Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.

Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.

SPARK-41282	Feature parity: Column API in Spark Connect	REOPENED	Ruifeng Zheng
~~SPARK-41283~~	Feature parity: Functions API in Spark Connect	RESOLVED	Ruifeng Zheng
SPARK-41279	Feature parity: DataFrame API in Spark Connect	OPEN	Ruifeng Zheng
~~SPARK-41281~~	Feature parity: SparkSession API in Spark Connect	RESOLVED	Ruifeng Zheng
SPARK-41284	Feature parity: I/O in Spark Connect	REOPENED	Rui Wang
~~SPARK-41289~~	Feature parity: Catalog API	RESOLVED	Hyukjin Kwon
~~SPARK-41286~~	Build, package and infrastructure for Spark Connect	RESOLVED	Hyukjin Kwon
~~SPARK-40451~~	Type annotations for Spark Connect Python client	RESOLVED	Hyukjin Kwon
SPARK-40452	Developer documentation	OPEN	Unassigned
SPARK-41285	Test basework and improvement of test coverage in Spark Connect	OPEN	Hyukjin Kwon
SPARK-41288	Server-specific improvement, error handling and API	OPEN	Martin Grund
SPARK-41305	Connect Proto Completeness	REOPENED	Rui Wang
SPARK-41531	Debugging and Stability	OPEN	Unassigned
~~SPARK-41625~~	Feature parity: Streaming support	RESOLVED	Ruifeng Zheng
SPARK-41627	Spark Connect Server Development	OPEN	Unassigned
~~SPARK-41642~~	Deduplicate docstrings in Python Spark Connect	RESOLVED	Hyukjin Kwon
~~SPARK-41651~~	Test parity: pyspark.sql.tests.test_dataframe	RESOLVED	Sandeep Singh
~~SPARK-41652~~	Test parity: pyspark.sql.tests.test_functions	RESOLVED	Sandeep Singh
~~SPARK-41661~~	Support for User-defined Functions in Python	RESOLVED	Xinrong Meng
~~SPARK-41653~~	Test parity: enable doctests in Spark Connect	RESOLVED	Sandeep Singh
~~SPARK-41932~~	Bootstrapping Spark Connect	RESOLVED	Hyukjin Kwon
~~SPARK-41997~~	Test parity: pyspark.sql.tests.test_readwriter	RESOLVED	Hyukjin Kwon
~~SPARK-42006~~	Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column	RESOLVED	Hyukjin Kwon
~~SPARK-42018~~	Test parity: pyspark.sql.tests.test_types	RESOLVED	Unassigned
~~SPARK-42156~~	Support client-side retries in Spark Connect Python client	RESOLVED	Martin Grund
~~SPARK-42264~~	Test Parity: pyspark.sql.tests.test_udf and pyspark.sql.tests.pandas.test_pandas_udf	RESOLVED	Xinrong Meng
SPARK-42374	User-facing documentation	OPEN	Haejoon Lee
~~SPARK-42393~~	Support for Pandas/Arrow Functions API	RESOLVED	Xinrong Meng
SPARK-42471	Distributed ML <> spark connect	OPEN	Weichen Xu
~~SPARK-42497~~	Support of pandas API on Spark for Spark Connect	RESOLVED	Unassigned
~~SPARK-42499~~	Support for Runtime SQL configuration	RESOLVED	Takuya Ueshin
SPARK-43289	PySpark UDF supports python package dependencies	OPEN	Weichen Xu
~~SPARK-43612~~	Python: Artifact transfer from Scala/JVM client to Server	RESOLVED	Hyukjin Kwon
~~SPARK-43747~~	Implement the pyfile support in SparkSession.addArtifacts	RESOLVED	Hyukjin Kwon
~~SPARK-43768~~	Implement the archive support in SparkSession.addArtifacts	RESOLVED	Hyukjin Kwon
~~SPARK-43795~~	Remove parameters not used for SparkConnectPlanner	RESOLVED	Jiaan Geng
SPARK-43829	Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset	OPEN	Unassigned
~~SPARK-44135~~	Document Spark Connect only API in PySpark	RESOLVED	Hyukjin Kwon
~~SPARK-44290~~	Session-based files and archives in Spark Connect	RESOLVED	Hyukjin Kwon

Attachments

Issue Links

incorporates

SPARK-42554 Spark Connect Scala Client

Reopened

Sub-Tasks

1.	High-Level design doc for Spark Connect	Resolved	Martin Grund
2.	Initial protobuf definition for Spark Connect API	Resolved	Unassigned
3.	Initial prototype implementation	Resolved	Martin Grund

Issues in epic

SPARK-41282	Feature parity: Column API in Spark Connect	Reopened	Ruifeng Zheng
SPARK-41283	Feature parity: Functions API in Spark Connect	Resolved	Ruifeng Zheng
SPARK-41279	Feature parity: DataFrame API in Spark Connect	Open	Ruifeng Zheng
SPARK-41281	Feature parity: SparkSession API in Spark Connect	Resolved	Ruifeng Zheng
SPARK-41284	Feature parity: I/O in Spark Connect	Reopened	Rui Wang
SPARK-41289	Feature parity: Catalog API	Resolved	Hyukjin Kwon
SPARK-41286	Build, package and infrastructure for Spark Connect	Resolved	Hyukjin Kwon
SPARK-40451	Type annotations for Spark Connect Python client	Resolved	Hyukjin Kwon
SPARK-40452	Developer documentation	Open	Unassigned
SPARK-41285	Test basework and improvement of test coverage in Spark Connect	Open	Hyukjin Kwon
SPARK-41288	Server-specific improvement, error handling and API	Open	Martin Grund
SPARK-41305	Connect Proto Completeness	Reopened	Rui Wang
SPARK-41531	Debugging and Stability	Open	Unassigned
SPARK-41625	Feature parity: Streaming support	Resolved	Ruifeng Zheng
SPARK-41627	Spark Connect Server Development	Open	Unassigned
SPARK-41642	Deduplicate docstrings in Python Spark Connect	Resolved	Hyukjin Kwon
SPARK-41651	Test parity: pyspark.sql.tests.test_dataframe	Resolved	Sandeep Singh
SPARK-41652	Test parity: pyspark.sql.tests.test_functions	Resolved	Sandeep Singh
SPARK-41661	Support for User-defined Functions in Python	Resolved	Xinrong Meng
SPARK-41653	Test parity: enable doctests in Spark Connect	Resolved	Sandeep Singh
SPARK-41811	Implement SparkSession.sql's string formatter	Resolved	Ruifeng Zheng
SPARK-41932	Bootstrapping Spark Connect	Resolved	Hyukjin Kwon
SPARK-41997	Test parity: pyspark.sql.tests.test_readwriter	Resolved	Hyukjin Kwon
SPARK-42006	Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column	Resolved	Hyukjin Kwon
SPARK-42018	Test parity: pyspark.sql.tests.test_types	Resolved	Unassigned
SPARK-42156	Support client-side retries in Spark Connect Python client	Resolved	Martin Grund
SPARK-42264	Test Parity: pyspark.sql.tests.test_udf and pyspark.sql.tests.pandas.test_pandas_udf	Resolved	Xinrong Meng
SPARK-42374	User-facing documentation	Open	Haejoon Lee
SPARK-42393	Support for Pandas/Arrow Functions API	Resolved	Xinrong Meng
SPARK-42471	Distributed ML <> spark connect	Open	Weichen Xu
SPARK-42497	Support of pandas API on Spark for Spark Connect	Resolved	Unassigned
SPARK-42499	Support for Runtime SQL configuration	Resolved	Takuya Ueshin
SPARK-42965	metadata mismatch for StructField when running some tests.	Reopened	Unassigned
SPARK-43289	PySpark UDF supports python package dependencies	Open	Weichen Xu
SPARK-43612	Python: Artifact transfer from Scala/JVM client to Server	Resolved	Hyukjin Kwon
SPARK-43747	Implement the pyfile support in SparkSession.addArtifacts	Resolved	Hyukjin Kwon
SPARK-43768	Implement the archive support in SparkSession.addArtifacts	Resolved	Hyukjin Kwon
SPARK-43795	Remove parameters not used for SparkConnectPlanner	Resolved	Jiaan Geng
SPARK-43829	Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset	Open	Unassigned
SPARK-43877	Fix behavior difference for compare binary functions.	Open	Unassigned
SPARK-44135	Document Spark Connect only API in PySpark	Resolved	Hyukjin Kwon
SPARK-44290	Session-based files and archives in Spark Connect	Resolved	Hyukjin Kwon
SPARK-44694	Add Default & Active SparkSession for Python Client	Resolved	Hyukjin Kwon
SPARK-44982	Mark Spark Connect configurations as static configuration	Resolved	Hyukjin Kwon
SPARK-45486	make add_artifact idempotent	Resolved	Alice Sayutina
SPARK-49673	Increase maxBatchSize for Connect's sqlCommandResult	Resolved	Robert Dillitz

People

Assignee: Martin Grund

Reporter: Martin Grund

Shepherd: Hyukjin Kwon

Votes: 26

Watchers: 72

Dates

Created: 03/Jun/22 17:30

Updated: 16/Sep/24 15:29
