Description
Please find the full document for discussion here: Spark Connect SPIP. Below, we reference only the introduction.
What are you trying to do?
While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of the limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:
1. Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project Apache Livy.
2. Lack of a rich developer experience: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks), nor do they allow for building out the rich developer experience common in modern code editors.
3. Stability: with the current shared driver architecture, a user causing a critical exception (e.g. an OOM) brings the whole cluster down for all users.
4. Upgradability: the current entangling of platform and client APIs (e.g. first- and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions and, with that, hinders new feature adoption.
We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: The decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations.
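To make the decoupling concrete, here is a minimal sketch of connecting to a Spark Connect server from the Python client as shipped since Spark 3.4; the endpoint URL is a placeholder for an actual deployment:

    from pyspark.sql import SparkSession

    # Connect to a remote Spark Connect endpoint instead of starting an
    # in-process driver; "sc://localhost:15002" is the default local endpoint.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # DataFrame operations are encoded as unresolved logical plans and sent
    # to the server for analysis and execution; only results come back.
    df = spark.range(10).filter("id % 2 == 0")
    df.show()

Because the client only builds and serializes plans, it never hosts the driver itself, which is what makes the thin-client properties described below possible.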
Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.
Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.
Key | Summary | Status | Assignee
SPARK-41282 | Feature parity: Column API in Spark Connect | Reopened | Ruifeng Zheng
- | Feature parity: Functions API in Spark Connect | Resolved | Ruifeng Zheng
SPARK-41279 | Feature parity: DataFrame API in Spark Connect | Open | Ruifeng Zheng
- | Feature parity: SparkSession API in Spark Connect | Open | Ruifeng Zheng
SPARK-41284 | Feature parity: I/O in Spark Connect | Reopened | Rui Wang
- | Feature parity: Catalog API | Resolved | Hyukjin Kwon
- | Build, package and infrastructure for Spark Connect | Resolved | Hyukjin Kwon
- | Type annotations for Spark Connect Python client | Resolved | Hyukjin Kwon
SPARK-40452 | Developer documentation | Open | Unassigned
SPARK-41285 | Test basework and improvement of test coverage in Spark Connect | Open | Hyukjin Kwon
SPARK-41288 | Server-specific improvement, error handling and API | Open | Martin Grund
SPARK-41305 | Connect Proto Completeness | Reopened | Rui Wang
SPARK-41531 | Debugging and Stability | Open | Unassigned
- | Feature parity: Streaming support | Open | Unassigned
SPARK-41627 | Spark Connect Server Development | Open | Unassigned
- | Deduplicate docstrings in Python Spark Connect | Resolved | Hyukjin Kwon
- | Test parity: pyspark.sql.tests.test_dataframe | Resolved | Sandeep Singh
- | Test parity: pyspark.sql.tests.test_functions | Resolved | Sandeep Singh
- | Support for User-defined Functions in Python | Resolved | Xinrong Meng
- | Test parity: enable doctests in Spark Connect | Resolved | Sandeep Singh
SPARK-41932 | Bootstrapping Spark Connect | Open | Hyukjin Kwon
- | Test parity: pyspark.sql.tests.test_readwriter | Resolved | Unassigned
- | Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column | Resolved | Hyukjin Kwon
- | Test parity: pyspark.sql.tests.test_types | Resolved | Unassigned
- | Support client-side retries in Spark Connect Python client | Resolved | Martin Grund
- | Test parity: pyspark.sql.tests.test_udf and pyspark.sql.tests.pandas.test_pandas_udf | Resolved | Xinrong Meng
SPARK-42374 | User-facing documentation | Open | Haejoon Lee
- | Support for Pandas/Arrow Functions API | Resolved | Xinrong Meng
SPARK-42471 | Distributed ML <> Spark Connect | Open | Unassigned
- | Support of pandas API on Spark for Spark Connect | In Progress | Unassigned
- | Support for Runtime SQL configuration | Resolved | Takuya Ueshin
SPARK-43289 | PySpark UDF supports Python package dependencies | Open | Weichen Xu
- | Python: Artifact transfer from Scala/JVM client to Server | Resolved | Hyukjin Kwon
- | Implement the pyfile support in SparkSession.addArtifacts | Resolved | Hyukjin Kwon
- | Implement the archive support in SparkSession.addArtifacts | Resolved | Hyukjin Kwon
- | Remove parameters not used for SparkConnectPlanner | Resolved | jiaan.geng
SPARK-43829 | Improve SparkConnectPlanner by reusing Datasets and avoiding constructing new Datasets | Open | Unassigned
- | Document Spark Connect-only API in PySpark | Resolved | Hyukjin Kwon
- | Session-based files and archives in Spark Connect | Resolved | Hyukjin Kwon
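Several of the subtasks above (artifact transfer, the pyfile and archive support in SparkSession.addArtifacts, and session-based files and archives) concern shipping client-side dependencies to the server. A minimal sketch of that workflow, assuming a PySpark 3.5+ Spark Connect session; the endpoint URL and file names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # Ship a Python module so that UDFs executed on the server can import it
    # within this session.
    spark.addArtifacts("my_module.py", pyfile=True)

    # Ship an archive (e.g. a packed virtual environment or data files);
    # artifacts are scoped to the session rather than shared globally,
    # matching the session-based artifact subtasks above.
    spark.addArtifacts("deps.zip", archive=True)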
Issue Links
incorporates
SPARK-42554 | Spark Connect Scala Client | Reopened
1. High-Level design doc for Spark Connect | Resolved | Martin Grund
2. Initial protobuf definition for Spark Connect API | Resolved | Unassigned
3. Initial prototype implementation | Resolved | Martin Grund