Spark / SPARK-39375

SPIP: Spark Connect - A client and server interface for Apache Spark


Details

    • Type: Epic
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Connect
    • Epic Name: Spark Connect

    Description

      Please find the full document for discussion here: Spark Connect SPIP. Below, we have referenced just the introduction.

      What are you trying to do?

      While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of these limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:

      1. Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project Apache Livy.
      2. Lack of rich developer experience: the current architecture and APIs do not cater to interactive data exploration (as done with notebooks), nor do they allow for building out the rich developer experience common in modern code editors.
      3. Stability: with the current shared driver architecture, a user causing a critical exception (e.g. OOM) brings the whole cluster down for all users.
      4. Upgradability: the current entangling of platform and client APIs (e.g. first- and third-party dependencies on the classpath) does not allow for seamless upgrades between Spark versions (and, with that, hinders new feature adoption).

      We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option for the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: the decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations.
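      To illustrate the idea of a client that only composes unresolved logical plans, here is a minimal, self-contained sketch. All names in it (`Relation`, `RemoteDataFrame`, `read_table`) are hypothetical stand-ins, not Spark Connect's actual protobuf messages or client API; in the real system the plan is encoded as protobuf and shipped to the server over gRPC, where it is resolved and executed.

```python
from dataclasses import dataclass, field
from typing import Optional

# Toy stand-in for Spark Connect's plan messages (hypothetical names; the
# real definitions live in the spark/connect proto files).
@dataclass
class Relation:
    op: str                                  # e.g. "read", "filter", "project"
    args: dict = field(default_factory=dict)
    child: Optional["Relation"] = None       # upstream operator, if any

class RemoteDataFrame:
    """Client-side handle: each method only composes a bigger unresolved
    plan; nothing is analyzed or executed on the client."""

    def __init__(self, plan: Relation):
        self._plan = plan

    def filter(self, condition: str) -> "RemoteDataFrame":
        return RemoteDataFrame(Relation("filter", {"condition": condition}, self._plan))

    def select(self, *cols: str) -> "RemoteDataFrame":
        return RemoteDataFrame(Relation("project", {"columns": list(cols)}, self._plan))

    def plan_ops(self) -> list:
        """Flatten the plan bottom-up, in the order the server would see it."""
        node, ops = self._plan, []
        while node is not None:
            ops.append(node.op)
            node = node.child
        return list(reversed(ops))

def read_table(name: str) -> RemoteDataFrame:
    return RemoteDataFrame(Relation("read", {"table": name}))

df = read_table("events").filter("ts > '2022-01-01'").select("user", "ts")
print(df.plan_ops())  # ['read', 'filter', 'project']
```

      Because the handle carries only a plan description, the client needs no Spark classes on its classpath, which is what makes the thin, (almost) versionless SDK possible.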

      Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.

       

      Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.
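      In practice, remote clients address a Spark Connect server through an endpoint string of the form `sc://host:port` (15002 is the default server port). As a small illustration of that scheme only, and not Spark's actual client code, a toy parser:

```python
from urllib.parse import urlparse

def parse_connect_endpoint(endpoint: str):
    """Parse a Spark Connect style endpoint such as 'sc://localhost:15002'.

    Illustrative only: the real PySpark client performs its own validation
    and additionally accepts ';key=value' parameters after the path.
    """
    parsed = urlparse(endpoint)
    if parsed.scheme != "sc":
        raise ValueError(f"expected an sc:// endpoint, got {endpoint!r}")
    host = parsed.hostname or "localhost"
    port = parsed.port or 15002  # default Spark Connect server port
    return host, port

print(parse_connect_endpoint("sc://localhost:15002"))  # ('localhost', 15002)
```

      For reference, PySpark 3.4+ accepts such an endpoint directly via `SparkSession.builder.remote("sc://localhost:15002").getOrCreate()`.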
       

        SPARK-41282 Feature parity: Column API in Spark Connect REOPENED Ruifeng Zheng
        SPARK-41283 Feature parity: Functions API in Spark Connect RESOLVED Ruifeng Zheng
        SPARK-41279 Feature parity: DataFrame API in Spark Connect OPEN Ruifeng Zheng
        SPARK-41281 Feature parity: SparkSession API in Spark Connect OPEN Ruifeng Zheng
        SPARK-41284 Feature parity: I/O in Spark Connect REOPENED Rui Wang
        SPARK-41289 Feature parity: Catalog API RESOLVED Hyukjin Kwon
        SPARK-41286 Build, package and infrastructure for Spark Connect RESOLVED Hyukjin Kwon
        SPARK-40451 Type annotations for Spark Connect Python client RESOLVED Hyukjin Kwon
        SPARK-40452 Developer documentation OPEN Unassigned
        SPARK-41285 Test basework and improvement of test coverage in Spark Connect OPEN Hyukjin Kwon
        SPARK-41288 Server-specific improvement, error handling and API OPEN Martin Grund
        SPARK-41305 Connect Proto Completeness REOPENED Rui Wang
        SPARK-41531 Debugging and Stability OPEN Unassigned
        SPARK-41625 Feature parity: Streaming support OPEN Unassigned
        SPARK-41627 Spark Connect Server Development OPEN Unassigned
        SPARK-41642 Deduplicate docstrings in Python Spark Connect RESOLVED Hyukjin Kwon
        SPARK-41651 Test parity: pyspark.sql.tests.test_dataframe RESOLVED Sandeep Singh
        SPARK-41652 Test parity: pyspark.sql.tests.test_functions RESOLVED Sandeep Singh
        SPARK-41661 Support for User-defined Functions in Python RESOLVED Xinrong Meng
        SPARK-41653 Test parity: enable doctests in Spark Connect RESOLVED Sandeep Singh
        SPARK-41932 Bootstrapping Spark Connect OPEN Hyukjin Kwon
        SPARK-41997 Test parity: pyspark.sql.tests.test_readwriter RESOLVED Unassigned
        SPARK-42006 Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column RESOLVED Hyukjin Kwon
        SPARK-42018 Test parity: pyspark.sql.tests.test_types RESOLVED Unassigned
        SPARK-42156 Support client-side retries in Spark Connect Python client RESOLVED Martin Grund
        SPARK-42264 Test Parity: pyspark.sql.tests.test_udf and pyspark.sql.tests.pandas.test_pandas_udf RESOLVED Xinrong Meng
        SPARK-42374 User-facing documentation OPEN Haejoon Lee
        SPARK-42393 Support for Pandas/Arrow Functions API RESOLVED Xinrong Meng
        SPARK-42471 Distributed ML <> spark connect OPEN Unassigned
        SPARK-42497 Support of pandas API on Spark for Spark Connect IN PROGRESS Unassigned
        SPARK-42499 Support for Runtime SQL configuration RESOLVED Takuya Ueshin
        SPARK-43289 PySpark UDF supports python package dependencies OPEN Weichen Xu
        SPARK-43612 Python: Artifact transfer from Scala/JVM client to Server RESOLVED Hyukjin Kwon
        SPARK-43747 Implement the pyfile support in SparkSession.addArtifacts RESOLVED Hyukjin Kwon
        SPARK-43768 Implement the archive support in SparkSession.addArtifacts RESOLVED Hyukjin Kwon
        SPARK-43795 Remove parameters not used for SparkConnectPlanner RESOLVED jiaan.geng
        SPARK-43829 Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset OPEN Unassigned
        SPARK-44135 Document Spark Connect only API in PySpark RESOLVED Hyukjin Kwon
        SPARK-44290 Session-based files and archives in Spark Connect RESOLVED Hyukjin Kwon


          People

            • Assignee: Martin Grund (grundprinzip-db)
            • Reporter: Martin Grund (grundprinzip-db)
            • Hyukjin Kwon
