Spark / SPARK-39375

SPIP: Spark Connect - A client and server interface for Apache Spark



    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Target Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Connect


      Please find the full document for discussion here: Spark Connect SPIP. Below, we reference only the introduction.

      What are you trying to do?

      While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of these limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:

      (1) Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project Apache Livy.

      (2) Lack of a rich developer experience: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks), nor do they allow for building out the rich developer experience common in modern code editors.

      (3) Stability: with the current shared driver architecture, a user causing a critical exception (e.g. OOM) brings the whole cluster down for all users.

      (4) Upgradability: the current entangling of platform and client APIs (e.g. first- and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions and, with that, hinders new feature adoption.


      We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, the client is decoupled from the server, allowing for built-in remote connectivity: the decoupled client SDK can be used to run interactive data exploration and connects to the server for DataFrame operations.
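The mechanics can be illustrated with a small, self-contained sketch. This is not the real Spark Connect protobuf schema or client API; plain Python dataclasses stand in for protobuf messages, and the `read`/`where`/`select` helpers are hypothetical. The point it demonstrates is the core idea of the proposal: the client performs no computation itself, it only composes an unresolved logical plan describing the DataFrame operations and serializes it for the server, which resolves, optimizes, and executes it.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch only: in Spark Connect, plans are protobuf
# messages sent over gRPC; here we use plain dataclasses instead.

@dataclass
class Relation:
    op: str                                   # e.g. "read", "filter", "project"
    args: dict = field(default_factory=dict)  # operator parameters
    child: Optional["Relation"] = None        # input relation, if any

def read(table: str) -> Relation:
    return Relation("read", {"table": table})

def where(rel: Relation, condition: str) -> Relation:
    return Relation("filter", {"condition": condition}, child=rel)

def select(rel: Relation, *columns: str) -> Relation:
    return Relation("project", {"columns": list(columns)}, child=rel)

def serialize(rel: Relation) -> list:
    """Flatten the plan from the leaf upward, as a client might wire-encode it."""
    steps = []
    node = rel
    while node is not None:
        steps.append({"op": node.op, **node.args})
        node = node.child
    return list(reversed(steps))

# Building the plan executes nothing locally; name and type resolution,
# optimization, and execution would all happen on the server.
plan = select(where(read("events"), "ts > '2022-01-01'"), "user_id", "ts")
print(serialize(plan))
```

Because the wire format carries only unresolved operations, the client needs no Spark JARs on its classpath, which is what enables thin clients in (almost) any programming language.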


      Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.


      Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.


      Sub-tasks:
        • High-Level design doc for Spark Connect (Open, Unassigned)
        • Initial protobuf definition for Spark Connect API (Open, Unassigned)
        • Prototype implementation (Resolved, Martin Grund)
        • Extend test coverage of Analyzer (Open, Unassigned)
        • Extend test coverage for Spark Connect Python client (Open, Unassigned)
        • Type annotations for Spark Connect Python client (Open, Unassigned)
        • Developer documentation (Open, Unassigned)
        • Improve error handling for gRPC server (Open, Unassigned)
        • Initial DSL framework for protobuf testing (Resolved, Rui Wang)
        • Add `CONNECT` label (Resolved, Hyukjin Kwon)
        • Python version for UDF should follow the server's version (Open, Unassigned)
        • Extend type support for Spark Connect literals (Open, Unassigned)
        • Extend support for Join Relation (Open, Unassigned)
        • Make Spark Connect port configurable (Resolved, Martin Grund)
        • Re-enable mypy support (Resolved, Martin Grund)
        • Add missing PySpark functions to Spark Connect (Open, Unassigned)
        • PySpark readwriter API parity for Spark Connect (In Progress, Unassigned)
        • Homogenize built-in function names for binary operators like + - / (Open, Unassigned)
        • Re-generate Spark Connect Python protos (Resolved, Martin Grund)
        • Decouple plan transformation and validation on server side (Open, Unassigned)
        • SELECT * shouldn't be an empty project list in proto (Resolved, Rui Wang)
        • protoc-3.21.1-linux-x86_64.exe requires GLIBC_2.14 (Open, Unassigned)
        • Connect module should use log4j2.properties to configure test log output as other modules do (Resolved, Yang Jie)
        • Update sbt-protoc to 1.0.6 (Resolved, Yuming Wang)
        • Throw exception for collect() and recommend to use toPandas() (Resolved, Rui Wang)
        • Avoid embedding Spark Connect in the Apache Spark binary release (Resolved, Hyukjin Kwon)
        • Run Scala side tests in GitHub Actions (Resolved, Hyukjin Kwon)
        • Use unittest's asserts instead of built-in assert (Resolved, Hyukjin Kwon)
        • Shade more dependencies to be able to run separately (Resolved, Hyukjin Kwon)
        • Avoid hardcoded versions in SBT build (Resolved, Hyukjin Kwon)
        • mypy complains about accessing a variable defined in a class method (Resolved, Rui Wang)




            Assignee: Unassigned
            Reporter: Martin Grund (grundprinzip-db)
            Shepherd: Herman van Hövell



