SPARK-39375

SPIP: Spark Connect - A client and server interface for Apache Spark


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Connect

    Description

      Please find the full document for discussion here: Spark Connect SPIP. Below, we have referenced just the introduction.

      What are you trying to do?

      While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of the limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:

      (1) Lack of built-in remote connectivity: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project Apache Livy.

      (2) Lack of a rich developer experience: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks) or allow for building out the rich developer experience common in modern code editors.

      (3) Stability: with the current shared driver architecture, users causing critical exceptions (e.g. OOM) bring the whole cluster down for all users.

      (4) Upgradability: the current entangling of platform and client APIs (e.g. first- and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions (and with that, hinders new feature adoption).


      We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce Spark Connect, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: The decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations. 
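The client/server split the proposal describes can be illustrated with a toy sketch. This is plain Python and not the actual Spark Connect API or wire format; all class and method names below (`Plan`, `RemoteDataFrame`, `ToyServer`) are hypothetical. The point is the shape of the idea: the client only composes an unresolved plan description, and nothing executes until the serialized plan is handed to a server that resolves and runs it.

```python
from dataclasses import dataclass, field
import json

# Toy illustration of the Spark Connect idea (NOT the real API):
# the client builds an unresolved logical plan and defers all
# execution to a server process that resolves and runs it.

@dataclass
class Plan:
    op: str                      # e.g. "read", "filter"
    args: dict = field(default_factory=dict)
    child: "Plan | None" = None

    def to_wire(self) -> str:
        """Serialize the plan tree (stand-in for a protobuf message)."""
        node = {"op": self.op, "args": self.args}
        if self.child:
            node["child"] = json.loads(self.child.to_wire())
        return json.dumps(node)

class RemoteDataFrame:
    """Thin client: composes plans, never touches data itself."""
    def __init__(self, plan: Plan, server):
        self._plan, self._server = plan, server

    def filter(self, column: str, value) -> "RemoteDataFrame":
        # Lazily extend the unresolved plan; no execution happens here.
        return RemoteDataFrame(
            Plan("filter", {"column": column, "value": value}, self._plan),
            self._server)

    def collect(self):
        # Only now does anything cross the client/server boundary.
        return self._server.execute(self._plan.to_wire())

class ToyServer:
    """Server side: resolves the wire-format plan against real data."""
    def __init__(self, tables):
        self._tables = tables

    def execute(self, wire: str):
        return self._run(json.loads(wire))

    def _run(self, node):
        if node["op"] == "read":
            return list(self._tables[node["args"]["table"]])
        if node["op"] == "filter":
            rows = self._run(node["child"])
            a = node["args"]
            return [r for r in rows if r[a["column"]] == a["value"]]
        raise ValueError(f"unknown op: {node['op']}")

server = ToyServer({"users": [{"name": "ada", "active": True},
                              {"name": "bob", "active": False}]})
df = RemoteDataFrame(Plan("read", {"table": "users"}), server)
active = df.filter("active", True).collect()
print(active)  # [{'name': 'ada', 'active': True}]
```

In the actual proposal the serialized plan is a protocol buffer and the transport is gRPC (see the sub-tasks below); the JSON encoding here only stands in for that. The design consequence is the one the SPIP calls out: a client crash or OOM stays in the client process, and the wire protocol, not a shared classpath, is the compatibility boundary between client and server versions.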


      Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.


      Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.

      Sub-Tasks

        1. High-Level design doc for Spark Connect (Sub-task, Open, Unassigned)
        2. Initial protobuf definition for Spark Connect API (Sub-task, Open, Unassigned)
        3. Prototype implementation (Sub-task, Resolved, Martin Grund)
        4. Extend test coverage of Analyzer (Sub-task, Open, Unassigned)
        5. Extend test coverage for Spark Connect Python client (Sub-task, Open, Unassigned)
        6. Type annotations for Spark Connect Python client (Sub-task, Open, Unassigned)
        7. Developer documentation (Sub-task, Open, Unassigned)
        8. Improve error handling for gRPC server (Sub-task, Open, Unassigned)
        9. Initial DSL framework for protobuf testing (Sub-task, Resolved, Rui Wang)
        10. Add `CONNECT` label (Sub-task, Resolved, Hyukjin Kwon)
        11. Python version for UDF should follow the server's version (Sub-task, Open, Unassigned)
        12. Extend type support for Spark Connect literals (Sub-task, Resolved, Martin Grund)
        13. Extend support for Join Relation (Sub-task, Resolved, Rui Wang)
        14. Make Spark Connect port configurable (Sub-task, Resolved, Martin Grund)
        15. Re-enable mypy support (Sub-task, Resolved, Martin Grund)
        16. Add missing PySpark functions to Spark Connect (Sub-task, Resolved, Martin Grund)
        17. PySpark readwriter API parity for Spark Connect (Sub-task, Resolved, Rui Wang)
        18. Homogenize built-in function names for binary operators like + - / (Sub-task, Open, Unassigned)
        19. Re-generate Spark Connect Python protos (Sub-task, Resolved, Martin Grund)
        20. Decouple plan transformation and validation on server side (Sub-task, Open, Unassigned)
        21. SELECT * shouldn't be empty project list in proto (Sub-task, Resolved, Rui Wang)
        22. protoc-3.21.1-linux-x86_64.exe requires GLIBC_2.14 (Sub-task, Resolved, Yang Jie)
        23. Connect module should use log4j2.properties to configure test log output as other modules (Sub-task, Resolved, Yang Jie)
        24. Update sbt-protoc to 1.0.6 (Sub-task, Resolved, Yuming Wang)
        25. Throw exception for Collect() and recommend to use toPandas() (Sub-task, Resolved, Rui Wang)
        26. Avoid embedding Spark Connect in the Apache Spark binary release (Sub-task, Resolved, Hyukjin Kwon)
        27. Run Scala side tests in GitHub Actions (Sub-task, Resolved, Hyukjin Kwon)
        28. Use unittest's asserts instead of built-in assert (Sub-task, Resolved, Hyukjin Kwon)
        29. Shade more dependencies to be able to run separately (Sub-task, Resolved, Hyukjin Kwon)
        30. Avoid hardcoded versions in SBT build (Sub-task, Resolved, Hyukjin Kwon)
        31. mypy complains accessing the variable defined in the class method (Sub-task, Resolved, Rui Wang)
        32. Add groupby to connect DSL and test more than one grouping expression (Sub-task, Resolved, Rui Wang)
        33. Improve SET operation support in the proto and the server (Sub-task, Resolved, Rui Wang)
        34. Support Column Alias in connect DSL (Sub-task, Resolved, Rui Wang)
        35. Replace shaded netty with grpc netty to avoid double shaded dependency (Sub-task, Resolved, Martin Grund)
        36. Add basic support for DataFrameWriter (Sub-task, Resolved, Martin Grund)
        37. StructType should contain a list of StructField and each field should have a name (Sub-task, Resolved, Rui Wang)
        38. Add Sample to proto and DSL (Sub-task, Resolved, Rui Wang)
        39. Add WHERE to Connect proto and DSL (Sub-task, Resolved, Rui Wang)
        40. Check the generated python protos in GitHub Actions (Sub-task, Resolved, Ruifeng Zheng)
        41. Enforce Scalafmt for Spark Connect Module (Sub-task, Resolved, Martin Grund)
        42. Add as(alias: String) to connect DSL (Sub-task, Resolved, Rui Wang)
        43. Add Deduplicate to Connect proto (Sub-task, Resolved, Rui Wang)
        44. Add limit and offset to Connect DSL (Sub-task, Resolved, Rui Wang)
        45. Python: rename LogicalPlan.collect to LogicalPlan.to_proto (Sub-task, Resolved, Rui Wang)
        46. Add Intersect to Connect proto and DSL (Sub-task, Resolved, Unassigned)
        47. Connect Proto should carry unparsed identifiers (Sub-task, Resolved, Rui Wang)
        48. Drop Python test tables before and after unit tests (Sub-task, Resolved, Rui Wang)
        49. AnalyzeResult should use struct for schema (Sub-task, Resolved, Rui Wang)
        50. [Python] Implement `DataFrame.sample` (Sub-task, Resolved, Ruifeng Zheng)
        51. Implement `DataFrame.summary` (Sub-task, Resolved, Ruifeng Zheng)
        52. Change default serialization from 'broken' CSV to Spark DF JSON (Sub-task, Resolved, Martin Grund)
        53. Allow configurable gRPC interceptors for Spark Connect (Sub-task, Resolved, Martin Grund)
        54. Add .agg() to Connect DSL (Sub-task, Resolved, Rui Wang)
        55. pin 'grpcio==1.48.1' 'protobuf==4.21.6' (Sub-task, Resolved, Ruifeng Zheng)
        56. Support Join UsingColumns in proto (Sub-task, Resolved, Rui Wang)
        57. Support Range in Connect proto (Sub-task, Resolved, Rui Wang)
        58. UserContext should be extensible (Sub-task, Resolved, Martin Grund)
        59. Mark internal API to be private[connect] (Sub-task, Resolved, Rui Wang)
        60. Improve `on` in Join in Python client (Sub-task, Resolved, Rui Wang)
        61. Reimplement `frequentItems` with dataframe operations (Sub-task, Resolved, Ruifeng Zheng)
        62. Reimplement `crosstab` with dataframe operations (Sub-task, Resolved, Ruifeng Zheng)
        63. Reimplement `summary` with dataframe operations (Sub-task, Resolved, Ruifeng Zheng)
        64. Add a dedicated logical plan for `Summary` (Sub-task, Resolved, Ruifeng Zheng)
        65. Refactor server side tests to only use DataFrame API (Sub-task, Resolved, Rui Wang)
        66. Support Collect() in Python client (Sub-task, Resolved, Rui Wang)
        67. Reimplement df.stat.{cov, corr} with built-in sql functions (Sub-task, Resolved, Ruifeng Zheng)
        68. Support Alias for every Relation (Sub-task, Resolved, Rui Wang)
        69. Implement `DataFrame.sortWithinPartitions` (Sub-task, Resolved, Ruifeng Zheng)
        70. pyspark-connect tests should be skipped if pandas doesn't exist (Sub-task, Resolved, Dongjoon Hyun)
        71. Add missing `limit(n)` in DataFrame.head (Sub-task, Resolved, Ruifeng Zheng)
        72. Support List[ColumnRef] for Join's on argument (Sub-task, Open, Unassigned)
        73. Import more from connect proto package to avoid calling `proto.` for Connect DSL (Sub-task, Resolved, Rui Wang)
        74. Complete Support for Union in Python client (Sub-task, Resolved, Rui Wang)
        75. Support session.sql in Connect DSL (Sub-task, Resolved, Rui Wang)
        76. Support session.range in Python client (Sub-task, Resolved, Rui Wang)
        77. Improve `session.sql` testing coverage in Python client (Sub-task, Resolved, Rui Wang)
        78. Support toDF(columnNames) in Connect DSL (Sub-task, Resolved, Rui Wang)
        79. Migrate markdown style README to python/docs/development/testing.rst (Sub-task, In Progress, Unassigned)
        80. Developer Documentation for Spark Connect (Sub-task, Resolved, Martin Grund)
        81. Connection string support for Python client (Sub-task, Resolved, Martin Grund)
        82. Compatible `take`, `head` and `first` API in Python client (Sub-task, Resolved, Rui Wang)
        83. Arrow based collect (Sub-task, Resolved, Ruifeng Zheng)
        84. Complete Support for Except and Intersect in Python client (Sub-task, Resolved, Rui Wang)
        85. Support Repartition in Connect DSL (Sub-task, Resolved, Rui Wang)
        86. RemoteSparkSession should only accept one `user_id` (Sub-task, In Progress, Unassigned)
        87. Connect DataFrame should require RemoteSparkSession (Sub-task, Resolved, Rui Wang)
        88. `columns` API should use `schema` API to avoid data fetching (Sub-task, Resolved, Rui Wang)
        89. Support CreateView in Connect DSL (Sub-task, Resolved, Rui Wang)
        90. Support other data type conversion in the DataTypeProtoConverter (Sub-task, Resolved, Unassigned)
        91. Remove unused code in connect (Sub-task, Resolved, Deng Ziming)
        92. Support SelectExpr which applies Projection by expressions in Strings in Connect DSL (Sub-task, Resolved, Rui Wang)
        93. Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab` (Sub-task, Resolved, Ruifeng Zheng)
        94. Implement `DataFrame.freqItems` and `DataFrame.stat.freqItems` (Sub-task, Open, Ruifeng Zheng)
        95. Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy` (Sub-task, Open, Ruifeng Zheng)
        96. Implement `DataFrame.stat.cov` (Sub-task, Open, Ruifeng Zheng)
        97. Implement `DataFrame.stat.corr` (Sub-task, Open, Ruifeng Zheng)
        98. Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile` (Sub-task, Open, Ruifeng Zheng)
        99. Rename `ColumnRef` to `Column` in Python client implementation (Sub-task, Resolved, Rui Wang)
        100. DataFrame `withColumnsRenamed` can be implemented through `RenameColumns` proto (Sub-task, Resolved, Rui Wang)
        101. Merge SparkConnectPlanner and SparkConnectCommandPlanner (Sub-task, Resolved, Rui Wang)
        102. Document how to add a new proto field of messages (Sub-task, Resolved, Rui Wang)
        103. Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset (Sub-task, Resolved, Rui Wang)
        104. Control the max size of arrow batch (Sub-task, Resolved, Ruifeng Zheng)
        105. Implement `DataFrame.sparkSession` in Python client (Sub-task, Resolved, Rui Wang)
        106. Implement `DataFrame.show` (Sub-task, Resolved, Ruifeng Zheng)
        107. Support local data for LocalRelation (Sub-task, Resolved, Deng Ziming)
        108. Add ClientType to proto to indicate which client sends a request (Sub-task, Resolved, Rui Wang)
        109. Input relation can be optional for Project in Connect proto (Sub-task, Resolved, Rui Wang)
        110. Explain API can support different modes (Sub-task, Resolved, Rui Wang)
        111. Implement DataFrame.CreateGlobalView in Python client (Sub-task, Resolved, Rui Wang)
        112. Implement `DataFrame.fillna` and `DataFrame.na.fill` (Sub-task, Resolved, Ruifeng Zheng)
        113. Implement `DataFrame.dropna` and `DataFrame.na.drop` (Sub-task, Open, Ruifeng Zheng)
        114. Show detailed differences in dataframe comparison (Sub-task, Resolved, Ruifeng Zheng)
        115. Update relations.proto to follow Connect Proto development guidance (Sub-task, Resolved, Rui Wang)
        116. Implement `DataFrame.drop` (Sub-task, Resolved, Ruifeng Zheng)
        117. Homogenize the protobuf version across server and client (Sub-task, Resolved, Martin Grund)
        118. Implement `DataFrame.SelectExpr` in Python client (Sub-task, Resolved, Rui Wang)
        119. Dataframe.transform in Python client support (Sub-task, Resolved, Martin Grund)
        120. Migrate custom exceptions to SparkException (Sub-task, In Progress, Apache Spark)
        121. Implement `DataFrame.isEmpty` (Sub-task, Resolved, Ruifeng Zheng)
        122. Implement `DataFrame.__repr__` and `DataFrame.dtypes` (Sub-task, Resolved, Ruifeng Zheng)
        123. protoc-3.21.9-linux-x86_64.exe requires GLIBC_2.14 (Sub-task, Resolved, Haonan Jiang)
        124. Make AnalyzePlan support multiple analysis tasks (Sub-task, Resolved, Ruifeng Zheng)
        125. Unify the typing definitions (Sub-task, Resolved, Ruifeng Zheng)
        126. Disable unsupported functions (Sub-task, Resolved, Martin Grund)
        127. Implement `DataFrame.crossJoin` (Sub-task, In Progress, Unassigned)
        128. Remove `str` from Aggregate expression type (Sub-task, Resolved, Rui Wang)
        129. Support more datatypes (Sub-task, Resolved, Ruifeng Zheng)
        130. Update the protobuf version in README (Sub-task, Resolved, Xinrong Meng)
        131. DataFrame.to_pandas should not return optional pandas dataframe (Sub-task, Resolved, Rui Wang)
        132. RemoteSparkSession should be called SparkSession (Sub-task, Resolved, Martin Grund)
        133. Implement DataFrame.withColumn(s) (Sub-task, In Progress, Unassigned)
        134. Upgrade buf to v1.9.0 (Sub-task, In Progress, Unassigned)
        135. Make Literal support more datatypes (Sub-task, In Progress, Unassigned)
        136. Check and upgrade buf.build/protocolbuffers/plugins/python to 3.19.5 (Sub-task, Open, Unassigned)
        137. Refactor "Column" for API Compatibility (Sub-task, In Progress, Unassigned)
        138. Add Catalog tableExists and namespaceExists in Connect proto (Sub-task, In Progress, Unassigned)


          People

            Assignee: Unassigned
            Reporter: Martin Grund (grundprinzip-db)
            Shepherd: Herman van Hövell

            Dates

              Created:
              Updated:
