Spark / SPARK-47540

SPIP: Pure Python Package (Spark Connect)


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Connect, PySpark
    • Labels: None

    Description

      Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

      As part of the Spark Connect development, we have introduced Scala and Python clients. While the Scala client is already provided as a separate library and is available in Maven, the Python client is not. This proposal aims to let end users install the pure Python package for Spark Connect with pip install pyspark-connect.

      The pure Python package contains only Python source code without jars, which reduces the size of the package significantly and widens the use cases of PySpark. See also "Introducing Spark Connect - The Power of Apache Spark, Everywhere".

      Q2. What problem is this proposal NOT designed to solve?

      This proposal does not aim to:

      • Change the existing PySpark package, e.g., pip install pyspark is not affected
      • Implement full compatibility with classic PySpark, e.g., implementing the RDD API
      • Address how to launch the Spark Connect server. The Spark Connect server is launched by users themselves
      • Support local mode. Without launching a Spark Connect server, users cannot use this package
      • Change the official release channel. Only PyPI is affected

      Q3. How is it done today, and what are the limits of current practice?

      Currently, we run pip install pyspark, and the package is over 300MB because of the jars it depends on. In addition, PySpark requires additional environment setup, such as a JDK installation.
      This is not suitable when the runtime environment and its resources are limited, for example on edge devices such as smart home devices.
      Requiring a non-Python environment is not Python friendly.

      Q4. What is new in your approach and why do you think it will be successful?

      It provides a pure Python library, which eliminates other environment requirements such as a JDK, reduces resource usage by decoupling the Spark Driver, and reduces the package size.

      Q5. Who cares? If you are successful, what difference will it make?

      Users who want to use Spark in a limited environment, and users who want to decouple their application from the JVM that runs the Spark Driver in order to run Spark as a service. They can simply pip install pyspark-connect, which requires no dependencies other than Python dependencies, just like other Python libraries.
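      As an illustration of what such a user might write after pip install pyspark-connect, here is a minimal sketch. The helper names remote_url and connect are hypothetical and only for illustration. It assumes a Spark Connect server is already running and reachable (this proposal does not cover launching one), and that it listens on port 15002, the usual Spark Connect default.

```python
def remote_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port.

    Hypothetical helper; 15002 is assumed to be the server's listening port.
    """
    return f"sc://{host}:{port}"


def connect(url: str):
    """Open a Spark Connect session against an already-running server.

    The import is done lazily so this module can be loaded even where
    no PySpark package is installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


if __name__ == "__main__":
    # Requires a reachable Spark Connect server; no local JVM is started.
    spark = connect(remote_url())
    spark.range(5).show()
```

      No JDK is needed on the client side in this sketch; all query execution happens on the server that the URL points to.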

      Q6. What are the risks?

      Because we do not change the existing PySpark package, I do not see any major risk in classic PySpark itself. We will reuse the same Python source, and therefore we should make sure that no Py4J is used and no JVM access is made in the code paths used by the pure Python package. This requirement might confuse developers. At the very least, we should add a dedicated CI job to make sure the pure Python package works.
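      One way a CI job could guard against accidental Py4J usage is a static scan of the Python sources. The following is only a sketch of that idea, not the check the project actually uses. The function name find_forbidden_imports and the FORBIDDEN_ROOTS set are hypothetical. It flags every import of the listed packages, including lazy imports inside functions, so a real check would likely need an allowlist for code paths that are only used by classic PySpark.

```python
import ast

# Assumption: packages that must not be imported by the pure Python package.
FORBIDDEN_ROOTS = frozenset({"py4j"})


def find_forbidden_imports(source: str, forbidden=FORBIDDEN_ROOTS):
    """Return sorted (line number, module name) pairs for forbidden imports."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b, c" yields one alias per imported module.
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Absolute "from a.b import c"; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in forbidden:
                hits.append((node.lineno, name))
    return sorted(hits)


if __name__ == "__main__":
    sample = "import os\nfrom py4j.protocol import Py4JError\n"
    for lineno, module in find_forbidden_imports(sample):
        print(f"line {lineno}: forbidden import of {module}")
```

      A static scan like this complements, rather than replaces, running the test suite against an installed package with no JVM available.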

      Q7. How long will it take?

      I expect around one month, including CI setup. In fact, the prototype is ready, so I expect this to be done sooner.

      Q8. What are the mid-term and final “exams” to check for success?

      The mid-term goal is to set up a scheduled CI job that builds the pure Python library and runs all the tests against it.
      The final goal would be to properly test the end-to-end use case, starting from pip installation.

        Sub-Tasks

        1. Separate pure Python packaging (Sub-task, Resolved, Hyukjin Kwon)
        2. Add an environment variable for testing remote pure Python library (Sub-task, Resolved, Hyukjin Kwon)
        3. Set up the CI for pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
        4. Make SparkConf to root level to for both SparkSession and SparkContext (Sub-task, Resolved, Hyukjin Kwon)
        5. Get the proper default port for pyspark-connect testcases (Sub-task, Resolved, Hyukjin Kwon)
        6. Make pyspark.testing.connectutils compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        7. Make pyspark.worker_utils compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        8. Make pyspark.pandas compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        9. Make pyspark.testing compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        10. Reeanble UDFProfilerParityTests for pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        11. Reeanble MemoryProfilerParityTests for pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        12. Make pyspark.ml compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        13. Make pyspark.ml.connect tests running without optional dependencies (Sub-task, Resolved, Hyukjin Kwon)
        14. Run ML tests for pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
        15. Run Pandas API on Spark for pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
        16. Change release script to release pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        17. Document pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
        18. Reeanble Avro function doctests (Sub-task, Resolved, Hyukjin Kwon)
        19. Reeanble Protobuf function doctests (Sub-task, Resolved, Hyukjin Kwon)
        20. Reeanble ResourceProfileTests for pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        21. Make pyspark.resource compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
        22. Hide SQLContext and HiveContext (Sub-task, Resolved, Hyukjin Kwon)

          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)