  Spark / SPARK-47540

SPIP: Pure Python Package (Spark Connect)


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Critical
    • Resolution: Done
    • Affects Version: 4.0.0
    • Fix Version: 4.0.0
    • Components: Connect, PySpark
    • Labels: None

    Description

      Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

      As part of the Spark Connect development, we have introduced Scala and Python clients. While the Scala client is already provided as a separate library available in Maven, the Python client is not. This proposal aims to let end users install a pure Python package for Spark Connect via pip install pyspark-connect.

      The pure Python package contains only Python source code, without jars, which reduces the size of the package significantly and widens the use cases of PySpark. See also "Introducing Spark Connect - The Power of Apache Spark, Everywhere".
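      As a minimal sketch of the intended workflow, assuming a Spark Connect server is already running locally on the default Connect port (SparkSession.builder.remote is the existing Spark Connect entry point; the package name comes from this proposal):

          # pip install pyspark-connect
          from pyspark.sql import SparkSession

          # Connect to an already-running Spark Connect server; no JVM or Py4J
          # is involved on the client side.
          spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
          spark.range(5).show()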

      Q2. What problem is this proposal NOT designed to solve?

      This proposal does not aim to:

      • Change the existing PySpark package; e.g., pip install pyspark is not affected
      • Implement full compatibility with classic PySpark, e.g., implementing the RDD API
      • Address how to launch the Spark Connect server; users launch the server themselves (see the sketch after this list)
      • Support local mode; without launching a Spark Connect server, users cannot use this package
      • Change the official release channels; only PyPI is affected
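      For context, a minimal sketch of how a user might launch the server themselves from a Spark distribution (the paths are illustrative, and depending on the Spark version, start-connect-server.sh may require additional flags such as --packages):

          import os
          import subprocess

          # Launching the Spark Connect server is the user's responsibility and
          # is out of scope for this proposal. SPARK_HOME must point at an
          # installed Spark distribution, which ships the start script under sbin/.
          spark_home = os.environ["SPARK_HOME"]
          subprocess.check_call(
              [os.path.join(spark_home, "sbin", "start-connect-server.sh")]
          )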

      Q3. How is it done today, and what are the limits of current practice?

      Currently, we run pip install pyspark, and the package is over 300MB because of the dependent jars. In addition, PySpark requires extra environment setup, such as a JDK installation.
      This is not suitable when the running environment and resources are limited, for example on edge devices such as smart home devices.
      Requiring a non-Python environment is also not Python-friendly.

      Q4. What is new in your approach and why do you think it will be successful?

      It provides a pure Python library, which eliminates non-Python environment requirements such as the JDK, reduces resource usage by decoupling the client from the Spark Driver, and shrinks the package size.

      Q5. Who cares? If you are successful, what difference will it make?

      Users who want to leverage Spark in limited environments, or who want to decouple the JVM-based Spark Driver from the client to run Spark as a service. They can simply pip install pyspark-connect, which requires no dependencies beyond Python ones, just like other Python libraries.

      Q6. What are the risks?

      Because we do not change the existing PySpark package, I do not see any major risk to classic PySpark itself. We will reuse the same Python source, and therefore we should make sure that no Py4J is used and no JVM access is made. This requirement might confuse developers. At the very least, we should add a dedicated CI job to make sure the pure Python package works, as sketched below.
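      As an illustration of such a CI guard, a hypothetical test (not part of this proposal's text) that imports the Connect client and asserts Py4J was never loaded; it only makes sense when run against the pure Python package:

          import sys

          def test_pure_python_client_does_not_load_py4j():
              # Importing the Connect client must not transitively import Py4J,
              # since the pure Python package ships without a JVM bridge.
              import pyspark.sql.connect.session  # noqa: F401
              assert "py4j" not in sys.modules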

      Q7. How long will it take?

      I expect around one month, including CI setup. In fact, the prototype is ready, so I expect this to be done sooner.

      Q8. What are the mid-term and final “exams” to check for success?

      The mid-term goal is to set up a scheduled CI job that builds the pure Python library and runs all the tests against it.
      The final goal would be to properly test the end-to-end use case, from pip installation onward, as sketched below.
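      A rough sketch of that end-to-end check, assuming a Spark Connect server is already running locally; the venv path and server URL are placeholders:

          import subprocess
          import venv

          env_dir = "/tmp/pyspark-connect-e2e"
          venv.create(env_dir, with_pip=True)

          # Install the pure Python package from PyPI into a clean environment,
          # then run a trivial query against the running Spark Connect server.
          subprocess.check_call([env_dir + "/bin/pip", "install", "pyspark-connect"])
          subprocess.check_call([
              env_dir + "/bin/python", "-c",
              "from pyspark.sql import SparkSession; "
              "spark = SparkSession.builder.remote('sc://localhost:15002').getOrCreate(); "
              "assert spark.range(10).count() == 10",
          ])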

        Issue Links

          1. Separate pure Python packaging (Sub-task, Resolved, Hyukjin Kwon)
          2. Add an environment variable for testing remote pure Python library (Sub-task, Resolved, Hyukjin Kwon)
          3. Set up the CI for pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
          4. Make SparkConf root-level for both SparkSession and SparkContext (Sub-task, Resolved, Hyukjin Kwon)
          5. Get the proper default port for pyspark-connect test cases (Sub-task, Resolved, Hyukjin Kwon)
          6. Make pyspark.testing.connectutils compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          7. Make pyspark.worker_utils compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          8. Make pyspark.pandas compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          9. Make pyspark.testing compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          10. Re-enable UDFProfilerParityTests for pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          11. Re-enable MemoryProfilerParityTests for pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          12. Make pyspark.ml compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          13. Make pyspark.ml.connect tests run without optional dependencies (Sub-task, Resolved, Hyukjin Kwon)
          14. Run ML tests for pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
          15. Run Pandas API on Spark tests for pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
          16. Change release script to release pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          17. Document pyspark-connect package (Sub-task, Resolved, Hyukjin Kwon)
          18. Re-enable Avro function doctests (Sub-task, Resolved, Hyukjin Kwon)
          19. Re-enable Protobuf function doctests (Sub-task, Resolved, Hyukjin Kwon)
          20. Re-enable ResourceProfileTests for pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          21. Make pyspark.resource compatible with pyspark-connect (Sub-task, Resolved, Hyukjin Kwon)
          22. Hide SQLContext and HiveContext (Sub-task, Resolved, Hyukjin Kwon)

          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)